mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

Add optimized file access that also works for non-seekable files #13

Closed mxmlnkn closed 1 month ago

mxmlnkn commented 1 year ago

Some use cases need to create the index while downloading, so something like wget | tee downloaded-file | pragzip --export-index downloaded-file.gzindex.

I think this should be solvable with a new non-seekable FileReader derived class with these two main ideas:

Assuming that https://github.com/mxmlnkn/ratarmount/issues/106 is caused by the non-sequential file access, then this addition might also fix that. This assumes that I could use the non-seekable file reader during index creation. From the outside, I know that I only have to go over the file sequentially, but I would also need a Python interface reflecting that. It might be easier to do the gzip index creation as a separate step or pass. This way, I could also implement a version that does not actually decompress anything but only gathers index seek points, which would limit memory usage even without implementing #2! This idea would be implementable with the refactoring done for #11. It just needs another specialized ChunkData subclass.
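To make the non-seekable reader idea more concrete, here is a minimal Python sketch of how offset-based reads could be served from a stream that can only be read once. This is not the rapidgzip implementation: the class name OneShotStreamBuffer, its methods, and the chunk size are all hypothetical. The assumption is that chunks are pulled from the stream strictly in order, kept in memory, and dropped again once the caller says they are no longer needed.

```python
import io


class OneShotStreamBuffer:
    """Hypothetical sketch: offset-based reads on top of a read-once stream."""

    def __init__(self, stream, chunk_size=4 * 1024 * 1024):
        self._stream = stream
        self._chunk_size = chunk_size
        self._chunks = {}      # chunk index -> bytes, kept until released
        self._next_chunk = 0   # index of the next chunk to pull from the stream
        self._eof = False

    def _read_chunk(self):
        # Pipes may return short reads, so loop until the chunk is full or EOF.
        parts, remaining = [], self._chunk_size
        while remaining > 0:
            data = self._stream.read(remaining)
            if not data:
                self._eof = True
                break
            parts.append(data)
            remaining -= len(data)
        return b"".join(parts)

    def _fill_until(self, chunk_index):
        # Chunks are pulled strictly in order; the stream itself never seeks.
        while not self._eof and self._next_chunk <= chunk_index:
            chunk = self._read_chunk()
            if chunk:
                self._chunks[self._next_chunk] = chunk
                self._next_chunk += 1

    def read_at(self, offset, size):
        result = bytearray()
        while size > 0:
            index, within = divmod(offset, self._chunk_size)
            self._fill_until(index)
            if index < self._next_chunk and index not in self._chunks:
                raise ValueError("requested data was already released")
            chunk = self._chunks.get(index)
            if chunk is None or within >= len(chunk):
                break  # reached the end of the stream
            piece = chunk[within:within + size]
            result += piece
            offset += len(piece)
            size -= len(piece)
        return bytes(result)

    def release_before(self, offset):
        # Drop every chunk that ends at or before the given offset.
        done = [i for i in self._chunks if (i + 1) * self._chunk_size <= offset]
        for index in done:
            del self._chunks[index]
```

With something like this, worker threads could still request arbitrary ranges that lie ahead of the oldest unreleased chunk, while memory stays bounded as long as release_before is called regularly. Thread safety is left out of the sketch.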

mxmlnkn commented 1 year ago

See https://github.com/mxmlnkn/indexed_bzip2/tree/single-pass-reader for an attempted implementation. I think I stopped working on that branch because the performance didn't look very good, although it was still much better than single-threaded gzip. The last commit also has lots of TODOs in the code comments.

mxmlnkn commented 1 year ago

Assuming that https://github.com/mxmlnkn/ratarmount/issues/106 is caused by the non-sequential file access, then this addition might also fix that.

We also need to detect when to use the sequential access pattern and when to use pread because, for SSDs and other storage with fast random access, pread will be faster than sequential reads!
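As a rough illustration of the detection step, the following Python sketch only covers the easy half of the decision: whether pread is possible at all. The function name pick_read_strategy and the returned strings are hypothetical. It assumes a Unix-like system and does not attempt to tell SSDs from spinning disks, which is the harder part of the problem described above.

```python
import os
import stat


def pick_read_strategy(fd):
    """Hypothetical heuristic: use pread for regular files, else read sequentially."""
    mode = os.fstat(fd).st_mode
    if stat.S_ISREG(mode):
        # Regular files support positional reads, e.g. os.pread(fd, size, offset).
        return "pread"
    # Pipes, sockets, and terminals cannot seek, so only one sequential pass works.
    return "sequential"
```

For a file descriptor opened on a regular file this returns "pread", and for the read end of a pipe it returns "sequential". Choosing between the two for seekable files on different storage types would need extra information that this sketch does not collect.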

mxmlnkn commented 9 months ago

Partly fixed with https://github.com/mxmlnkn/indexed_bzip2/commit/9502781a9d2fa8b265fbe0a3b42177a1d5406836, but automatic detection is not yet implemented. Instead, I'm currently aiming for a --sequential or similar option. Maybe even make pread the non-default case because sequential access is probably faster even on SSDs. pread is probably only necessary to reach peak speeds for files on /dev/shm. Note that this does not affect usage via Python because the SinglePassReader is only to be used when seeking is not required. This means that https://github.com/mxmlnkn/ratarmount/issues/106 requires a different fix altogether, maybe using mmap with sequential prefetching.
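The mmap idea could look roughly like the following Python sketch. It is only an illustration of the technique, not code from rapidgzip or ratarmount, and the helper name open_sequential_mmap is hypothetical. It assumes Python 3.8 or newer on a platform that defines MADV_SEQUENTIAL, and it silently skips the hint elsewhere.

```python
import mmap


def open_sequential_mmap(path):
    """Hypothetical helper: map a file read-only and hint sequential access."""
    with open(path, "rb") as file:
        # Length 0 maps the whole file; this raises ValueError for empty files.
        mapped = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
    # madvise and MADV_SEQUENTIAL only exist on some platforms and Python versions.
    if hasattr(mapped, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
        # Ask the kernel to read ahead aggressively and drop pages behind us.
        mapped.madvise(mmap.MADV_SEQUENTIAL)
    return mapped
```

The returned object can still be sliced at arbitrary offsets, so seeking keeps working, while the hint only tells the kernel which access pattern to optimize for. Whether this actually helps for the case in the linked ratarmount issue would need to be measured.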