Add optimized file access that also works for non-seekable files

mxmlnkn commented 1 year ago

There are some use cases that want to create the index while downloading, so something like wget | tee downloaded-file | pragzip --export-index downloaded-file.gzindex.

I think this should be solvable with a new non-seekable FileReader derived class with these two main ideas:

- A working implementation could simply cache the whole file in memory on demand. This way, seeking will always work.
- This would obviously use too much memory for very large files. To remedy that, an interface would be required to mark everything before a given offset as no longer needed.
  - Seeking then keeps working as long as the caller does not try to access parts that it has itself marked as no longer needed! When processing a block, everything before the end of its compressed offset can then be marked for dropping!
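The two ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation: the class name CachingSinglePassReader and the methods release_up_to and cached_size are hypothetical, and a real reader would also need thread safety and a larger chunk size.

```python
import io


class CachingSinglePassReader:
    """Seekable view over a non-seekable stream (hypothetical sketch)."""

    def __init__(self, stream, chunk_size=4096):
        self._stream = stream
        self._chunk_size = chunk_size
        self._buffer = bytearray()  # cached bytes, starting at _buffer_offset
        self._buffer_offset = 0     # file offset of the first cached byte
        self._position = 0          # current read position
        self._eof = False

    def _fill_until(self, end):
        # Read sequentially from the stream until `end` is cached or EOF is hit.
        while not self._eof and self._buffer_offset + len(self._buffer) < end:
            data = self._stream.read(self._chunk_size)
            if data:
                self._buffer.extend(data)
            else:
                self._eof = True

    def seek(self, offset):
        # Seeking works anywhere except in the region the caller released.
        if offset < self._buffer_offset:
            raise ValueError("offset was released and is no longer cached")
        self._position = offset
        return offset

    def read(self, size):
        self._fill_until(self._position + size)
        start = self._position - self._buffer_offset
        data = bytes(self._buffer[start:start + size])
        self._position += len(data)
        return data

    def release_up_to(self, offset):
        # Mark everything before `offset` as no longer needed and free it.
        drop = min(max(offset - self._buffer_offset, 0), len(self._buffer))
        del self._buffer[:drop]
        self._buffer_offset += drop

    def cached_size(self):
        return len(self._buffer)
```

A caller that processes blocks in order would call release_up_to with the end of each finished block's compressed offset, so memory stays bounded by the distance between the oldest still-needed block and the furthest read position.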

Assuming that https://github.com/mxmlnkn/ratarmount/issues/106 is caused by the non-sequential file access, then this addition might also fix that, assuming I could use the non-seekable file reader during index creation. From the outside, I know that I only have to go over the file sequentially, but I would also need a Python interface reflecting that. It might be easier to do the gzip index creation as a separate step / pass. This way, I could also implement a version that does not actually decompress anything but only gathers index seek points. That way, memory usage would be limited even without implementing #2! This idea would be implementable with the refactoring done for #11. It just needs another specialized ChunkData subclass.

mxmlnkn commented 1 year ago

See https://github.com/mxmlnkn/indexed_bzip2/tree/single-pass-reader for an attempted implementation. I think I stopped working on that branch because the performance didn't look very good, although it was still much better than single-threaded gzip. It also has lots of TODOs in the code comments in the last commit.

mxmlnkn commented 1 year ago

Assuming that https://github.com/mxmlnkn/ratarmount/issues/106 is caused by the non-sequential file access, then this addition might also fix that.

We also need to detect when to use the sequential access pattern and when we need to use pread because, for SSDs and others, pread will be faster than sequential reads!

Is file on SSD for Windows: [1] [2]
Is file on SSD for Linux: [1] -> /sys/block/sda/queue/rotational or /sys/dev/block/<major>:<minor>/queue/rotational, where major/minor is something that stat returns, e.g., as Device: fd00h, or the st_dev field in the fstat return value. One might also look into the findmnt source code.
Still use pread when the file is detected to be mostly cached: [1]. Check how fincore is used by the identically named command line tool to check how much of the file has been paged/cached.
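The Linux check mentioned in the list above could look roughly like the following Python sketch. It is a sketch under assumptions, not a tested detection routine: the function name is_rotational and the sysfs_root parameter are hypothetical (the latter exists only to make the function testable), and the fallback to the parent directory assumes that, for partitions, the queue directory belongs to the parent disk.

```python
import os


def is_rotational(path, sysfs_root="/sys"):
    """Return True for rotational media, False otherwise, None if unknown."""
    device = os.stat(path).st_dev
    node = os.path.join(sysfs_root, "dev", "block",
                        f"{os.major(device)}:{os.minor(device)}")
    # Assumption: for a partition, queue/ lives in the parent disk's
    # directory, so resolve the symlink and also try one level up.
    resolved = os.path.realpath(node)
    for candidate in (resolved, os.path.dirname(resolved)):
        flag_path = os.path.join(candidate, "queue", "rotational")
        try:
            with open(flag_path) as flag_file:
                return flag_file.read().strip() == "1"
        except OSError:
            continue
    return None
```

Returning None for the unknown case (network file systems, tmpfs, missing sysfs) leaves the caller free to fall back to a default access pattern instead of guessing.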

mxmlnkn commented 9 months ago

Partly fixed with https://github.com/mxmlnkn/indexed_bzip2/commit/9502781a9d2fa8b265fbe0a3b42177a1d5406836, but automatic detection is not yet implemented. Instead, I'm currently shooting for a --sequential or similar option. Maybe even make pread the non-default case because sequential access probably is faster even on SSDs. pread is probably only necessary to reach peak speeds for files on /dev/shm. Note that this does not affect usage via Python because the SinglePassReader is only to be used when seeking is not required. This means that https://github.com/mxmlnkn/ratarmount/issues/106 requires a different fix altogether, maybe using mmap with sequential prefetching.

mxmlnkn / rapidgzip

Add optimized file access that also works for non-seekable files #13