Closed mxmlnkn closed 1 month ago
See https://github.com/mxmlnkn/indexed_bzip2/tree/single-pass-reader for an attempted implementation. I think I stopped working on that branch because the performance didn't look very good, although it was still much better than single-threaded gzip. The last commit also has lots of TODOs in the code comments.
Assuming that https://github.com/mxmlnkn/ratarmount/issues/106 is caused by the non-sequential file access, then this addition might also fix that.
We also need to detect when to use the sequential access pattern and when we need to use `pread`, because, for SSDs and other non-rotational storage, `pread` will be faster than sequential reads!

Whether a device is rotational can be read from `/sys/block/sda/queue/rotational` or `/sys/dev/block/<major>:<minor>/queue/rotational`, where major/minor is something that `stat` returns, e.g., as `Device: fd00h`, or the `st_dev` field in the `fstat` return value. One might also look into the `findmnt` source code.

`pread` should also be preferred when the file is detected to be mostly cached: [1]. Check how `fincore` is used by the identically named command-line tool to determine how much of the file has been paged/cached.

Partly fixed with https://github.com/mxmlnkn/indexed_bzip2/commit/9502781a9d2fa8b265fbe0a3b42177a1d5406836, but automatic detection is not yet implemented. Instead, I'm currently shooting for a `--sequential` or similar option. Maybe even make `pread` the non-default case, because sequential access probably is faster even on SSDs; it probably is only necessary to reach peak speeds for files on `/dev/shm`. Note that this does not affect usage via Python, because the `SinglePassReader` is only to be used when seeking is not required. This means that https://github.com/mxmlnkn/ratarmount/issues/106 requires a different fix altogether, maybe using `mmap` with sequential prefetching.
There are some use cases that want to create the index while downloading, so something like `wget | tee downloaded-file | pragzip --export-index downloaded-file.gzindex`.
I think this should be solvable with a new non-seekable `FileReader` derived class with these two main ideas. Assuming I could use the non-seekable file reader during index creation: from the outside, I know that I only have to go over the file sequentially, but I would also require a Python interface reflecting that. It might be easier to do the gzip index creation as a separate step / pass. This way, I could also implement a version that does not actually decompress anything but only gathers index seek points. That way, memory usage would be limited even without implementing #2! This idea would be implementable with the refactoring done for #11; it just needs another specialized `ChunkData` subclass.
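The non-seekable `FileReader` idea could be sketched, in Python terms, as a forward-only wrapper. This is illustrative only, not pragzip's actual C++ `FileReader` interface: it wraps any readable stream, tracks the offset, emulates forward seeks by discarding bytes, and reports itself as non-seekable so callers are forced into a strictly sequential access pattern.

```python
import io


class ForwardOnlyReader(io.RawIOBase):
    """Hypothetical forward-only reader: sequential reads plus forward
    skips, never a backward seek."""

    def __init__(self, raw):
        self._raw = raw
        self._offset = 0

    def readable(self):
        return True

    def seekable(self):
        # Advertising non-seekability is the whole point: consumers such
        # as an index-creation pass must then read strictly forward.
        return False

    def tell(self):
        return self._offset

    def read(self, size=-1):
        data = self._raw.read(size)
        self._offset += len(data)
        return data

    def skip(self, nbytes):
        """Emulate a forward seek by reading and discarding `nbytes`.
        Returns the new offset."""
        remaining = nbytes
        while remaining > 0:
            chunk = self._raw.read(min(remaining, 1 << 20))
            if not chunk:
                break
            remaining -= len(chunk)
        self._offset += nbytes - remaining
        return self._offset
```

An index-only pass would then read chunk boundaries through such a reader without ever buffering decompressed data for random access.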