mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

Add support for zip #23

Open mxmlnkn opened 1 year ago

mxmlnkn commented 1 year ago

This might make https://github.com/mxmlnkn/ratarmount/issues/105 more performant. Implementing it purely on the ratarmount side would require opening one ParallelGzipReader for each compressed file. This can quickly lead to memory exhaustion and thread contention because each ParallelGzipReader would have its own cache and prefetcher.

One fix might be to make it possible to share the ThreadPool and the chunk cache between readers. This feels like it might complicate the implementation and make it harder to reason about, though.

Another idea would be some kind of "native" support for zip by ParallelGzipReader. In theory, it should be possible to provide a StenciledFileReader that cuts out all zip headers, footers, and other metadata, and concatenates all raw deflate streams into one large one. This could actually also be done directly in ratarmount! It might even be sufficiently fast! Maybe that is already the solution.
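To make the stencil idea more concrete, here is a minimal sketch of how the byte ranges of the raw deflate streams could be derived from the zip metadata with Python's standard `zipfile` module. The function name `deflate_stencils` is made up for illustration and is not part of ratarmount or rapidgzip. It assumes unencrypted members and simply skips everything that is not deflate-compressed.

```python
import struct
import zipfile

LOCAL_HEADER_SIZE = 30  # size of the fixed part of a zip local file header
LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"


def deflate_stencils(fileobj):
    """Return (name, data_offset, compressed_size) for each deflate member.

    fileobj must be a seekable binary file object containing a zip archive.
    Hypothetical helper: encrypted members are not handled.
    """
    # The central directory gives us header offsets and compressed sizes.
    with zipfile.ZipFile(fileobj) as archive:
        infos = archive.infolist()

    stencils = []
    for info in infos:
        if info.compress_type != zipfile.ZIP_DEFLATED:
            continue  # stored or otherwise compressed members are skipped

        # The variable-length name and extra fields of the local header
        # may differ from the central directory, so read them from the file.
        fileobj.seek(info.header_offset)
        header = fileobj.read(LOCAL_HEADER_SIZE)
        if header[:4] != LOCAL_HEADER_SIGNATURE:
            raise ValueError(f"Bad local header for {info.filename}")
        name_length, extra_length = struct.unpack("<HH", header[26:30])

        data_offset = (
            info.header_offset + LOCAL_HEADER_SIZE + name_length + extra_length
        )
        stencils.append((info.filename, data_offset, info.compress_size))
    return stencils
```

Each returned range can be checked by seeking to `data_offset`, reading `compressed_size` bytes, and decoding them with `zlib.decompressobj(-zlib.MAX_WBITS)`. Whether a decoder accepts the concatenation of these ranges as one stream is a separate question, because every member ends with its own final deflate block.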

Another idea was to provide these stencils as an imported index, which should be constructible purely from the zip metadata. In this case, one chunk would be one file and the back-reference windows would be empty because each file is compressed independently! However, this would not enable parallel decompression of large members, and many small members might also be problematic, especially as the cache can hold only a fixed number of chunks instead of having a memory limit. Therefore, it would be more or less necessary to split large chunks and join small ones, either during import (which unfortunately would not be on-the-fly) or during first touch. The latter would be very complex to implement, especially as the chunk index database was designed to be created in a streaming manner from lowest to highest offset and then never change. It might be hard to change this assumption error-free at all code locations, even though there are a lot of tests.
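As a rough illustration of the index idea, the sketch below derives one seek point per deflate member from the zip central directory. The `SeekPoint` class and its field names are hypothetical and do not mirror the actual rapidgzip index format. The offsets assume that the deflate streams have been concatenated back to back by a stenciled reader, so they are cumulative sums rather than positions in the original zip file.

```python
import zipfile
from dataclasses import dataclass


@dataclass
class SeekPoint:
    # Hypothetical index entry, not rapidgzip's real data structure.
    compressed_offset_bits: int  # start of the member in the concatenated stream
    uncompressed_offset: int     # start of the member in the decompressed data
    window: bytes                # empty: members do not reference earlier data


def seek_points_from_zip(fileobj):
    """Build one seek point per deflate member purely from zip metadata."""
    with zipfile.ZipFile(fileobj) as archive:
        infos = archive.infolist()

    points = []
    compressed_offset = 0    # in bytes, within the concatenated deflate data
    uncompressed_offset = 0  # in bytes, within the concatenated output
    for info in infos:
        if info.compress_type != zipfile.ZIP_DEFLATED:
            continue  # only deflate members are part of the stenciled stream
        # Member data starts on a byte boundary, so the bit offset is 8 * bytes.
        points.append(SeekPoint(8 * compressed_offset, uncompressed_offset, b""))
        compressed_offset += info.compress_size
        uncompressed_offset += info.file_size
    return points
```

Note that this yields exactly one seek point per member regardless of its size, so it does nothing about very large or very small members.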