mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

Add compression for the index format #17

Closed mxmlnkn closed 4 months ago

mxmlnkn commented 1 year ago

For files with very large compression ratios, the index might actually grow larger than the compressed file itself, because each seek point stores 32 KiB of uncompressed data (one seek point per chunk, i.e., every 4 MiB of uncompressed data by default). I.e., for compression ratios > 128, the seek point windows alone take up as much space as the compressed file: 32 KiB of window per 4 MiB of uncompressed data gives a break-even ratio of 4 MiB / 32 KiB = 128. However, in these cases, the windows would also compress very well, even without back-references to the previous data. Therefore, compression should be added. There could be:

Other unrelated compression:

All of these optimizations would make the format incompatible with that of indexed_gzip. This is problematic, especially in combination with ratarmount, which currently stores indexes in the indexed_gzip format. However, instead of actually shaving off bytes, they could simply be replaced with zeros so that they compress more easily! Then, ratarmount could compress the whole index file, or even compress it in chunks, and provide a Python file object abstraction to rapidgzip once #10 has been implemented. When implemented correctly, this would keep ratarmount indexes compatible regardless of the backend (indexed_gzip, rapidgzip).
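
A minimal sketch of that ratarmount-side idea, assuming the file-object support from #10 and `export_index` / `import_index` methods that mirror indexed_gzip's `fileobj=` keyword; the exact method signatures here are assumptions, not the implemented API:

```python
import gzip

# Hypothetical sketch: store the (zero-padded, indexed_gzip-compatible) index
# gzip-compressed on disk and hand the (de)compressing file object to whichever
# backend reads or writes it. Assumes index import/export via Python file objects (#10).

def store_compressed_index(backend_file, index_path):
    with gzip.open(index_path, "wb") as compressed_index:
        backend_file.export_index(fileobj=compressed_index)  # assumed signature

def load_compressed_index(backend_file, index_path):
    with gzip.open(index_path, "rb") as compressed_index:
        backend_file.import_index(fileobj=compressed_index)  # assumed signature
```

Because ratarmount would do the (de)compression itself, both backends would still see a plain indexed_gzip-format index.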

Still, for standalone use, compressing each seek point window individually would be useful.
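
For example, per-window compression with raw deflate could look roughly like the following sketch; this assumes raw deflate is chosen as the window compression scheme and compresses every window independently, so each one can still be decompressed in isolation:

```python
import zlib

def compress_window(window: bytes) -> bytes:
    """Compress one 32 KiB seek point window as an independent raw deflate stream."""
    compressor = zlib.compressobj(level=9, wbits=-15)  # wbits=-15: raw deflate, no header
    return compressor.compress(window) + compressor.flush()

def decompress_window(compressed: bytes) -> bytes:
    """Recover the window; no shared dictionary or cross-window state is needed."""
    return zlib.decompress(compressed, wbits=-15)
```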

mxmlnkn commented 1 year ago

Index Format Outline:

 - Magic Bytes: Don't have to be super short; their size is negligible compared to a window anyway.
 - Format Version
 - Flags
   - Offset Format:
      - Continuous: The encoded chunk size is given by the distance to the next chunk's encoded offset. (The decompressed size is derived this way anyway.)
     - Patches: Chunks are stored as (decompressed offset, encoded offset, encoded size, window offset) tuples.
                Note that this means that in general encoded patches could be overlapping.
                For zip, this is used for zip bombs. Therefore it might be forbidden in the future,
                although I don't see why it should. This index format isn't meant for extracting to disk
                anyway, so a decompression bomb like this would only affect runtime, which seems ok to me
                because rapidgzip is meant for multi-gigabyte or even multi-terabyte files anyway.
                It is meant to support, e.g., making all the files inside a zip file accessible as a single
                seekable file object.
     - Continuous without windows: Intended for bzip2 index files
     - Patches without windows:
       -> maybe instead of a weird enum, use flags to enable encoded size and window offset tuple members
       -> same for CRC32 checksum
          -> Use an enum for the checksum type instead: None, CRC32, Adler, SHA?...
          -> Store the size of the checksum member so that unknown checksums can simply be ignored.
             Otherwise, we wouldn't know how long a member in the chunk database is.
      - Flag whether chunks are sorted. Could be helpful to enable bisection search based on the encoded offset,
        for whatever reason. Otherwise, it at least acts as a kind of consistency check.
      - Flag whether windows are sorted? Not sure why that would be wanted.
        To decide whether access would be random or not? Or simply as a consistency check, i.e., error out if they are not actually sorted.
   - Window Compression:
      The windows should either be left uncompressed or be compressed with the same algorithm as the accompanying archive.
      This is for dependency reasons: decoders that only want to use this for LZ4 files should not be required to
      support all possible compression schemes.
     - None
     - Deflate
      - gzip (in contrast to Deflate, this means that the windows are also checksummed)
     - Deflate + Shared Dictionary (?) 
     - LZ4
     - ZSTD
     - bzip2
 - Compressed File Size
 - Decompressed File Size
 - Number of Chunks
 - Chunk offsets / size list in the format according to "Offset Format". Must be sorted by decoded offset.
   All offsets should be unsigned 64-bit.
    The encoded size could probably be 32-bit, but who knows; better be safe and make it 64-bit.
    Each entry should be fixed size in order to theoretically facilitate bisection search instead of
    having to read all entries (see the rough serialization sketch after this outline).
   - (decoded offset, encoded offset, encoded size, window offset, window size?, CRC32 checksum for this chunk!)
   - (decoded offset, encoded offset, encoded size, window offset, window size?, CRC32 checksum for this chunk!)
   - ...
 - Windows: Raw concatenated windows of varying size. In order to get the required window in O(1), offsets into
            this stream are also stored with the chunk offsets. Because of that window offset, these windows
            could be ordered randomly or windows might even be reused by different chunks. This is allowed,
            although I suspect that the usual case would be one window per chunk and windows are in the same
            order as the chunks.
            Note that 
   - Window 1:
     -
     - Size of this window block to facilitate jumping to the next window
       Theoretically, the window offsets above would suffice. Therefore, this is more a 
   - Window 2 
   - ...
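
To make the fixed-size chunk entry and the bisection idea above concrete, here is a rough serialization sketch. This is not an implemented format; the field order, the uniform 64-bit widths, and the per-chunk CRC32 are just the assumptions listed in the outline:

```python
import struct
import zlib

# Purely illustrative: one fixed-size chunk-table entry, little-endian as per
# the general notes below, encoded offset/size in bits, everything else in bytes.
CHUNK_ENTRY = struct.Struct(
    "<"   # little-endian
    "Q"   # decoded (decompressed) offset in bytes
    "Q"   # encoded (compressed) offset in bits
    "Q"   # encoded size in bits
    "Q"   # offset into the concatenated window section in bytes
    "Q"   # window size in bytes (0 for chunks that do not need a window)
    "I"   # CRC32 of the decompressed chunk data
)

def pack_chunk_entry(decoded_offset, encoded_offset_bits, encoded_size_bits,
                     window_offset, window_size, chunk_data):
    return CHUNK_ENTRY.pack(decoded_offset, encoded_offset_bits, encoded_size_bits,
                            window_offset, window_size, zlib.crc32(chunk_data))

def read_chunk_entry(index_buffer, table_start, i):
    # Fixed-size entries give O(1) access to entry i, which is what makes
    # bisection over the sorted decoded offsets possible without reading
    # the whole table.
    return CHUNK_ENTRY.unpack_from(index_buffer, table_start + i * CHUNK_ENTRY.size)
```

The windows themselves would then simply be appended after this table, at the offsets recorded in each entry.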

General Notes:

 - All multi-byte numbers in the format described here are stored with
   the least-significant byte first (at the lower memory address).
 - Encoded/compressed offsets and sizes are always given in bits; decoded/decompressed offsets and sizes are given in bytes.
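
A tiny illustration of these two conventions (just the assumed field encoding, nothing implemented):

```python
# A 64-bit little-endian field holding an encoded offset of 298 bits ...
encoded_offset_bits = int.from_bytes(bytes([0x2A, 0x01, 0, 0, 0, 0, 0, 0]), "little")  # -> 298
# ... which a decoder would split into 37 full bytes plus 2 bits into the next byte.
byte_offset, bit_offset = divmod(encoded_offset_bits, 8)  # -> (37, 2)
```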

mxmlnkn commented 4 months ago

Kinda fixed with gztool index format support. There still might be yet another index format at some point that works across compression formats and fixes some small issues with gztool indexes.