retrieve data by byte offset and line position

mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines

Apache License 2.0

344 stars 7 forks source link

retrieve data by byte offset and line position #36

Closed koriege closed 1 month ago

koriege commented 3 months ago

Hi,

I guess you are aware of the https://github.com/circulosmeos/gztool project, which allows to index gz files including line information for random access data retrieval. Since I am using rapidgzip for full decompression and gztool for parallel random access decompression on a regular basis when working with fastq files, I would be very happy to have this combined in one tool. Can you imagine to implement it?

mxmlnkn commented 3 months ago

How exactly are you calling gztool? and, wouldn't ratarmount solve your feature request? It has rapidgzip as a backend and exposes a view to the decompressed file via FUSE. You can then use the usual POSIX APIs or dd to "extract"/copy from an offset. It would also have the advantage that the index stays cached in memory.

koriege commented 3 months ago

To e.g. get just the second line of a file, I do something like echo -e "foo\nbar\nzar" | gzip -c | tee foo.gz | gztool -v 0 -f -i -x -C -I foo.gzi && gztool -v 0 -L 2 -R 1 foo.gz. Ratarmount is my choice, when a tool requires uncompressed data, but in general I don't like to mount and unmount temporary directories all the time. I am also aware of dd capability of skipping bytes, but it is unsuitable if lines are composed of different numbers of bytes (wasn't this also the issue with general applicability of pugz?)

mxmlnkn commented 3 months ago

Ah ok, I was only thinking about byte offset. Lines are another matter altogether. And currently, the index does not store any line indexing information. However, I am working on a compressed and sparse index format and this feature request is in time to add line information as a consideration to the file. Alternatively, supporting gztool indexes would also be an option I would consider. I like when stuff like that (index compatibility) just works for the user. bgzip and indexed_gzip index files can already be loaded by rapidgzip.

I can imagine adding options like -L and -R. I already have "weird" options such as --count-lines and --count, which aren't solely for decompressing and indexing.

but in general I don't like to mount and unmount temporary directories all the time

Understandable. It also takes some runtime to set up a FUSE mount point and there are other difficulties. You could also do:

python3 -c '
import sys
import ratarmountcore as rmc
a = rmc.open(sys.argv[1])
name = next(iter(a.listDir("/").keys()))
f = a.open(a.getFileInfo("/" + name))
f.seek(10)
sys.stdout.buffer.write(f.read(5))
' test.py.gz

The rapidgzip Python-bindings can also be used directly, but then index saving and loading would have to be managed manually.

I am aware that this is too cumbersome for a solution. It would be nicer to have an option directly in the rapidgzip CLI.

wasn't this also the issue with general applicability of pugz?

No. That was only about the deflate block searching heuristic that wasn't stable and fast enough to work with anything other than "printable" characters and even then might fail because of false positives.

koriege commented 3 months ago

However, I am working on a compressed and sparse index format and this feature request is in time to add line information as a consideration to the file.

Sounds great.

In the meantime I made some interesting observation, I wonder weather you could comment on them (or even include them into your benchmark). Even though, I did not turned off NUMA, cache effects and kept cpu frequencies constant, I can say, that the throughput/bandwidth when piping decompressed fastq data back into compression tools heavily depends on the used tool. Whereas I was not able to achieve more then 110MB/s for pigz (76 threads), the maximum throughput of 400MB/s was gained by using rapidgzip on 8 threads streamed into bgzip with 32 threads.

mxmlnkn commented 3 months ago

In the meantime I made some interesting observation, I wonder weather you could comment on them (or even include them into your benchmark). Even though, I did not turned off NUMA, cache effects and kept cpu frequencies constant, I can say, that the throughput/bandwidth when piping decompressed fastq data back into compression tools heavily depends on the used tool. Whereas I was not able to achieve more then 110MB/s for pigz (76 threads), the maximum throughput of 400MB/s was gained by using rapidgzip on 8 threads streamed into bgzip with 32 threads.

I don't think that I entirely understand your benchmark. Are you doing rapidgzip -d -c -P 8 | pigz and rapidgzip -d -c -P 8 | bgzip? How fast are pigz and bgzip without any decompression going on, i.e., a simple cat | pigz? Is there still a difference between bgzip and pigz? Is there a difference topigz <file>? How is CPU utilization? I tried with a small 888 MB perf.data file cat prf.data | pigz is 5x faster than bgzip, so it seems that I cannot reproduce your results. In general, I think that in the range of < 1 GB/s, caching and NUMA shouldn't come into effect yet, but it might depend on the system. Rather it seems as if pigz doesn't parallelize at all. 110 MB/s seems very slow. My 888 MB compressed-decompressed-and-compressed-again time pigz -c perf.data | rapidgzip -d -c | pigz > /dev/null takes ~5 s on my notebook with 8 cores (x2 for hyperthreading).

koriege commented 3 months ago

Are you doing rapidgzip -d -c -P 8 | pigz and rapidgzip -d -c -P 8 | bgzip?

exactly. I am on an AMD EPYC 7513 with an NVMe that should be capable of 7BG/s reading. cat /dev/null makes 2GB/s, pigz -p 2 -kcd and bgzip -@ 2 -kcd around 400MB/s. rapidgzip > /dev/null makes 1.8GB/s. cat | pigz -c > /dev/null maxed out at 48 compression threads resulting in 800MB/s. cat | bgzip -@ 42 -c > /dev/null achieved 900MB/s. pigz | pigz and pigz | bgzip reached 90MB/s. rapidgzip | pigz for 1 to 32 decompresison threads and 1 to 128 compression threads berformed best at rapidgzip -P 4 | pigz -p 76 resulting in 110MB/s. rapidgzip -P 8 | bgzip -@ 32 resulted in 200MB/s and rapidgzip -P 8 --import-index | bgzip -@ 32 in 400MB/s.

I was trying a 5gb and 50gb compressed fastq file. CPU load was according to my parameterizations.

mxmlnkn commented 2 months ago

Regarding the issue title, I see the following steps:

[x] Add read support for gztool indexes
[x] Add write support for gztool indexes
[x] Gather line information
[x] Add write support for gztool indexes with line information
[x] Add an API to read data by offset
[x] Add an API to read data by lines

Roughly, in this order, but I am starting with the API. This was also the reason for asking about the gztool usage. Personally, I dislike the single-letter options. They are hard to read. My current idea for the API would be a single option --ranges that may only be specified in combination with --decompress/-d. Example usages would be:

-d --ranges 10@1: skip 1 byte, read the next 10 bytes
-d --ranges 10@1,2@0: ibid and then print the first 2 bytes. Note that overlapping and out-of order ranges are explicitly allowed.
-d --ranges L10@L1: skip 1 line, read the next 10 lines

This specification has some advantages:

As opposed to specifying ranges as from-to, there are fewer invalid cases. For from-to, we would have to change that to is > than from and it may not even be clear whether the to is inclusive or exclusive.
As opposed to the gztool API, this enables printing multiple things with one call.
The @ (at) should make the expression kind of self-explanatory.
It uses one unified range list instead of two options --byte-ranges and --line-ranges because with two options it wouldn't be clear which of these range lists should be processed first, or I would have to check and throw an error if both are specified at the same time.

And here is where it becomes complicated:

Should mixes of lines and bytes be allowed?
- Something like 10@L10 (read the first 10 bytes at line 10) feels like it could be useful.
- However, something like L10@10 feels less useful. -> Yes, should be allowed
Throw errors for 0-length ranges (0@10) or simply ignore them? I'm slightly favoring the latter. -> Ignore zero-size ranges.
Throw errors or ignore ranges intersecting or after the file end?
- Again, I'm slightly tending towards simply printing as much as possible without errors.
- But, this might invalidate some invariant assumptions such as that the output should be as long as the sum of the byte range lengths. Probably not so important.
The age-old question: begin counting at 0 or 1?
- I'm favoring 0, especially for the bytes, but it feels less natural for line numbers because most text editors count lines from 1, but I guess it could be explained by referring to the (1st) line itself instead of the line at/after (line) offset 0.
- However, I especially do not want to have an inconsistent mix of lines beginning with 1 and bytes beginning with 0. That would be even more confusing. -> Always begin counting at zero. Or alternatively, interpret the offset value as: "The amount of lines to be skipped". This makes it clear that after skipping 0 lines, you start reading from the "first" line.

mxmlnkn commented 1 month ago

This took a bit longer and more code lines than I thought, especially the newline logic. And it still could benefit from many more unit tests, but at least there are some.

Issue #10 is still not implemented, therefore working with large gztool indexes to extract only very small amounts of data is not recommended because the index is always read fully into memory. However, the option to specify multiple ranges should help to alleviate this. Else, the other way around: creating indexes with rapidgzip and accessing lines or byte offsets via gztool could be a viable option.

I have merged it. It would be helpful if you could give it a try. There might still be bugs with the newline features. You can install it with:

python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@master#egginfo=rapidgzip&subdirectory=python/rapidgzip'

This introduces the new command line arguments --index-format and --ranges and makes --import-index automatically work with gztool indexes.

koriege commented 1 month ago

Wow, you made it - this is super cool.. fast decompression with an offset! As far as I can tell you, I didn't observe any issues so far.

But I'd like to add minor recommendations.

1) When a user wants to decompress from, lets say line 4000 to the very end of a file, currently one has to do something like --ranges 999T@4000L or --ranges 99999999L@4000L. Maybe something like inf@4000L or simply @4000L could be useful? 2) --count and --count-lines options could profit from a gztool index as well, since gztool -l </path/to/indexed/gz> has this information included 3) Do you think the gztool index and rapidgzip index could be used or even fused together some when in the future, for an even more efficient decompression?

koriege commented 1 month ago

followup: index creation from stdin does not work anymore. rapidgzip now complains All window sizes must be at least 32 KiB or empty!

mxmlnkn commented 1 month ago

1. When a user wants to decompress from, lets say line 4000 to the very end of a file, currently one has to do something like
   `--ranges 999T@4000L` or `--ranges 99999999L@4000L`. Maybe something like `inf@4000L` or simply `@4000L` could be useful?

[x] Ah yes. I remember thinking about it at some point but then forgot.

2. `--count` and `--count-lines` options could profit from a gztool index as well, since `gztool -l </path/to/indexed/gz>` has this information included

[x] I thought I had something like that implemented for --count already, but that doesn't seem to be the case (anymore). But yes --count-lines can also have a fast path for gztool indexes.

3. Do you think the gztool index and rapidgzip index could be used or even fused together some when in the future, for an even more efficient decompression?

Yes, I think I already mentioned it further above. I am working on yet another index format because I also found some small nitpicks with gztool indexes. I have started it here https://github.com/mxmlnkn/indexed_bzip2/blob/fbec01816f386810f392b7ebef9506acc5a0e8e1/src/rapidgzip/IndexFileFormat.hpp#L223 but I want to wait for the LZ4 implementation to see whether the format is sufficiently generic. I am not sure what you mean with "even more efficient decompression". The index format probably cannot optimize much more for that. The ideal case is bgzip because it produces very small indexes without windows, which reduces I/O read times, processing times, and memory usage. But this is something the compressor has to do. I have already been optimizing a bit to automatically detect windows that are not required at all plus the sparse window detection to reduce index sizes by ~3.

followup: index creation from stdin does not work anymore. rapidgzip now complains All window sizes must be at least 32 KiB or empty!

[x] Damn, so many tests and weird CI issues I had to fix (literal stack overflows on MacOS because std::array<char, 128*1024> ballooned the stack when used for multiple local variables) and there are still major bugs. I have most command line invocations tested in testCLI but anything relating to stdin, is a hassle to test from inside C++.

However, I am getting:

Caught exception: Failed to detect a valid file format.

when trying:

m rapidgzip && cat base64-8MiB.gz | src/tools/rapidgzip --export-index index

I'm not sure why you are seeing another error message.

The introducing commit for this bug is: https://github.com/mxmlnkn/indexed_bzip2/commit/582d6f7d002f2f7970b7d83d5fc502131655875d. As you may be able to see in the commit, it was a really dumb copy-paste error...

mxmlnkn commented 1 month ago

I have pushed a fix for the stdin problem and will release 0.14.1 after the CI has finished. Thank you for noticing the bug! I have also added a test, so it shouldn't happen again so easily. The other two things, as they are small features and performance optimizations, not bugfixes, will probably have to wait until the next minor release.

Ah, did I mention that the sparseness is also used for the gztool indexes? This is why the new sparseness feature meshes surprisingly well with the gztool support feature, both newly added in rapidgzip 0.14.0. E.g., try this:

base64 /dev/urandom |head -c $(( 4 * 1024 * 1024 * 1024 )) | gzip > 4GiB-base64.gz

rapidgzip --no-sparse-windows --index-format gztool --export-index 4GiB-base64.gz{.index,} && stat -c %s 4GiB-base64.gz.index
# -> 39 MB (38811851 B)
# 3 MiB spacing instead of the default 10 MiB for roughly comparable comparison with rapidgzip
gztool -s 3 -I 4GiB-base64.gz{.gzi,} && stat -c %s 4GiB-base64.gz.gzi
# 34 MB -> (33983860 B)

rapidgzip --index-format gztool --export-index 4GiB-base64.gz{.index,} && stat -c %s 4GiB-base64.gz.index
# -> 702 kB (701964)
# The index is compatible with gztool of course, even though it is ~50x smaller!
gztool -b 0 -I 4GiB-base64.gz{.index,}
# ACTION: Extract from byte = 0
# 
# Index file '4GiB-base64.gz.index' already exists and will be used.
# Processing '4GiB-base64.gz' ...
# Processing index to '4GiB-base64.gz.index'...
# 4.00 GiB (4294967296 Bytes) of data extracted.
# 
# 1 files processed
# 
# 229af148b81ef63faf7e723d50173d1f  -
gzip -d -c 4GiB-base64.gz | md5sum
# 229af148b81ef63faf7e723d50173d1f  -

Of course, random base64 data is almost the best case for sparseness ;). For wikidata.json, I observed the mentioned 3x size reduction of the gztool index from ~10 GB indexed_gzip index (windows not compressed) down to ~1.5 GB gztool index (windows compressed) down to ~500 MB for compressed sparse windows for the 102 GiB wikidata.json.gz, which decompresses to ~1 TB.

koriege commented 1 month ago

Ah, did I mention that the sparseness is also used for the gztool indexes?

No, you didn't :) but I already wondered how you achieved gztool index size reduction..

I am not sure what you mean with "even more efficient decompression". The index format probably cannot optimize much more for that.

I meant: When using the default rapidgzip index and the same number of threads as without using the index, the decompression throughput is much higher due to ISA-L. When using the gztool index, I guess the algorithm falls back to your "own custom-written gzip decompression engine". To overcome this, I asked if it would be possible to make use of both indices i.e. using gztool index to infer the line/byte offset and the rapidgzip index for the actual decompression.

koriege commented 1 month ago

I am sorry to bother you again with stdin related issues. Using v0.14.1 index export works with "indexed_gzip" and "gztool" keywords, but fails on "gztool-with-lines" giving me the error

terminate called after throwing an instance of 'std::invalid_argument'
  what():  Cannot seek backwards with non-seekable input!
Aborted (core dumped)

mxmlnkn commented 1 month ago

I meant: When using the default rapidgzip index and the same number of threads as without using the index, the decompression throughput is much higher due to ISA-L. When using the gztool index, I guess the algorithm falls back to your "own custom-written gzip decompression engine". To overcome this, I asked if it would be possible to make use of both indices i.e. using gztool index to infer the line/byte offset and the rapidgzip index for the actual decompression.

You are right about ISA-L and the slower custom decoder. But, ISA-L is also used when importing a gztool index. I.e., gztool already should give all the benefits, no reason to combine it with the old index format. There is one edge case that gztool indexes do not work with:

> rapidgzip --index-format gztool --export-index SRR22401085_2.fastq.gz{.index,}
> cat SRR22401085_2.fastq.gz | /home/maximiliank/.local/bin/rapidgzip --count --import-index SRR22401085_2.fastq.gz.index
Traceback (most recent call last):
  File "/home/maximiliank/.local/bin/rapidgzip", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "rapidgzip.pyx", line 685, in rapidgzip.cli
ValueError: Cannot import gztool index without knowing the archive size!

This is a known limitation of the gztool index because it does not store the total archive size but rapidgzip kinda needs it and therefore tries to query it from the file and this will not work for stdin input.

I am sorry to bother you again with stdin related issues. Using v0.14.1 index export works with "indexed_gzip" and "gztool" keywords, but fails on "gztool-with-lines" giving me the error

[x] No bother. As the message says, are you using non-seekable input like stdin? In that case, it is a known limitation because the line information is currently gathered in post, i.e., rapidgzip needs to seek back in the file, which is not possible for stdin.
[x] I am also getting a ValueError: Window too small! error with rapidgzip --export-index issue-37.tar.gz{.index,}. Decompression works fine, so it seems to come from the index exporting. I'll have to take a look at this later.

koriege commented 1 month ago

gztool already should give all the benefits

I see. So except for index file size and the use case with --count option, there are no other drawbacks when switching to gztool indices?

No bother. As the message says, are you using non-seekable input like stdin? In that case it is a known limitation because the line information are currently gathere in post, i.e., rapidgzip needs to seek back in the file, which is not possible for stdin.

Yes, that's correct, because I would like to create the index asynchronously while compressing the data and not afterwards by doing gzip -kc <file> | tee >(rapidgzip --index-format gztool-with-lines --export-index <file.gzi>) > <file.gz>. This currently works for me using gztool this way gzip -kc <file> | tee >(gztool -v 0 -f -i -x -C -I <file.gzi>) > <file.gz>. If this would be able to achieve with your implementation one could avoid using gztool completely and an top profit from your sparseness feature.

mxmlnkn commented 1 month ago

gztool already should give all the benefits

I see. So except for index file size and the use case with --count option, there are no other drawbacks when switching to gztool indices?

Yes.

No bother. As the message says, are you using non-seekable input like stdin? In that case it is a known limitation because the line information are currently gathere in post, i.e., rapidgzip needs to seek back in the file, which is not possible for stdin.

Yes, that's correct, because I would like to create the index asynchronously while compressing the data and not afterwards by doing gzip -kc <file> | tee >(rapidgzip --index-format gztool-with-lines --export-index <file.gzi>) > <file.gz>. This currently works for me using gztool this way gzip -kc <file> | tee >(gztool -v 0 -f -i -x -C -I <file.gzi>) > <file.gz>. If this would be able to achieve with your implementation one could avoid using gztool completely and an top profit from your sparseness feature.

I see, thanks for the feedback. I'll see what I can do for the next minor release. It might be more difficult than it sounds because of the abstraction layers, but maybe I'll "just" have to move the line counting one level deeper (into ParallelGzipReader or even into the ChunkData, instead of doing it outside in the CLI logic).

mxmlnkn commented 1 month ago

I have fixed some of the performance ideas for --count-lines.

Yes, that's correct, because I would like to create the index asynchronously while compressing the data and not afterwards by doing gzip -kc <file> | tee >(rapidgzip --index-format gztool-with-lines --export-index <file.gzi>) > <file.gz>. This currently works for me using gztool this way gzip -kc <file> | tee >(gztool -v 0 -f -i -x -C -I <file.gzi>) > <file.gz>. If this would be able to achieve with your implementation one could avoid using gztool completely and an top profit from your sparseness feature.

If you are compressing files, I would really recommend bgzip, which also should be a drop-in replacement for gzip and is available via the Debian/Ubuntu package manager. It can also create an index while compressing files, although those indexes do not contain line information. Rapidgzip should also be a lot faster when the input is a bgzip file because it can delegate to ISA-L.

mxmlnkn commented 1 month ago

I have pushed a commit that should make your use case of using stdin input to create an index work to develop. You can try it out with:

python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@develop#egginfo=rapidgzip&subdirectory=python/rapidgzip'

It needs a bit more testing before it will be merged.

koriege commented 1 month ago

If you are compressing files, I would really recommend bgzip, which also should be a drop-in replacement for gzip and is available via the Debian/Ubuntu package manager. It can also create an index while compressing files, although those indexes do not contain line information. Rapidgzip should also be a lot faster when the input is a bgzip file because it can delegate to ISA-L.

I know :) To add up on this, bgzip is also much faster than pigz. But be aware of the version you are using, because from v1.16 on, the developers switched to hfile with a default block size of 4kb when data comes from stdin, which heavily reduces throughput. See my issue opened here bgzip performance drop from v1.16. Unfortunately, the pull request which partially should fix this issue by increasing the block size, is not merged yet.

koriege commented 1 month ago

I have pushed a commit that should make your use case of using stdin input to create an index work to develop. You can try it out with:
python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@develop#egginfo=rapidgzip&subdirectory=python/rapidgzip'
It needs a bit more testing before it will be merged.

Works like charm. You also implemented the infinity option for --ranges inf@[..] Thank you so much!