mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

Found mismatching block problem #18

Closed. wjd000 closed this issue 10 months ago.

wjd000 commented 1 year ago

[Info] Detected a performance problem. Decoding might take longer than necessary. Please consider opening a performance bug report with a reproducing compressed file. Detailed information:
[Info] Found mismatching block. Need offset 1660952526 B 0 b. Look in partition offset: 1660944384 B 0 b. Found possible range: [1660949494 B 0 b, 1660949494 B 0 b]
[Info] Detected a performance problem. Decoding might take longer than necessary. Please consider opening a performance bug report with a reproducing compressed file. Detailed information:
[Info] Found mismatching block. Need offset 8921299241 B 0 b. Look in partition offset: 8921284608 B 0 b. Found possible range: [8921292099 B 1 b, 8921292099 B 1 b]

How can I solve this?

mxmlnkn commented 1 year ago

It would be most helpful if you could send me the gzip file that led to this error, or even just the megabyte around the problematic offsets. In general though, you can simply ignore the warning if there are only one or two of them, or if decompression is sufficiently fast. In the future, I might hide this warning entirely in release versions.

wjd000 commented 1 year ago

Thank you for your reply, but I'm sorry, I cannot send you the file. Could this be caused by multi-core decompression? I used 60 vCPUs. Also, there are many of these warnings, but the decompression speed is very fast.

mxmlnkn commented 1 year ago

Yes, it should be a purely multi-core issue; it shouldn't happen for single-core decompression with -P 1. Note that I ran benchmarks on 128-core systems, so a 60-core CPU should work fine.
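One way to confirm this (a sketch; the file name is a placeholder) is to decompress the same file once with the default parallelism and once single-threaded, where no speculative start positions are used:

```bash
# Default: parallel decompression; the mismatching-block messages may appear.
rapidgzip -d -c large-file.gz > /dev/null

# Single-threaded decompression: no speculative block finding, so the
# message should not appear (at the cost of the parallel speedup).
rapidgzip -d -c -P 1 large-file.gz > /dev/null
```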

Let me try to explain when this warning appears: for parallel decompression, each thread starts decompressing at a different position inside the file. To succeed, however, the threads need to find valid start positions, and in some cases they may find something that looks like a valid start position but actually isn't one. The main thread can detect this and then has to decompress again from the correct position; the result from the wrong position is discarded, and all work done for it was for naught. That's why a warning is printed that performance might have been affected. The hope behind that warning was that I might be able to improve the detection of valid start positions in some of these cases, but at the current stage of development, I think I have done almost everything possible to keep these cases to a minimum.

In general, the number of such warnings can be reduced by increasing the chunk size, e.g., with rapidgzip --chunk-size $(( 16 * 1024 )) .... However, this also increases memory usage.
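Spelled out (a sketch; the file name is a placeholder, and assuming --chunk-size takes its argument in KiB, this requests 16 MiB chunks):

```bash
# Larger chunks mean fewer chunk boundaries and therefore fewer speculative
# start positions that could mismatch, but each worker buffers more data.
rapidgzip -d -c --chunk-size $(( 16 * 1024 )) large-file.gz > /dev/null
```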

wjd000 commented 1 year ago

Thank you for your answer. Since I am busy with other things, I will test different chunk sizes later. I compared the decompressed file with the source file and they are identical.

mxmlnkn commented 10 months ago

I have hidden this informational message in default usage, and I will close this issue because I cannot reproduce it. The message will still appear when --verbose is specified. Note that there is also a --verify option that computes the CRC32 and compares it, also in parallel; it still has roughly 10% overhead. Maybe I will make it the default at some point in the future, when "performance" isn't my main goal anymore. In some respects, I/O is already the bigger issue: it is hard to write out 20+ GB/s to anywhere. These speeds only make sense when rapidgzip is used as a library in a kind of preprocessing step.
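For example, to decompress with the parallel checksum verification enabled (a sketch; the file name is a placeholder):

```bash
# --verify computes the CRC32 of the decompressed data in parallel and
# compares it against the checksum stored in the gzip footer.
rapidgzip -d -c --verify large-file.gz > /dev/null
```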

mxmlnkn commented 10 months ago

I have actually found a file where this happens: a 23 MiB .tar.gz that was gzip-compressed a second time, i.e., a .tar.gz.gz file. One of the speculative decompression threads seems to have found the stream of the original .tar.gz file inside the outer gzip stream and was able to decode 4 blocks from it before falling back to the outer gzip stream, without any way to detect the error. I'm not entirely sure how the second gzip compression was done, because it contains non-compressed blocks of only 8-16 kB, each followed by 0-B non-compressed blocks (deflate flush markers). In any case, rapidgzip can safely recover from such a wrong decompression start, and while that may slow down decompression, here it happens in non-compressed blocks, which "decompress" very fast.
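A reproduction along these lines (a sketch only, with placeholder paths; plain gzip may not produce exactly the non-compressed-block pattern described above, so this merely assumes that double compression can trigger similar speculative false positives) could look like:

```bash
# Create a .tar.gz and compress it a second time into a .tar.gz.gz.
tar -czf data.tar.gz some-directory/
gzip -k data.tar.gz            # keeps data.tar.gz, writes data.tar.gz.gz

# Decompress the doubly-compressed file with verbose output to see whether
# the mismatching-block messages appear for the inner gzip stream.
rapidgzip --verbose -d -c data.tar.gz.gz > /dev/null
```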