Closed pdjstone closed 6 months ago
I can provide the file if you need it (it's part of a publicly available firmware image)
Yes, could you please provide the file? It's nice to know that it works in 0.12.1, then I can do a bisection to find the problematic commit.
Thank you. Commit 0df5d753 introduced the bug, or at least made it much more likely to happen: it accidentally requests a chunk size of 1 B for `verbose=True` (and 0 B for `verbose=False`), which gets clamped to the allowed minimum of 8 KiB instead of the intended default of 4 MiB. I have pushed a fix and will do a bugfix release today.
I want to give the underlying issue a closer look, but I think I already know what happens: a false positive during block finding. With only 8 KiB of data per chunk, there is very little to check a candidate match against. If, for example, a match for a non-compressed (stored) block is seemingly found, which is not that unlikely (roughly once every 256 KiB in random data), then the following up to 64 KiB of stored data cannot deliver anything to check against. Only the next deflate block header could be verified, and it would already lie outside the chunk. I guess a minimum chunk size of ~128 KiB would reduce the probability of that happening enormously. And to not rely on probabilities at all, I could simply implement a fallback that decompresses on demand instead of throwing when such an inconsistency is detected.
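To make the false-positive rate concrete, here is a small sketch (not rapidgzip's actual matcher; the function name and the byte-aligned simplification are mine) of checking whether a position looks like the start of a non-compressed (stored) deflate block. The BTYPE bits must be 00, and the stored length LEN must be followed by its one's complement NLEN; for a byte-aligned candidate on random data that is roughly a 1/4 × 2⁻¹⁶ chance, i.e. about once per 256 KiB, matching the rate mentioned above.

```python
import struct

def looks_like_stored_block(data: bytes, offset: int) -> bool:
    """Sketch: does `offset` look like the start of a byte-aligned,
    non-compressed deflate block?  A real matcher also tests all 8 bit
    offsets within each byte; this is a simplification."""
    if offset + 5 > len(data):
        return False
    # Bit 0: BFINAL (either value allowed); bits 1-2: BTYPE, 00 = stored.
    if data[offset] & 0b110 != 0:
        return False
    # A stored block header carries LEN and NLEN == ~LEN (16 bits each).
    length, nlength = struct.unpack_from("<HH", data, offset + 1)
    return nlength == (~length & 0xFFFF)

# A crafted valid-looking header is accepted...
header = bytes([0b001]) + struct.pack("<HH", 100, ~100 & 0xFFFF)
assert looks_like_stored_block(header, 0)
# ...while a position with BTYPE != 00 is rejected immediately.
assert not looks_like_stored_block(b"\xff" * 8, 0)
```

Once such a spurious match is accepted, up to 64 KiB (the maximum LEN) of following data is opaque stored content, so an 8 KiB chunk leaves nothing with which to disprove the match.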
If you want a quick and dirty workaround without installing from the updated source, try `rapidgzip.open(fd, verbose=4 * 1024 * 1024, parallelization=4)` (because of the argument bug, the value passed for `verbose` ends up being used as the chunk size). It fixes the issue in your reproducer for me.
Thanks, that does fix it for me - I hadn't realised I was passing in the parameters incorrectly. However, I also see the issue with 0.13.0 when just doing `rapidgzip.open(fd)`, and was also seeing generally poor performance on other files even when not encountering that specific exception. Is it choosing an 8 KiB chunk size by default?
> Thanks, that does fix it for me - I hadn't realised I was passing in the parameters incorrectly.
You did everything correctly. The bug was in the rapidgzip Python layer: I added new arguments for the chunk size and the I/O read method, and this broke some calls that used only positional arguments.
> Is it choosing an 8 KiB chunk size by default?
The C++ backend does `std::max( 8_Ki, chunkSize )`, and because of the argument bug, `chunkSize` on the Python side defaulted to 0. This should be fixed now, and performance should be restored as well. The default is 4 MiB.
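In Python terms (the names here are illustrative, not the actual bindings), the interaction between the clamp and the argument bug looks like this:

```python
MIN_CHUNK_SIZE = 8 * 1024             # lower bound enforced by the C++ backend
DEFAULT_CHUNK_SIZE = 4 * 1024 * 1024  # what the Python layer should forward

def backend_chunk_size(chunk_size: int) -> int:
    # Mirrors std::max( 8_Ki, chunkSize ): the backend only enforces the
    # minimum; it never substitutes the 4 MiB default itself.
    return max(MIN_CHUNK_SIZE, chunk_size)

# With the argument bug, the Python side forwarded 0 (or 1 for verbose=True)
# instead of the default, so every chunk was clamped to a tiny 8 KiB:
assert backend_chunk_size(0) == 8 * 1024
# With the fix, the intended default reaches the backend unchanged:
assert backend_chunk_size(DEFAULT_CHUNK_SIZE) == 4 * 1024 * 1024
```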
I get the following exception when using rapidgzip 0.13.0 to decompress a specific raw deflate stream:

```
RuntimeError: Next block offset index is out of sync! Requested offset to index 11070 and got 1481048062 but expected 1481048069
```

If I decompress the same data but with the gzip header and footer, I don't get the exception. rapidgzip 0.12.1 works fine. Currently I can only reproduce this with a specific ~300 MB file and I've not been able to create a reduced test case. I can provide the file if you need it (it's part of a publicly available firmware image), but the following code triggers the issue for me:
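The reproducer itself is not shown in this excerpt. For reference, the difference between the two inputs described above (the same data as a raw deflate stream versus wrapped with a gzip header and footer) can be illustrated with Python's standard library, where `wbits=-15` selects headerless raw deflate. This is only a format illustration, not the original reproducer:

```python
import gzip
import zlib

payload = b"example payload " * 4096

# Raw deflate: no header, no footer (the format of the failing input).
compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_deflate = compressor.compress(payload) + compressor.flush()

# The same data wrapped with the gzip header and CRC32/size footer,
# which did not trigger the exception.
gzip_stream = gzip.compress(payload)

# Both decompress back to the identical payload with the stdlib:
decompressor = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
assert decompressor.decompress(raw_deflate) + decompressor.flush() == payload
assert gzip.decompress(gzip_stream) == payload
```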