ulikunitz / xz

Pure golang package for reading and writing xz-compressed files
Other
477 stars 45 forks source link

Remove the check on dictionary capacity #52

Closed bodgit closed 1 year ago

bodgit commented 1 year ago

I'm using your xz package in my 7-zip reader package to decompress LZMA and LZMA2 streams. I've had some reports of some archives being unreadable with a lzma: dictionary capacity too small error.

I thought I could override the default ReaderConfig and set a bigger dictionary capacity however it seems the check that triggers this error fires before the capacity is overridden by the ReaderConfig.DictCap value. The default DictCap value is much bigger than MinDictCap anyway, (8 MB vs 4 KB IIRC).

Removing this check allows the ReaderConfig.DictCap value to override whatever was in the stream header and it's guaranteed to be at least MinDictCap thanks to ReaderConfig.Verify().

The unreadable LZMA streams were then readable.

ulikunitz commented 1 year ago

Thank you for bringing that to my attention. Returning an error is indeed wrong, because the LZMA specification clearly states that a value smaller than 4096 byte should be set to 4096 byte. Since I want to do what the spec says I will not apply your pull request, but make sure that your approach of overriding the dictionary size works. I will also add a test case. You can expect a v0.5.11 in the next week. (I will also change the logic in my newer rewrite branch.)

I have one question though, would the 4096 dictionary size value work with your LZMA streams or do you require higher dictionary sizes? I would like to understand the real life situation.

bodgit commented 1 year ago

I have one question though, would the 4096 dictionary size value work with your LZMA streams or do you require higher dictionary sizes? I would like to understand the real life situation.

I'll try and find out what the value is actually set to. It's either going to be somewhere between 0 and 4096 or maybe its just 0. The archives in question are apparently quite old so it might just be an old/buggy 7-zip version that didn't encode the stream header correctly. I've got a test case made with a current version of 7-zip that uses LZMA which contains two streams, one for the archive metadata and another for the file contents and the dictionary is set to 4096 and 49152 respectively.

I'm not really familiar with the LZMA algorithm so I don't know what the behaviour is if the size is set too small. If the streams are setting a value that is non-zero, but less than 4096, then I suspect rounding up to 4096 would work fine. If it's unset, i.e. 0 because it wasn't encoded properly, then the library default of 8 MB seems to work fine but I'm not sure how to find an optimal value automatically.

ulikunitz commented 1 year ago

The LZ method supports copies of segments of the file preceding the current position of the file to be decompressed. The dictionary size describes the size of the sliding window from which the segments can be copied. If the window is larger than the dictionary size required, no issue will occur, if it is too small, the file will not be successful decoded. The window doesn't need to be larger than the stream to be decompressed. The archive metadata may be shorter than 4096 byte and so the data size might be used as window size, which would then be smaller than 4096 byte.

I suggest that you simply use the new implementation. If it's works without overriding "dictCap" than there is no issue, if you still need to override the value, your example stream is non-compliant with the LZMA specification.

ulikunitz commented 1 year ago

I released v0.5.11. The version should fix your issue.

ulikunitz commented 1 year ago

Hi, I would be interested to know whether the release works for you and whether it works without setting DictCap in the config parameter.

ulikunitz commented 1 year ago

Since it has been confirmed that 0.5.11 solves the issue, I close this pull request.

bodgit commented 1 year ago

It seems the streams with a dictionary capacity less than 4096 had the capacity set to 1024 or 2048, so ensuring it's at least 4096 seems to work fine.

Thanks for the bug fix!