madler / zlib

A massively spiffy yet delicately unobtrusive compression library.
http://zlib.net/
Other
5.7k stars 2.45k forks source link

error in decompressing zlib format file #766

Closed Dongmuliang closed 1 year ago

Dongmuliang commented 1 year ago

Hi, I recently found an error when attempting to decompress a zlib format file. However, I also tried another parser, i.e., libdeflate, and it can parse the file without issue. Specifically, I use this command (zpipe is built from the examples directory)

./zpipe -d < poc.zlib > output

The problematic file is attached (poc.zlib is included in the zip). poc.zip

madler commented 1 year ago

Thank you for the report and example. I checked, and that deflate data is indeed invalid and should be rejected, as zlib does.

From @ebigger's libdeflate decompression code:

Note: we don't need verify that the repeat count doesn't overflow the number of elements, since we've sized the lens array to have enough extra space to allow for the worst-case overrun

libdeflate is choosing to ignore this error in the deflate data.

Those reponsible for the compressor that generated the invalid deflate data should be informed.

Dongmuliang commented 1 year ago

Hi @madler, thanks for your quick check and the response, especially the explanation, which really helps!

I generated the data myself to test whether the algorithm works as expected.

During testing, I also discovered that many other inputs that have the same issue - zlib rejects them while libdeflate accepts them. I did some initial analysis and filtered out those with the same cause, but still have three interesting cases to share with you. The files are attached. pocs.zip

madler commented 1 year ago

libdeflate's inflate does not consume all of the input for any of those three examples. poc2 is invalid due to a bad distance code, poc3 due to an invalid literal/length code, and poc3 for the same reason as the first poc, which is too many code lengths. poc differed in that exactly all of its data was consumed by libdeflate's inflate.

In all cases, libdeflate is barrelling through the invalid deflate data and then finding that the Adler-32 check value, wherever it happened to think that is with whatever uncompressed data it emitted, checks out. I can only guess that this is by chance. In all cases, the resulting uncompressed data appears corrupted. Are there many examples that libdeflate's inflate also reports an error on?

I find it hard to believe that there is any deflate code out there making all of these errors. Your compressed data must be getting corrupted somewhere along the way.

Dongmuliang commented 1 year ago

I can only guess that this is by chance. In all cases, the resulting uncompressed data appears corrupted.

All these data is constructed by some mutations on an existing valid file (i.e., by AFL's fuzzing). Since zlib is very important and widely used in many applications, unexpected behaviors may have security issues (e.g., if a real-time system fails to parse a valid data in time may lead to some unexpected damages).

Although this data appears corrupted, the flaw can be maliciously exploited. Assuming malicious attackers know the flaws, thus they can construct meaningful data based on flaws in the implementation, and it is not hard to bypass Adler-32 check.

This exploitation can be very stealthy and hard to be detected, because all data seems good and may finally conclude it is because of flaws in implementation, not be regarded as an attack.

Are there many examples that libdeflate's inflate also reports an error on?

Yes, but both impls report the erorr, so there is no divergence

cause of a bad distance code, invalid literal/length code

Maybe this rejection process is not correct, because other impl does not reject it. Do you have a detailed specification on zlib error handling? (e.g., the specification clearly describes some rules that all implementation must follow). Then I am happy to conduct analysis and identify / eliminate the potential issues.

madler commented 1 year ago

All these data is constructed by some mutations on an existing valid file (i.e., by AFL's fuzzing).

Then why didn't you tell me that in the first place? You have been wasting my time checking to see if zlib is rejecting valid zlib data. You are deliberately corrupting the data, zlib is detecting the corruption, and that's a problem?! The problem is that libdeflate is not detecting corruptions that could be detected. I'm sure that if you tried hard enough, you could find a random corruption that converts a valid zlib stream into another valid zlib stream, and zlib would say it's good, because it is. So what?

Although this data appears corrupted, the flaw can be maliciously exploited. Assuming malicious attackers know the flaws, thus they can construct meaningful data based on flaws in the implementation, and it is not hard to bypass Adler-32 check.

That's just stupid. You do not need to try 1013 random mutations in order to construct meaningful zlib data. You can just run any compliant compressor with any malicious data you like and construct meaningful data the first time, every time! If you can insert flaws in the data, then you can insert your own, good, constructed zlib streams in there as well. A valid zlib stream is in no way any indication of authenticity, and was never intended to be.

You have not found anything that crashes either zlib or libdeflate, so there is no security issue.

Dongmuliang commented 1 year ago

I am so sorry for my mistake and taking your time. I read your reply carefully and partly agree with what you said.

You are deliberately corrupting the data, zlib is detecting the corruption, and that's a problem?!

Mutation does not necessary produce corruptted data, but I should first check the problem in libdeflate, this is my mistake. I should have told you all the backgroud at the first place, sorry again!

I'm sure that if you tried hard enough, you could find a random corruption that converts a valid zlib stream into another valid zlib stream, and zlib would say it's good, because it is. So what?

This is the important point, if zlib say it's good, while others say not, this is the indication that one have some logic errors. The question here is extremely hard to know it's good because it is is true. So I put here and ask for help.

l am very appreciated for your detailed reply!