image-rs / jpeg-decoder

JPEG decoder written in Rust
Apache License 2.0
150 stars 87 forks source link

"failed to decode huffman code" on image that works fine in other software #229

Open SludgePhD opened 2 years ago

SludgePhD commented 2 years ago

The following image fails to decode with a "failed to decode huffman code" error, but it displays "fine" (with some corruption in the bottom part of the image) in Firefox, sxiv, VS Code, and Dolphin:

error

quilan1 commented 2 years ago

This is a case of a corrupt entropy stream, but it's not quite clear what the corruption is. I can, however, illustrate why the stream yields the error. We'll start from the file offset 0xD19 (3353'th byte). Here begins the coefficient decode for the MCU block (x=4, y=14), or in coordinates of the image, (32, 112), specifically component {id=2}.

Please excuse the crude diagram:

   |---69--|---C1--|---A9--|---5B--|---B7--|---1F--|---BE--|---8F--|---FE--|---FB--
   01101001110000011010100101011011100001110001111110111110100011111111111011111011
H3: ---100                   H6: ------011100                  ????????????????????
      H6: ------001101                   H1: --1
                  H1: --0                  H34: --------11
                     H1: --0                          H1: --0
                       H17: ----0                        H0: --

We start off with reading the DC coefficient; H3 means the Huffman symbol for '3' (110b in this table), followed by 3 bits (100b), to be calculated as some coefficient offset value. Irrelevant here. Then it reads the AC coefficients, until it reaches the Huffman '0' symbol, signifying that it's the end of the coefficient information. All good'ish (although I think this row is a bit suspect in general)!

Now comes the next set of coefficients we need to read, for component {id=3}. We have the following set of bits ahead in the stream: 11111111111011111011 -- eleven one-bits ahead.

The problem is that the DC Huffman table in use has no matching code for this pattern of eleven one-bits. It's got the following entries: Value Code Value Code
0 00 6 111110
1 01 7 1111110
2 10 8 11111110
3 110 9 111111110
4 1110 10 1111111110
5 11110 11 11111111110

At this point, I'm not sure what the correct approach to even trying to recover from the issue would be. Somewhere the stream got garbled by some bits, but there's (to my brief look) no obvious place. I suppose the bare minimum one could theoretically do to recover from this type of error, would be to scan forward for the next RST marker, and drop all of the coefficients until then (super ugly), but the problem here is that this stream doesn't have any.

micahsnyder commented 2 years ago

I have a couple more example jpegs that are handled by other software by fail to load with jpeg-decoder because of the "failed to decode huffman code" error.

1574264676-failed-to-decode-huffman-code

1853202599-failed-to-decode-huffman-code

Admittedly, my sample files look even more like garbage than @SludgePhD's file. But it would be nice if image-rs/jpeg-decoder could make a best effort. My use case is to generate image fuzzy hashes for malware identification. So my hope is to make a best effort even if the image is technically malformed.

fintelia commented 1 year ago

@micahsnyder If you want to match what other viewers display as closely as possible, I'd suggest relying on libjpeg since that's what those viewers are probably using. Any other decoder implementation is almost surely going to display some corrupted images differently since they don't usually target "bug for bug compatibility"

micahsnyder commented 1 year ago

@fintelia that's good advice. Thanks. In my case it's okay if it's not rendered the same, so long as it gives a usable image that can be hashed. Our tools for both creating the hashes and matching the hashes use the same code with image-rs / jpeg-decoder. So I think it is okay for us if the best effort comes out a little differently.

MoSal commented 1 year ago

So this basically relates to #252, except "partially-invalid" would be more accurate than "incomplete", and you wouldn't want the decoder to give up as early as possible if you want this to be generally useful.

This patch puts the decoder on par with libjpeg-turbo and FFmpeg (with this patch applied) in my very limited tests.

The first part of the patch (the read_marker part) was done to handle the image from @micahsnyder. This may not be how/where EOF errors should/would be handled.

The last part (continuing after decode block errors) should be generally useful. It fixed an image that brought me here, and it's enough to handle the image from @SludgePhD.