LZW attempts to decode buffer after output is already filled

dalcde commented 7 months ago

In DecodeState::advance, after processing a burst, the decoder unconditionally processes the new code. However, we shouldn't do so if the output is already filled, because the remaining bits in the buffer may be nonsense. This caused an InvalidCode error when trying to read one of my images.

I attempted to write a fix at https://github.com/dalcde/lzw/tree/check-out but I didn't make a PR because I'm not confident it is correct.

HeroicKatora commented 7 months ago

Can you share the reproduction, the binary lzw stream if possible.

dalcde commented 7 months ago

I added a (failing) test in https://github.com/dalcde/lzw/tree/add-test . I checked that my fix indeed fixes the test, but it breaks other tests, so it is definitely incorrect. I can try to make the tests pass but I still wouldn't trust the result to be correct...

HeroicKatora commented 7 months ago

I don't see this being a bug. The input stream is invalid, it should end with an end code (129 = 0x81), and it doesn't since it instead has the code 0xff at this location. The buffer size shouldn't influence the validity check of the stream and it doesn't in this case. If the stream should end early, it shouldn't be supplied in the input.

dalcde commented 7 months ago

Empirically some programs seem to produce TIFF files that are missing these end codes. Someone else has run into it here: https://stackoverflow.com/questions/55674925/decoding-tiff-lzw-codes-not-yet-in-the-dictionary

It would be helpful to support this case even if it is technically invalid.

HeroicKatora commented 7 months ago

The library reports the valid filled output buffer size as part of BufferResult (and all the other variants), even in the error case (that was part of the considerations when choosing to return a structure instead of a Result at the top level here). If an invalidly terminated streams should be considered valid, does it work to check whether the output buffer was filled far enough and then ignore the reported InvalidCode error?

We seem to be missing explicit test cases for that guarantee though. The minimal example you've got in your branch would be perfect for verifying these assertions, with a variant for each of the IO styles if possible.

dalcde commented 7 months ago

I think that makes it a bit awkward to write a Read wrapper around it; you want this to fail after the user asks for the next byte, so you need to cache that.

On 7 April 2024 12:48:54 am HKT, Andreas Molzer @.***> wrote:

The library reports the valid filled output buffer size as part of BufferResult (and all the other variants), even in the error case (that was part of the considerations when choosing to return a structure instead of a Result at the top level here). If an invalidly terminated streams should be considered valid, it should work to check whether the output buffer was filled far enough and then ignore the reported InvalidCode error.

We seem to be missing explicit test cases for that guarantee though. The minimal example you've got in your branch would be perfect for verifying these assertions, with a variant for each of the IO styles if possible.

-- Reply to this email directly or view it on GitHub: https://github.com/image-rs/lzw/issues/41#issuecomment-2041135964 You are receiving this because you authored the thread.

Message ID: @.***>

fintelia commented 7 months ago

If the tiff crate is failing to decode some images produced by a widely used encoder, then I think it might make sense to treat this as a bug at the level of that crate. ImageMagick has its own TIFF implementation ~~so it is plausible it sometimes generates corrupt files, though if so, would be nice to file an issue upstream~~. Edit: That implementation calls into libtiff for the actual compression

HeroicKatora commented 7 months ago

The API required to do this properly might look like a form of Read::take, some new control to exit early on reaching some limit of total bytes. I suppose that is feasible, but it's unclear how to achieve it without loss of performance. It might require a separately monomorphized (i.e. by const-generics) control loop.

dalcde commented 7 months ago

I'm happy to file this with the tiff library instead, but it's not wrong for lzw to ignore the remaining input after the output buffer is filled, and there should be no/minimal performance impact when doing so (we can check this only if we hit the invalid code branch). I think it would make life easier for downstream users if lzw handles this

On 8 April 2024 9:34:18 pm HKT, Andreas Molzer @.***> wrote:

The API required to do this properly might look like a form of Read::take, some new control to exit early on reaching some limit of total bytes. I suppose that is feasible, but it's unclear how to achieve it without loss of performance. It might require a separately monomorphized (i.e. by const-generics) control loop.

-- Reply to this email directly or view it on GitHub: https://github.com/image-rs/lzw/issues/41#issuecomment-2042774461 You are receiving this because you authored the thread.

Message ID: @.***>

fintelia commented 7 months ago

Please do create the the tiff crate issue. There's no need to close this issue, but it would help to discuss in the context of specific real-world files that are failing. That way we can decide how/if to handle those files.

(And if I've misunderstood and your failing LZW streams aren't from TIFF files please do say so!)

image-rs / weezl

LZW attempts to decode buffer after output is already filled #41