jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/

Fix compressed_jpeg_image_transformer.py to use decoded bytes instead of original #142

Closed jcallaha closed 2 years ago

jcallaha commented 2 years ago

After calling decode_stream() to decompress the image stream before handing it off to PIL, the code should reference "DecodedBytes" instead of "Bytes": "Bytes" holds the original stream data, not the decoded data.

jorisschellekens commented 2 years ago

After giving this some thought, I think this works as intended. PIL expects these bytes to be compressed, and the tests confirm that PIL is able to extract an image from a PDF.

jcallaha commented 2 years ago

Hi Joris - thanks for the quick response!

I think I now see the full issue: in my case the stream is set as /Filter[/FlateDecode/DCTDecode], so the JPEG data is additionally Flate-compressed. If the image stream is just DCTDecode, then PIL does the right thing with the stream bytes directly, but in that case the transform is actually done by jpeg_image_transformer.py, because it is added first and it only looks for DCTDecode on the image stream.
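The difference between the raw and the decoded stream can be sketched with the standard library alone. This is only an illustration under stated assumptions: `apply_non_image_filters` is a hypothetical helper (not borb's API), filter names are plain strings, and the "JPEG" is a stand-in byte string that carries just the start and end markers rather than a real image.

```python
import zlib
from typing import Sequence

JPEG_SOI = b"\xff\xd8"  # every JPEG begins with the Start Of Image marker


def apply_non_image_filters(raw: bytes, filters: Sequence[str]) -> bytes:
    """Hypothetical helper: undo every filter that precedes DCTDecode.

    PDF filters are listed in the order in which they must be applied when
    decoding, so [FlateDecode, DCTDecode] means "inflate first, then treat
    the result as JPEG data".
    """
    data = raw
    for name in filters:
        if name == "FlateDecode":
            data = zlib.decompress(data)
        elif name == "DCTDecode":
            break  # JPEG decoding itself is left to the image library
        else:
            raise ValueError(f"unsupported filter in this sketch: {name}")
    return data


# Stand-in for JPEG data (assumption: only the markers matter here).
fake_jpeg = JPEG_SOI + b"\x00" * 16 + b"\xff\xd9"

# What the original stream bytes would look like for a Flate-compressed JPEG.
stream_bytes = zlib.compress(fake_jpeg)

decoded = apply_non_image_filters(stream_bytes, ["FlateDecode", "DCTDecode"])

print(stream_bytes.startswith(JPEG_SOI))  # raw stream: zlib header, not JPEG
print(decoded.startswith(JPEG_SOI))       # decoded stream: JPEG data
```

An image library handed `stream_bytes` sees a zlib header rather than a JPEG marker, which is why the decoded bytes are the ones that need to be passed on in the double-filter case.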

It looks like the code in compressed_jpeg_image_transformer.py intends to handle the case of the double encoding because it returns True for can_be_transformed() if the Filter is either just "DCTDecode" or if it is an array of filters with DCTDecode at the end:

                object["Filter"] == "DCTDecode"
                or (
                    isinstance(object["Filter"], list)
                    and len(object["Filter"]) > 1
                    and object["Filter"][-1] == "DCTDecode"
                )

I think the first case (object["Filter"] == "DCTDecode") is already handled by jpeg_image_transformer.py, so compressed_jpeg_image_transformer.py could probably be simplified. Not sure how to proceed - I wish I had a simple test case I could share.
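One possible shape for the simplified check, as a sketch only: it assumes the single-filter case really is always claimed by jpeg_image_transformer.py, and it uses a hypothetical standalone function with plain strings for the filter names rather than borb's own types.

```python
def is_compressed_jpeg(filter_value) -> bool:
    """Hypothetical predicate: True only for a filter array ending in DCTDecode.

    A bare "DCTDecode" (or a one-element array) is deliberately rejected,
    on the assumption that the plain JPEG transformer handles that case.
    """
    return (
        isinstance(filter_value, list)
        and len(filter_value) > 1
        and filter_value[-1] == "DCTDecode"
    )


print(is_compressed_jpeg(["FlateDecode", "DCTDecode"]))  # double encoding
print(is_compressed_jpeg("DCTDecode"))                   # plain JPEG
```

With this split, the compressed transformer would only ever see streams that need decoding before the bytes reach PIL.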