jgm / pandoc

Universal markup converter
https://pandoc.org

not detecting the size of some PDF images #6321

Open dherring opened 4 years ago

dherring commented 4 years ago

Seen on recent Pandoc versions, including 2.9.2.1. Pandoc seems to correctly identify the PDF /MediaBox when it is in an object, compressed object, or uncompressed stream, but not when it is in a compressed stream. Unfortunately, pdflatex generates compressed streams by default, and so LaTeX files that include LaTeX figures do not convert correctly.

See attached file for a minimum working example. pandoc-pdf-issue.zip

Compare doc.docx with doc.pdf to see the issue. Look in the Makefile to see how everything is built.

The fix is probably near pandoc/src/Text/Pandoc/ImageSize.hs, but I didn't dig much beyond spotting the /MediaBox.

jgm commented 4 years ago

Is the issue here that the /MediaBox occurs inside the compressed stream, e.g. inside this?

<<
/Length 289       
/Filter /FlateDecode
>>
stream
...binary stuff...
endstream

So we'd have to decompress the stream using the specified algorithm (FlateDecode) and then look for /MediaBox in the result? Note that there are many possible compression algorithms: see https://blog.didierstevens.com/2008/05/19/pdf-stream-objects/
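
To make the idea concrete, here is a rough Python sketch (not pandoc's Haskell code, and not a proposed patch) of "inflate each FlateDecode stream, then look for /MediaBox". The function name `find_mediabox` and both regular expressions are assumptions made for illustration. It scans the raw bytes naively rather than parsing the PDF object structure, and it ignores every filter other than FlateDecode.

```python
import re
import zlib

# Naive match for "<< dict >> stream <data> endstream" (hypothetical helper pattern).
STREAM_RE = re.compile(rb"<<(.*?)>>\s*stream\r?\n(.*?)endstream", re.DOTALL)

# /MediaBox [llx lly urx ury] with plain numeric entries only.
NUM = rb"([-+]?\d*\.?\d+)"
MEDIABOX_RE = re.compile(
    rb"/MediaBox\s*\[\s*" + NUM + rb"\s+" + NUM + rb"\s+" + NUM + rb"\s+" + NUM + rb"\s*\]"
)


def _parse(match):
    """Convert the four captured numbers to floats."""
    return tuple(float(g) for g in match.groups())


def find_mediabox(pdf_bytes):
    """Return the first /MediaBox found, looking inside FlateDecode streams too."""
    # Case 1: the box is visible in the uncompressed part of the file.
    match = MEDIABOX_RE.search(pdf_bytes)
    if match:
        return _parse(match)

    # Case 2: inflate every stream whose dictionary names FlateDecode.
    for stream in STREAM_RE.finditer(pdf_bytes):
        dictionary, data = stream.group(1), stream.group(2)
        if b"/FlateDecode" not in dictionary:
            continue
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            inflated = zlib.decompressobj().decompress(data)
        except zlib.error:
            continue  # not valid zlib data; skip rather than fail
        match = MEDIABOX_RE.search(inflated)
        if match:
            return _parse(match)
    return None
```

Calling `find_mediabox(open("figure.pdf", "rb").read())` on a pdflatex-generated figure (the file name is a placeholder) would return a 4-tuple of floats if a box is found, or `None` otherwise. A real fix would also need to honor /DecodeParms predictors and indirect references, which this sketch skips.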

dherring commented 4 years ago

I believe that is the root problem and required fix.

I forgot that compressed streams had more options than binary objects. However, only one or two filters may suffice. FlateDecode covers my use case. LZWDecode might also appear in the wild. I would be surprised if somebody used ASCIIHex, ASCII85, or RunLength for this. The CCITTFax, JBIG2, DCT, and JPX filters can be ruled out, as they apply to images. That leaves Crypt, for which I think Pandoc could justifiably require manual decryption.

lrosenthol commented 4 years ago

It's more complicated than that...

Even if you resolve this, you are still missing a few things when trying to process PDF page sizes:

1. You should be using CropBox, when present, over MediaBox.
2. You need to consider Rotate, for when the page is rotated.
3. You need to consider UserUnit, for when the page is scaled.
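
As a rough illustration of how those three points combine, here is a small Python sketch. The function name `effective_page_size` and its parameters are made up for this example and are not part of pandoc. It assumes the box values have already been extracted, that Rotate is a multiple of 90, and that UserUnit defaults to 1.0 (one unit is 1/72 inch). It does not clip CropBox to MediaBox.

```python
def effective_page_size(media_box, crop_box=None, rotate=0, user_unit=1.0):
    """Return (width, height) in points (1/72 inch) for one page.

    media_box, crop_box: (llx, lly, urx, ury) in default user-space units.
    rotate: clockwise page rotation in degrees, assumed a multiple of 90.
    user_unit: size of one user-space unit in multiples of 1/72 inch.
    """
    # 1. Prefer CropBox over MediaBox when it is present.
    llx, lly, urx, ury = crop_box if crop_box is not None else media_box

    # 3. Scale by UserUnit to get real-world points.
    width = abs(urx - llx) * user_unit
    height = abs(ury - lly) * user_unit

    # 2. A 90 or 270 degree rotation swaps width and height.
    if rotate % 180 == 90:
        width, height = height, width

    return width, height
```

For example, a US Letter MediaBox of `(0, 0, 612, 792)` with `rotate=90` gives a landscape size, with width and height swapped.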