Open dherring opened 4 years ago
Is the issue here that the /MediaBox
occurs inside the compressed stream?
E.g. inside this?
<<
/Length 289
/Filter /FlateDecode
>>
stream
...binary stuff...
endstream
So we'd have to decompress the stream using the specified algorithm (FlateDecode) and then look for /MediaBox in the result? Note that there are many possible compression algorithms: see https://blog.didierstevens.com/2008/05/19/pdf-stream-objects/
I believe that is the root problem and required fix.
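For illustration, the decompress-then-search step could look roughly like the Python sketch below. It is not Pandoc's code (that is Haskell, in ImageSize.hs). The function name `find_media_box` and both regular expressions are hypothetical, and the sketch assumes FlateDecode only, flat stream dictionaries, and that the literal bytes `endstream` never occur inside compressed data.

```python
import re
import zlib

# Hypothetical pattern: a stream dictionary followed by its raw data.
# Assumes the dictionary is not nested and "endstream" is not in the data.
STREAM_RE = re.compile(
    rb"<<(?P<dict>.*?)>>\s*stream\r?\n(?P<data>.*?)endstream",
    re.DOTALL,
)

# Hypothetical pattern: /MediaBox [x0 y0 x1 y1] with plain numeric entries.
MEDIABOX_RE = re.compile(
    rb"/MediaBox\s*\[\s*([-+\d.]+)\s+([-+\d.]+)\s+([-+\d.]+)\s+([-+\d.]+)\s*\]"
)


def _parse(match):
    """Convert the four captured numbers to floats."""
    return tuple(float(g.decode("ascii")) for g in match.groups())


def find_media_box(pdf_bytes):
    """Return the first /MediaBox found as (x0, y0, x1, y1), or None."""
    # Case 1: the box is visible in the uncompressed bytes.
    match = MEDIABOX_RE.search(pdf_bytes)
    if match:
        return _parse(match)

    # Case 2: the box is inside a FlateDecode-compressed stream.
    for stream in STREAM_RE.finditer(pdf_bytes):
        if b"/FlateDecode" not in stream.group("dict"):
            continue  # other filters are out of scope for this sketch
        try:
            # decompressobj tolerates the end-of-line bytes that may
            # trail the compressed data before "endstream".
            inflated = zlib.decompressobj().decompress(stream.group("data"))
        except zlib.error:
            continue  # not valid zlib data; skip this stream
        match = MEDIABOX_RE.search(inflated)
        if match:
            return _parse(match)
    return None
```

Calling `find_media_box(open("doc.pdf", "rb").read())` would return the first box it finds. A real fix would also have to handle a /Filter array with several chained filters and any /DecodeParms predictor, both of which this sketch ignores.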
I forgot that compressed streams had more options than binary objects. However, only one or two may suffice. FlateDecode covers my use case. LZWDecode might also appear in the wild. I would be surprised if somebody used ASCIIHex, ASCII85, or RunLength for this. The CCITTFax, JBIG2, DCT, and JPX filters can be ruled out, as they apply to images. That leaves Crypt, which I think could justifiably require manual decryption for Pandoc.
It's more complicated than that...
Even if you resolve this, you are still missing a few things when trying to process PDF page sizes:
1 - You should be using CropBox, when present, over MediaBox
2 - You need to consider Rotate, for when the page is rotated
3 - You need to consider UserUnit, for when the page is scaled
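As a rough illustration of how those three points combine, here is a small Python sketch. The function name `effective_page_size` and its parameters are hypothetical, and it assumes the box values, /Rotate, and /UserUnit have already been extracted from the page dictionary. It does not clip CropBox to MediaBox, which a complete implementation would need to do.

```python
def effective_page_size(media_box, crop_box=None, rotate=0, user_unit=1.0):
    """Return (width, height) of the visible page in points (1/72 inch).

    media_box, crop_box: (x0, y0, x1, y1) tuples in default user space units.
    rotate: page rotation in degrees, expected to be a multiple of 90.
    user_unit: size of one user space unit in multiples of 1/72 inch.
    """
    # 1 - Prefer CropBox over MediaBox when it is present.
    x0, y0, x1, y1 = crop_box if crop_box is not None else media_box

    # 3 - Scale by UserUnit (defaults to 1.0 when the key is absent).
    width = abs(x1 - x0) * user_unit
    height = abs(y1 - y0) * user_unit

    # 2 - A rotation of 90 or 270 degrees swaps width and height.
    if rotate % 180 == 90:
        width, height = height, width
    return width, height
```

For example, a US Letter MediaBox of `(0, 0, 612, 792)` with `rotate=90` gives a landscape size of 792 by 612 points.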
Seen on recent Pandoc versions, including 2.9.2.1. Pandoc seems to correctly identify the PDF /MediaBox when it is in an object, compressed object, or uncompressed stream, but not when it is in a compressed stream. Unfortunately, pdflatex generates compressed streams by default, and so LaTeX files that include LaTeX figures do not convert correctly.
See attached file for a minimum working example. pandoc-pdf-issue.zip
Compare doc.docx with doc.pdf to see the issue. Look in the Makefile to see how everything is built.
The fix is probably near pandoc/src/Text/Pandoc/ImageSize.hs, but I didn't dig much beyond spotting the /MediaBox.