pmaupin / pdfrw

pdfrw is a pure Python library that reads and writes PDFs

Wrong number of pages recognized #121

Open tisimst opened 6 years ago

tisimst commented 6 years ago

So, I have a friend who uses an app I developed that uses pdfrw to do various file manipulations. He contacted me about one particular file (originally 5 pages long) that he modified with PDF Expert to remove two unnecessary pages (leaving 3 pages) before using my app. Every PDF viewer I've opened the file with shows the correct number of pages: 3. However, pdfrw still reports that the 3-page file has 5 pages.

I'm attaching the two files so you can take a look. I tried to do my own digging to figure out what's going on, but I'm quite stumped. Any ideas and/or recommendations to identify the "correct" number of pages would be very helpful.

So, to be perfectly clear, the 5-page file is the original. The 3-page file is the result of using PDF Expert to remove the original pages 2 & 3.

Original: 13-All That-5pages.pdf Edited: 13-All That-3page edit.pdf

pmaupin commented 6 years ago

There must be a way in the outlines or something to tell PDF readers to ignore pages. Those "removed" pages are still there. For proof of that, run this, and take a look at dummy.pdf.

>>> import pdfrw
>>> o = pdfrw.PdfWriter()
>>> o.addpages(pdfrw.PdfReader('13-All.That-3page.edit.pdf').pages)
<pdfrw.pdfwriter.PdfWriter object at 0x7f18f57a4f50>
>>> o.write('dummy.pdf')

tisimst commented 6 years ago

Thanks for that insight, @pmaupin ! I get back all 5 pages, too! That was my assumption, that something was just being marked as "ignore me". I will take a look at the PDF spec to see what might be doing this. I'm quite surprised this is even possible, but no one consulted me on the matter when writing the spec ;-)
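One thing that could be compared while digging is the page count the page tree declares against the number of page objects actually reachable from it. In a PDF, each page-tree node lists its children under /Kids and states the number of leaf pages under /Count. The sketch below walks such a tree modelled as plain nested dicts; it is not pdfrw's API, the function names are made up for illustration, and whether PDF Expert produced a mismatch of this kind in the attached file is an untested assumption.

```python
def count_leaf_pages(node):
    """Count /Page leaves reachable from a page-tree node (plain nested dicts)."""
    if node.get("/Type") == "/Page":
        return 1
    # Intermediate /Pages node: recurse into every child listed under /Kids.
    return sum(count_leaf_pages(kid) for kid in node.get("/Kids", []))


def declared_count(node):
    """Return the /Count value the node claims, or 0 if it has none."""
    return int(node.get("/Count", 0))


# Hypothetical tree: five page objects are still listed, but /Count says 3.
example_tree = {
    "/Type": "/Pages",
    "/Count": 3,
    "/Kids": [{"/Type": "/Page"} for _ in range(5)],
}

print(declared_count(example_tree), count_leaf_pages(example_tree))
```

If the two numbers disagree for a real file, that would suggest the viewers and the library are counting different things.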

tisimst commented 6 years ago

Is there a way to do a raw dump of all the decrypted/decompressed content to a file? It appears that the PDF spec (I think starting with PDF 1.5) allows for "optional" content. Being able to see all the raw contents would help determine whether this is the mechanism being used.
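For a quick look without relying on any particular library, a crude standard-library sketch like the one below can inflate the streams in a file. It assumes the streams use plain FlateDecode (zlib) with no encryption, and it finds them with a simple regular expression rather than a real parser, so it can be fooled by binary data that happens to contain the word "endstream". The names `dump_streams` and `write_dump` are made up for illustration and are not part of pdfrw.

```python
import re
import zlib

# Captures everything between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def dump_streams(pdf_bytes):
    """Return every stream body, inflated when zlib can decode it."""
    bodies = []
    for match in STREAM_RE.finditer(pdf_bytes):
        body = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            bodies.append(zlib.decompressobj().decompress(body))
        except zlib.error:
            # Not zlib data (uncompressed or another filter): keep it raw.
            bodies.append(body)
    return bodies


def write_dump(src_path, dst_path):
    """Write all stream bodies from src_path into dst_path, one after another."""
    with open(src_path, "rb") as src:
        bodies = dump_streams(src.read())
    with open(dst_path, "wb") as dst:
        for index, body in enumerate(bodies):
            dst.write(b"\n%% --- stream %d ---\n" % index)
            dst.write(body)
```

Calling `write_dump('13-All.That-3page.edit.pdf', 'dump.txt')` would then give a file that can be searched for keys such as /Count or /Kids.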