pmaupin / pdfrw

pdfrw is a pure Python library that reads and writes PDFs

Potential Parsing Error for some PDFs #11

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Using the code from watermark.py as a sample, I attempted to overlay the 
attached PDF document onto another PDF document.

Minimal code reproduction example:

Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from pdfrw import PdfReader
>>> from pdfrw.buildxobj import pagexobj
>>>
>>> xobj = pagexobj(PdfReader('boverlay-new.pdf').getPage(0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python27\lib\site-packages\pdfrw\buildxobj.py", line 193, in pagexobj
    assert int(contents.Length) == len(contents.stream)
AttributeError: 'PdfArray' object has no attribute 'Length'

The overlay file opens in PDF readers (Foxit, Adobe), but pdfrw is unable 
to create a page object from the first page of the PDF. The overlay PDF was 
created using Adobe InDesign and is attached.

What is the expected output? What do you see instead?
No overlay is produced, and the exception above is generated instead.

What version of the product are you using? On what operating system?
I have the latest pdfrw, retrieved via SVN. Windows 7, 64-bit, using 
Python 2.7.3 32-bit.

Please provide any additional information below.

Original issue reported on code.google.com by dancas...@gmail.com on 4 Nov 2013 at 5:23

GoogleCodeExporter commented 9 years ago
I also wanted to note that I tried the code suggestions that I saw in Issue 8. 
Adding checks to determine whether page.Contents is already a PdfArray seems to 
prevent the error, but converting this PDF to a stream and using it as an 
overlay does not produce the expected result: the output document does not 
contain the watermark.
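
For readers following along, the kind of check described above can be sketched as follows. A PDF page's Contents entry may be either a single stream or an array of streams, which is why the assertion in the traceback fails when it receives a PdfArray. This is not pdfrw's code: the helper names are hypothetical, streams are modelled as plain dicts with a "stream" key, and the data is assumed to be already decoded.

```python
def content_streams(contents):
    """Return a page's content streams as a list.

    `contents` may be None, a single stream, or a list of streams.
    Streams here are stand-in dicts, not real pdfrw objects.
    """
    if contents is None:
        return []
    if isinstance(contents, (list, tuple)):
        return list(contents)
    return [contents]


def joined_content(contents):
    """Concatenate all content streams into one string.

    Streams are joined with a newline so that tokens at the end of one
    stream do not run into tokens at the start of the next.
    Assumes every stream is already uncompressed text.
    """
    return "\n".join(s["stream"] for s in content_streams(contents))


single = {"stream": "q 1 0 0 1 0 0 cm Q"}
split = [{"stream": "q 1 0 0 1 0 0 cm"}, {"stream": "Q"}]
print(joined_content(single))
print(joined_content(split))
```

A check like this only avoids the AttributeError. If the individual streams are still compressed, joining them will not yield usable page content, which would be consistent with the missing watermark.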

Original comment by dancas...@gmail.com on 4 Nov 2013 at 5:38

pmaupin commented 9 years ago

This is actually two issues. The first has been fixed on the master branch. The second is that the watermark file contains compression that pdfrw does not currently support. (This is part of issue #5.)
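
As a rough way to see which stream filters a file declares, a heuristic scan along the following lines may help. It is only a sketch: the function name is made up for illustration, it uses nothing from pdfrw, it looks only at /Filter entries visible in the raw bytes (dictionaries stored inside compressed object streams will be missed), and it does not tell you which filters pdfrw can or cannot handle.

```python
import re

# Matches "/Filter /Name" or "/Filter [/Name1 /Name2]" in raw PDF bytes.
_FILTER_RE = re.compile(rb"/Filter\s*(?:/([A-Za-z0-9]+)|\[([^\]]*)\])")
_NAME_RE = re.compile(rb"/([A-Za-z0-9]+)")


def filters_used(pdf_bytes):
    """Return a sorted list of filter names found in the raw bytes (heuristic)."""
    found = set()
    for single, array in _FILTER_RE.findall(pdf_bytes):
        if single:
            found.add(single.decode("ascii"))
        else:
            for name in _NAME_RE.findall(array):
                found.add(name.decode("ascii"))
    return sorted(found)


sample = (
    b"<< /Length 10 /Filter /FlateDecode >>\n"
    b"<< /Length 20 /Filter [/ASCII85Decode /LZWDecode] >>\n"
)
print(filters_used(sample))
```

On a real file, the bytes would come from something like open(path, "rb").read().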

You can use pdftk to uncompress the file:

pdftk sourcefile.pdf output destfile.pdf uncompress

and then the uncompressed watermark file will work.
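
If the uncompress step needs to run from a script, the pdftk command above could be wrapped roughly like this. This is a sketch that assumes Python 3 and that pdftk is installed and on the PATH; the function names are illustrative, and the file names are the placeholders from the command above.

```python
import shutil
import subprocess


def uncompress_command(src, dst):
    """Build the argument list for the pdftk command shown above."""
    return ["pdftk", src, "output", dst, "uncompress"]


def uncompress_pdf(src, dst):
    """Run pdftk to write an uncompressed copy of src to dst."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on PATH")
    # check=True raises CalledProcessError if pdftk exits with an error.
    subprocess.run(uncompress_command(src, dst), check=True)


print(" ".join(uncompress_command("sourcefile.pdf", "destfile.pdf")))
```

After uncompress_pdf("sourcefile.pdf", "destfile.pdf") succeeds, destfile.pdf would be the file to pass to PdfReader in place of the original watermark.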