Problem with "/Contents" in some pdf's

ralsina / pdfrw

Automatically exported from code.google.com/p/pdfrw


Problem with "/Contents" in some pdf's #8

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

In some PDFs, /Contents is an array containing a single object instead of a direct reference to that object.

E.g. /Contents [ 5 0 R ] instead of /Contents 5 0 R.

To fix this problem, I changed the pagexobj function in buildxobj.py, at line 190, to:

    if isinstance(page.Contents, PdfArray):
        contents = page.Contents[0]
    else:
        contents = page.Contents

Original issue reported on code.google.com by exp...@gmail.com on 18 Oct 2012 at 10:12
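
A more general way to handle both forms is to normalize /Contents to a list before doing anything else with it. The sketch below is illustrative only: `as_content_list` is a hypothetical helper, not part of pdfrw, and it assumes the array type behaves like a Python list (as pdfrw's PdfArray, a list subclass, should). Plain strings stand in for the stream objects.

```python
def as_content_list(contents):
    """Return /Contents as a list of streams, whatever form it was stored in."""
    if contents is None:            # page with no content at all
        return []
    if isinstance(contents, list):  # /Contents [ 5 0 R ] or [ 5 0 R 6 0 R ]
        return list(contents)
    return [contents]               # /Contents 5 0 R


# Plain strings stand in for stream objects here.
print(as_content_list("5 0 R"))
print(as_content_list(["5 0 R"]))
print(as_content_list(["5 0 R", "6 0 R"]))
```

With the value normalized this way, the single-object and array cases go through the same code path, and the caller can decide separately what to do when the list has more than one entry.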

GoogleCodeExporter commented 9 years ago

Hmmm, this will probably require some additional thought.

Some pages have more than one entry in their content arrays.  For those, it 
would not be useful to simply take the first content array element.

Original comment by pmaupin on 18 Oct 2012 at 10:32
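
To see why the first element alone is not enough: the PDF specification treats a /Contents array as if all of its streams were concatenated, in order, into a single stream. The sketch below uses made-up, uncompressed byte strings in place of real stream objects, so it only illustrates the idea.

```python
# Hypothetical page whose drawing operators are split across two streams.
content_streams = [
    b"q 1 0 0 1 72 720 cm",           # save graphics state, move the origin
    b"BT /F1 12 Tf (Hello) Tj ET Q",  # draw text, restore graphics state
]

# Taking only the first element drops everything in the later streams.
first_only = content_streams[0]

# Streams may only be split between tokens, so joining with whitespace is safe.
combined = b"\n".join(content_streams)

print(first_only)  # the text-drawing operators are missing
print(combined)    # the full page description
```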

GoogleCodeExporter commented 9 years ago

OK, I don't know whether this is the right solution, but at least it works with 
several content streams:

    if isinstance(page.Contents, PdfArray):
        if len(page.Contents) == 1:
            contents = page.Contents[0]
        else:
            # decompress and join multiple streams
            contlist = list(page.Contents)
            uncompress(contlist)
            stream = '\n'.join(c.stream for c in contlist)
            contents = PdfDict(
                Length=len(stream),
                stream=stream,
            )
    else:
        contents = page.Contents

Original comment by exp...@gmail.com on 17 Nov 2012 at 3:14

GoogleCodeExporter commented 9 years ago

That makes sense.  The main thing I don't like about it is that it doesn't play 
very well with pdfrw's lack of good compression filter support ;-)

On that note, we probably need to make it barf if the decompression fails.  I 
think the current version of uncompress returns False if it wasn't able to do 
its job -- that should probably cause an exception to be raised here.  
(Otherwise, it will concatenate a still-compressed content dictionary into the 
new dict.)

Thanks for reporting both the bug and most of the fix.

Pat

Original comment by pmaupin on 17 Nov 2012 at 3:57
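
The failure check described above could look roughly like the sketch below. It does not use pdfrw: plain dicts stand in for stream objects, `zlib` stands in for the decompression step, and `decode_stream`, `merge_content_streams` and `UnsupportedFilterError` are hypothetical names chosen for illustration. The point is only that a stream that cannot be decoded raises an exception instead of being joined in while still compressed.

```python
import zlib


class UnsupportedFilterError(ValueError):
    """Raised when a content stream cannot be decompressed."""


def decode_stream(obj):
    """Return the uncompressed bytes of one stand-in stream object."""
    filter_name = obj.get("Filter")
    if filter_name is None:               # already uncompressed
        return obj["stream"]
    if filter_name == "/FlateDecode":     # the only filter this sketch handles
        return zlib.decompress(obj["stream"])
    raise UnsupportedFilterError("cannot decode filter %s" % filter_name)


def merge_content_streams(streams):
    """Decode every stream and join them into one uncompressed stream."""
    data = b"\n".join(decode_stream(s) for s in streams)
    return {"Length": len(data), "stream": data}


page_contents = [
    {"Filter": "/FlateDecode", "stream": zlib.compress(b"q 1 0 0 1 72 720 cm")},
    {"Filter": None, "stream": b"BT /F1 12 Tf (Hello) Tj ET Q"},
]
merged = merge_content_streams(page_contents)
print(merged["Length"], merged["stream"])

# A filter the sketch cannot handle aborts the merge instead of corrupting it.
try:
    merge_content_streams([{"Filter": "/LZWDecode", "stream": b"..."}])
except UnsupportedFilterError as exc:
    print("refused to merge:", exc)
```

In the real code the equivalent step would be to check the return value of uncompress and raise if it reports failure, before building the merged dictionary.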