pdfsizeopt gets __main__.PdfXrefStreamError: duplicate obj 5

GoogleCodeExporter commented 9 years ago

Someone sent me a PDF that fails with the sequence below.  I pulled 
pdfsizeopt.py from svn today, 12 Oct 2012.  From other debug code, the multiply 
defined object seems to be /ID.  The file also seems to have a stream with no 
objects.  I promised not to post the file.  I have attached patches that might 
help if anyone else has this problem.
William

$ python pdfsizeopt.py 17MB.pdf 
info: This is pdfsizeopt.py- rUNKNOWN size=315256.
info: using Java for Multivalent: /usr/bin/java
info: loading PDF from: 17MB.pdf
info: loaded PDF of 16900425 bytes
info: using Ghostscript gs: GPL Ghostscript 9.06 (2012-08-08)
info: decompressing 40 bytes with Ghostscript /Filter/FlateDecode/DecodeParms 
<</Columns 5/Predictor 12>>
info: decompressing 9536 bytes with Ghostscript /Filter/FlateDecode/DecodeParms 
<</Columns 6/Predictor 12>>
Traceback (most recent call last):
  File "pdfsizeopt.py-", line 7831, in <module>
    main(sys.argv)
  File "pdfsizeopt.py-", line 7793, in main
    ).Load(file_name)
  File "pdfsizeopt.py-", line 3463, in Load
    data, do_ignore_generation_numbers=self.do_ignore_generation_numbers)
  File "pdfsizeopt.py-", line 3805, in ParseUsingXref
    xref_ofs, xref_obj_num, xref_generation)
  File "pdfsizeopt.py-", line 3640, in ParseUsingXrefStream
    raise PdfXrefStreamError('duplicate obj %d' % obj_num)
__main__.PdfXrefStreamError: duplicate obj 5

Original issue reported on code.google.com by william.bader@gmail.com on 12 Oct 2012 at 8:16

Attachments:

pdfsizeopt-12oct12.pat

GoogleCodeExporter commented 9 years ago

Thank you for the bug report and the patch.

I'm hesitating to accept the patch, because it makes pdfsizeopt too permissive, 
and I don't want pdfsizeopt to accept certain kinds of incorrect PDFs. It 
would help a lot if you could post an example PDF which you think pdfsizeopt 
should accept.

Original comment by pts...@gmail.com on 13 Oct 2012 at 12:06

Added labels: Priority-Medium
Removed labels: Priority-High

GoogleCodeExporter commented 9 years ago

Thanks for looking at the patch.  The person who sent me the file saw my name 
in some patches.  I suggested that he send the file to you.  In any case, I am 
attaching a new patch that is more careful.  In one of the places, instead of 
allowing any duplicate, it permits only /ID.  In the other places, instead of 
continuing silently, it prints a warning to stderr similar to the message that 
it used to raise.
I have a log below that shows the warnings.  If you want, if you send me a 
patch that prints more information, I can run it and let you know what happens.
The file has Creator "Adobe Acrobat 8.1 Combine Files", Producer "Acrobat 
9.3.1", Optimized "no", PDF version "1.6".

Object 5 starts <</ArtBox[42.5197 42.5197 496.063 722.834]
Object 6 starts <</Filter/FlateDecode/Length 619>>stream
Object 3251 starts <</Length 3645/Subtype/XML/Type/Metadata>>stream endstream
Object 11017 starts 
<</Author(Client1)/CreationDate(D:20120910153803+02'00')/Creator(Adobe Acrobat 
8.1 Combine Files)
and another object has stream with /Info 11017 0 R.

info: This is pdfsizeopt.py rUNKNOWN size=315564.
info: using Java for Multivalent: /usr/bin/java
info: loading PDF from: 17MB.pdf
info: loaded PDF of 16900425 bytes
info: using Ghostscript gs: GPL Ghostscript 9.06 (2012-08-08)
info: decompressing 40 bytes with Ghostscript /Filter/FlateDecode/DecodeParms 
<</Columns 5/Predictor 12>>
info: decompressing 9536 bytes with Ghostscript /Filter/FlateDecode/DecodeParms 
<</Columns 6/Predictor 12>>
warning: duplicate obj 5 in xref stream
warning: duplicate obj 6 in xref stream
warning: duplicate obj 3251 in xref stream
warning: duplicate obj 11017 in xref stream
warning: duplicate /ID in xref streams
info: found 11039 obj offsets and 364 obj streams in xref stream
warning: missing offset for xref stream obj 11408
warning: missing xref obj stream 11406
warning: missing xref obj stream 11407
info: separated to 10676 objs + xref + trailer
info: found 0 Type1 fonts loaded
info: found 34 Type1C fonts loaded
info: writing Type1CParser (73664 font bytes) to: pso.conv.parse.tmp.ps
info: executing Type1CParser with Ghostscript: gs -q -dNOPAUSE -dBATCH 
-sDEVICE=nullpage -sDataFile=pso.conv.parsedata.tmp.ps -f pso.conv.parse.tmp.ps
Type1CParser: using interpreter GPL Ghostscript 906 20120808
Type1CParser: all OK

Original comment by william.bader@gmail.com on 13 Oct 2012 at 2:26

Attachments:

pdfsizeopt-20121012.pat

GoogleCodeExporter commented 9 years ago

Thank you very much for the modified and restricted patch.

Without an example PDF I don't have enough information to decide whether the 
patch is an improvement in the general case. (It's definitely an improvement 
for this specific PDF.) So if you can't attach an example PDF, I'm ready to 
apply your patch, but the functionality could be enabled by a command-line flag 
(--do-permissive-obj-parsing) disabled by default. Would this work for you?

Original comment by pts...@gmail.com on 13 Oct 2012 at 11:46

GoogleCodeExporter commented 9 years ago

It is a file that someone sent me.  I do not need it to work, and I have asked 
him to send the file to you.  Since he apparently made the PDF with a recent 
Adobe product, I suspect that other people will have the same problem.  Maybe 
it is better to wait until someone else who is willing to send a PDF has the 
problem.

Original comment by william.bader@gmail.com on 14 Oct 2012 at 2:24

GoogleCodeExporter commented 9 years ago

I have permission to send you the PDF privately for the purpose of checking the 
patches.  Is that OK?
William

Original comment by william.bader@gmail.com on 14 Oct 2012 at 7:13

GoogleCodeExporter commented 9 years ago

Thank you very much for the detailed bug report, the follow-up information and 
the several helpful patches.

Based on the provided example PDF I diagnosed the problem, identified several 
bugs in the xref stream parsing code of pdfsizeopt, and fixed them in r220. Please 
download the latest pdfsizeopt.py and check if it works correctly. (It works 
for me.)

It turned out that the example PDF was correct, but pdfsizeopt was parsing it 
incorrectly when both xref streams and /Prev references were involved. I've 
read the relevant sections (3.4.5 and 3.4.7) of the PDF 1.7 reference again, 
and modified pdfsizeopt so that now it works according to the specification.
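
For readers who hit the same error, here is a minimal sketch of the precedence rule as I understand it from those sections: when xref sections are chained through /Prev, the newest section is read first, and an object number that reappears in an older section is skipped rather than treated as a duplicate. This is not the code from r220. The function name `merge_xref_sections` and the `sections` data layout are hypothetical, chosen only for illustration.

```python
def merge_xref_sections(sections, start_ofs):
    """Merge chained xref sections, newest first.

    Hypothetical layout (not pdfsizeopt's internal one):
      sections: dict mapping a section's file offset to (entries, prev_ofs)
      entries: dict mapping obj_num to whatever describes that object
      prev_ofs: offset of the older section named by /Prev, or None
    """
    merged = {}
    visited = set()
    ofs = start_ofs
    while ofs is not None:
        if ofs in visited:
            # A malformed file could point /Prev back at an earlier section.
            raise ValueError('loop in /Prev chain at offset %d' % ofs)
        visited.add(ofs)
        entries, prev_ofs = sections[ofs]
        for obj_num, entry in entries.items():
            # Keep the entry from the newest section; an older entry for
            # the same obj_num is superseded, not an error.
            merged.setdefault(obj_num, entry)
        ofs = prev_ofs
    return merged
```

With this rule, an object such as obj 5 that is listed in both the newest section and an older one resolves to the newest entry instead of raising a duplicate error.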

Original comment by pts...@gmail.com on 14 Oct 2012 at 10:38

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Thanks, 220 works for me.
Regards, William

Original comment by william.bader@gmail.com on 15 Oct 2012 at 12:22
