PDF/A compliance: ID in file trailer missing or incomplete

GoogleCodeExporter commented 9 years ago

What command do you run to optimize the PDF?
user@ubuntu804server:~/pdfsizeopt$ ./pdfsizeopt.py test.pdf

What does pdfsizeopt display when running the command above?
info: This is pdfsizeopt.py r102.
info: loading PDF from: test.pdf
info: loaded PDF of 16776 bytes
info: separated to 19 objs
info: found 1 Type1 fonts loaded
info: writing Type1CConverter (9062 font bytes) to: pso.conv.tmp.ps
info: executing Type1CConverter with Ghostscript: gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dColorConversionStrategy=/LeaveColorUnchanged -sOutputFile=pso.conv.tmp.pdf -f pso.conv.tmp.ps
Type1CConverter: using interpreter GPL Ghostscript 861 20071121
Type1CConverter: converting font /PGWBAM+CMR10 to /Obj0000000016
Type1CConverter: all OK
info: loading PDF from: pso.conv.tmp.pdf
info: loaded PDF of 3943 bytes
info: separated to 14 objs
info: found 1 fonts in GS output
info: optimized total Type1 font size 9035 to Type1C font size 895 (10%)
info: optimized Type1 font XObject 16,15: new size=1132 (12%)
info: found 1 Type1C fonts loaded
info: writing Type1CParser (909 font bytes) to: pso.conv.parse.tmp.ps
info: executing Type1CParser with Ghostscript: gs -q -dNOPAUSE -dBATCH -sDEVICE=nullpage -sDataFile=pso.conv.parsedata.tmp.ps -f pso.conv.parse.tmp.ps
Type1CParser: using interpreter GPL Ghostscript 861 20071121
Type1CParser: all OK
info: parsed 1 Type1C fonts
info: writing Multivalent input PDF: pso.conv.mi.tmp.pdf
info: saving PDF with 18 objs to: pso.conv.mi.tmp.pdf
info: generated 8290 bytes (49%)
info: executing Multivalent to optimize PDF: java -cp /home/user/pdfsizeopt/Multivalent.jar tool.pdf.Compress pso.conv.mi.tmp.pdf
file:/home/user/pdfsizeopt/pso.conv.mi.tmp.pdf, 8290 bytes
PDF 1.4, producer=pdfTeX, creator=pdfTeX
additional compression may be possible with:
         -compact
=> new length = 7963, saved 3%, elapsed time = 0 sec
info: Multivalent generated pso.conv.mi.tmp-o.pdf of 7984 bytes (96%)
info: compressed xref stream from 40 to 157 bytes (393%)
info: optimized to 7906 bytes after Multivalent (99%)
info: saving PDF to: test.psom.pdf
info: generated 7906 bytes (47%)

What's wrong with the optimized PDF?
It fails to validate as PDF/A-1b (using acrobat 7.1.0 for the validation).
I get the message:
ID in file trailer missing or incomplete

Original issue reported on code.google.com by lev.bishop on 1 Nov 2009 at 3:32

Attachments:

test.pdf

GoogleCodeExporter commented 9 years ago

Patch:
Index: pdfsizeopt.py
===================================================================
--- pdfsizeopt.py       (revision 102)
+++ pdfsizeopt.py       (working copy)
@@ -3284,7 +3284,7 @@
       trailer_obj.Set('Compress', None)  # emitted by Multivalent.jar
       # Emitted by Multivalent.jar etc., see section 10.3 in
       # pdf_reference_1-7.pdf .
-      trailer_obj.Set('ID', None)
+      # trailer_obj.Set('ID', None)
       assert trailer_obj.head.startswith('<<')
       assert trailer_obj.head.endswith('>>')
       output.append('trailer\n%s\n' % trailer_obj.head)
@@ -5777,7 +5777,7 @@
         # Please note that we save the space of the removed /ID and /Compress
         # below, because /Type/XRef is usually the last object, so we don't
         # need to add padding.
-        pdf_obj.Set('ID', None)
+       # pdf_obj.Set('ID', None)
         pdf_obj.Set('Compress', None)
         if pdf_obj.Get('Index') != None:
           raise NotImplementedError('unexpected /Index in xref object')

Original comment by lev.bishop on 1 Nov 2009 at 5:18

GoogleCodeExporter commented 9 years ago

Thank you for the bug report and the patch.

pdfsizeopt.py doesn't aim for PDF/A compliance. But if all you need is the /ID,
please add a command-line flag that keeps the /ID, turned off by default.
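
A minimal sketch of what such an opt-in flag could look like. The flag name `--keep-trailer-id` and the helper `clean_trailer` are hypothetical and are not taken from pdfsizeopt.py; the sketch also uses `argparse` and a plain dict rather than the script's own option parsing and `trailer_obj` API.

```python
import argparse


def parse_args(argv):
    """Parse the command line; the flag name below is a hypothetical choice."""
    parser = argparse.ArgumentParser(description='sketch only')
    parser.add_argument('--keep-trailer-id', action='store_true', default=False,
                        help='keep /ID in the trailer (off by default)')
    parser.add_argument('input')
    return parser.parse_args(argv)


def clean_trailer(trailer, keep_id=False):
    """Return a copy of a trailer dict with unwanted keys removed."""
    cleaned = dict(trailer)
    cleaned.pop('Compress', None)  # emitted by Multivalent.jar
    if not keep_id:
        cleaned.pop('ID', None)  # dropped by default to save space
    return cleaned
```

With the flag absent, the behavior matches the current code (the /ID is removed); with it present, the /ID survives and only /Compress is dropped.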

Original comment by pts...@gmail.com on 15 Nov 2009 at 9:03

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

In addition to the /ID, PDF/A requires PDF version 1.4 or lower. Therefore, the
-old option should be passed to tool.pdf.Compress. However, this causes problems
that I don't yet understand, so I am still investigating.

Original comment by lev.bishop on 25 Jan 2010 at 4:57

GoogleCodeExporter commented 9 years ago

It would be nice to add PDF/A compatibility to pdfsizeopt's output -- provided 
that its input PDF is also compliant with PDF/A, and the user explicitly asks for 
PDF/A output by specifying a command-line flag. However, I definitely don't 
want it enabled by default, because it increases the file size.

I won't start adding this feature on my own. If you'd like to contribute, please 
attach some (preferably tiny) example PDFs to this bug, for which pdfsizeopt.py 
currently doesn't produce PDF/A. I'm closing this bug until you reply.

Do you have software that checks for PDF/A compatibility? Is there free 
software for that?

Original comment by pts...@gmail.com on 11 Feb 2011 at 2:05

Added labels: Type-Enhancement
Removed labels: Type-Defect

GoogleCodeExporter commented 9 years ago

Thanks for considering this. I would be glad to work with you on getting it 
working. I attach a small file that verifies as PDF/A-1b (using the Acrobat 
9.4.1 preflight tool), the result of running pdfsizeopt --use-multivalent=false 
on this, and the resulting PDF/A-1b conformance failure report from Acrobat. 
The problems are: 
1) ID in file trailer missing or incomplete
2) Syntax problem: Stream dictionary improperly formatted
3) Syntax problem: Stream dictionary has improper length entry
4) Syntax problem: Indirect object “endobj” keyword not preceded by an EOL marker
5) Indirect object “endobj” keyword not followed by an EOL marker
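
Complaints 3) to 5) concern how a stream object is laid out on disk. The following is a minimal sketch of a serializer that puts an EOL marker around each keyword and makes /Length equal to the exact byte count of the stream data. The function `serialize_stream_obj` is hypothetical, not part of pdfsizeopt.py, and whether this exact layout satisfies Acrobat preflight is an assumption that would need to be checked against the tool.

```python
def serialize_stream_obj(obj_num, data, extra_entries=b''):
    """Serialize one stream object with explicit EOL markers.

    data: the already-encoded stream bytes; /Length is its exact byte count.
    extra_entries: raw dictionary entries such as b'/Filter/FlateDecode'.
    """
    head = b'<</Length %d%s>>' % (len(data), extra_entries)
    return b''.join([
        b'%d 0 obj\n' % obj_num,
        head, b'\n',
        b'stream\n',       # the stream keyword is followed by an EOL
        data,
        b'\nendstream\n',  # this EOL before endstream is not counted in /Length
        b'endobj\n',       # endobj is preceded and followed by an EOL
    ])
```

The extra newlines cost a few bytes per object, which is consistent with the concern that PDF/A output is larger and should stay opt-in.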

As I said in the previous comment, with --use-multivalent=true it would be 
necessary to give the -old option to Multivalent, but that breaks other parts 
of pdfsizeopt.py. Perhaps as a first step it would be enough to support only 
--use-multivalent=false for PDF/A.

I have Acrobat Pro 9.4.1 so I can certainly verify any fixes you implement. I'm 
not aware of any free conformance tools, but I can't say that I've looked very 
hard.

Original comment by lev.bishop on 11 Feb 2011 at 3:26

Attachments:

GoogleCodeExporter commented 9 years ago

Sorry, here's the pdfsizeopt output that I forgot to attach

Original comment by lev.bishop on 11 Feb 2011 at 3:28

Attachments:

test1.pso.pdf

GoogleCodeExporter commented 9 years ago

Cool, thanks for the details.

I'm happy to make changes to pdfsizeopt.py so that Acrobat preflight won't 
complain. But since I don't have that software, the most straightforward way is 
for us to prepare test input and output files.

I'll implement solutions to complaints 1) ... 5). Stay tuned for an update to 
this bug.

I'll add support to pdfsizeopt.py for generating xref streams, no matter if 
Multivalent is used.

I'll make sure that pdfsizeopt won't use %PDF-1.5 features, and that it fails 
if the input is newer than %PDF-1.4.
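
A minimal sketch of such an input check, assuming the version is read only from the `%PDF-` header line. The function `check_pdf_version` is hypothetical, and the sketch ignores a /Version entry in the document catalog, which can override the header.

```python
import re


def check_pdf_version(data, max_version=(1, 4)):
    """Return the (major, minor) header version, or raise if it is too new."""
    match = re.match(rb'%PDF-(\d+)\.(\d+)', data)
    if not match:
        raise ValueError('missing %PDF- header')
    version = (int(match.group(1)), int(match.group(2)))
    if version > max_version:
        raise ValueError('input is PDF %d.%d, newer than %d.%d'
                         % (version + max_version))
    return version
```

Calling `check_pdf_version(open('test.pdf', 'rb').read(16))` before optimizing would reject the file early instead of producing output that cannot validate.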

I'll try to figure out what kind of /ID should be added if there was none.

I'll also patch pdfsizeopt.py so that it accepts the output of Multivalent 
tool.pdf.Compress -old.

Original comment by pts...@gmail.com on 11 Feb 2011 at 4:41

GoogleCodeExporter commented 9 years ago

It's probably not necessary to add an /ID if there was none, since this would 
mean that the input already did not conform to PDF/A.

Original comment by lev.bishop on 11 Feb 2011 at 4:51

GoogleCodeExporter commented 9 years ago

> It's probably not necessary to add an /ID if there was none, since this would mean that the input already did not conform to PDF/A.

You are correct that it's not necessary. But I'd do so anyway, because it's 
just a simple modification to pdfsizeopt.py, and can be helpful just in case.

Original comment by pts...@gmail.com on 11 Feb 2011 at 4:53

GoogleCodeExporter commented 9 years ago

Could you please check whether Acrobat preflight accepts /ID[()()] in the trailer 
without complaining? What about /ID[(A)(A)]?

Original comment by pts...@gmail.com on 11 Feb 2011 at 5:14

GoogleCodeExporter commented 9 years ago

Sorry, it took me a while to figure out how to do this.
/ID[()()]   : not accepted
/ID[(A)(A)] : accepted
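
Since empty strings are rejected, a generated /ID needs non-empty content. A minimal sketch of one way to build it, assuming an MD5 digest of the file bytes is acceptable as the identifier. The function `make_trailer_id` is hypothetical, and it writes hex strings, whereas only the literal-string form /ID[(A)(A)] was actually tested above, so acceptance of this form by Acrobat preflight is an assumption.

```python
import hashlib


def make_trailer_id(pdf_bytes):
    """Build a non-empty /ID trailer entry from an MD5 digest of the file bytes."""
    digest = hashlib.md5(pdf_bytes).hexdigest().upper()
    # Both array elements are identical here, as in the accepted /ID[(A)(A)] test.
    return '/ID[<%s><%s>]' % (digest, digest)
```

A digest is 16 bytes, so this form adds roughly 70 bytes to the trailer; a shorter fixed string would be smaller but would no longer identify the file.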

Original comment by lev.bishop on 16 Feb 2011 at 8:04

GoogleCodeExporter commented 9 years ago

Issue 38 has been merged into this issue.

Original comment by pts...@gmail.com on 4 Mar 2011 at 1:43

sudharakab / pdfsizeopt

PDF/A compliance: ID in file trailer missing or incomplete #13