mkl-public / testarea-pdfbox2

Test area for public PDFBox v2 issues on stackoverflow etc
Apache License 2.0

OptimizeAfterMerge breaks PDF file on native Firefox pdf reader #16

Closed elvisbegovic closed 7 months ago

elvisbegovic commented 7 months ago

Hi there,

As the issue title says:

Reproduction:

  1. Start from this PDF: form-empty.pdf

  2. I create a merged PDF and apply the optimize() method to it. The result is form-filled.pdf. When you open it in Chrome you see 22 pages, but in Firefox you see only 4 pages. When you open this PDF file in Adobe Reader you see 22 pages, but if you scroll down you get error 14 after page 3:

     [image: screenshot of the Adobe Reader error]

  3. If I create the same merged PDF without the optimize() method, Firefox reads it correctly with 22 pages: form-filled-NO-optimized.pdf

Temporary workaround:

We cannot store a file this large without compression, and optimize() reduces the PDF size by a factor of 5. So we keep optimizing and ask users to read the PDF in Adobe Reader or in Chrome/Edge.

Expected behavior:

It seems the optimization method is too aggressive. How can we enhance optimize() so that it does not break the Firefox reader, or how should we adapt our initial form-empty.pdf file to avoid this situation? It seems my initial PDF form-empty.pdf was not created correctly, maybe due to copy/paste of AcroForm fields... Can this be caught or fixed by the optimize method?

Additional info:

We have other PDFs similar to form-filled2.pdf that work with the optimize() method, but for this one I cannot understand why it breaks Firefox's built-in PDF reader.

mkl-public commented 7 months ago

This is exactly what I meant in the words of warning at the end of the associated Stack Overflow question:

On the other hand this optimizer might already be overly eager in some cases because some duplicates might be needed as separate objects for PDF viewers to accept each instance as an individual entity.

One case in which duplicates are needed as separate objects is pages: multiple identical page objects are forbidden in the PDF page tree. In your case, though, there are a number of identical page objects: every even page is the same.

In a context in which you expect to have multiple identical pages, make those page copies differ a bit. You can do so by adding a custom entry to the page dictionary containing the page number as value of a custom key name.
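To illustrate why such a marker helps, here is a minimal sketch in plain Java. It is a toy model only, not the optimizer from this repository and not PDFBox code: pages are modelled as plain maps, `deduplicate` stands in for an optimizer that keeps one instance per group of equal objects, and the key name `X-PageNumber` is a hypothetical choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: a "page" is a plain map, not a PDFBox COSDictionary. */
final class PageDedupSketch {

    /** Hypothetical custom key name chosen for this sketch. */
    static final String MARKER_KEY = "X-PageNumber";

    /** Builds a toy page dictionary with the given content label. */
    static Map<String, Object> page(String contents) {
        Map<String, Object> page = new LinkedHashMap<>();
        page.put("Type", "Page");
        page.put("Contents", contents);
        return page;
    }

    /**
     * Mimics an optimizer that keeps one instance per group of equal objects:
     * each entry of the result refers to the first equal page seen so far.
     */
    static List<Map<String, Object>> deduplicate(List<Map<String, Object>> pages) {
        Map<Map<String, Object>, Map<String, Object>> canonical = new HashMap<>();
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> page : pages) {
            canonical.putIfAbsent(page, page);
            result.add(canonical.get(page));
        }
        return result;
    }

    /** Returns copies of the pages, each carrying its 1-based page number. */
    static List<Map<String, Object>> markPages(List<Map<String, Object>> pages) {
        List<Map<String, Object>> marked = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            Map<String, Object> copy = new LinkedHashMap<>(pages.get(i));
            copy.put(MARKER_KEY, i + 1);
            marked.add(copy);
        }
        return marked;
    }

    /** Counts how many distinct object instances the page list refers to. */
    static int distinctInstances(List<Map<String, Object>> pages) {
        Map<Map<String, Object>, Boolean> seen = new IdentityHashMap<>();
        for (Map<String, Object> page : pages) {
            seen.put(page, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Pages 2 and 4 have equal content but start out as separate objects.
        List<Map<String, Object>> pages = Arrays.asList(
                page("form A"), page("notes"), page("form B"), page("notes"));

        // Unmarked: the two "notes" pages collapse into one shared object.
        System.out.println(distinctInstances(deduplicate(pages)));            // 3
        // Marked: every page differs by its page number, so nothing collapses.
        System.out.println(distinctInstances(deduplicate(markPages(pages)))); // 4
    }
}
```

In PDFBox 2 terms, the marking step would presumably amount to something like `page.getCOSObject().setInt(COSName.getPDFName("X-PageNumber"), pageNumber)` for each page before running the optimizer; the key name is again only an assumption.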

Please comment on whether or not that sufficed in your case.

(Another such context coming to my mind would be annotations. But you seem to have flattened all annotations beforehand.)

elvisbegovic commented 7 months ago

@mkl-public you made my day! Thank you, it is OK now. If you are interested, here is how I proceeded: https://github.com/mkl-public/testarea-pdfbox2/pull/17

mkl-public commented 7 months ago

Resolved by your PR #17.