py-pdf / fpdf2

Simple PDF generation for Python
https://py-pdf.github.io/fpdf2/
GNU Lesser General Public License v3.0

Acrobat: An error exists on this page. (with multiple SVG imports) #960

Closed gmischler closed 5 months ago

gmischler commented 11 months ago

While I was implementing "image paragraphs" for text regions, Acrobat Reader suddenly started complaining about my test file. Of course they want you to buy their other software to create PDFs, so the message is deliberately unhelpful.

Error details

I could boil it down to sections containing imported SVG data. Strangely, it takes a certain amount of data before the error triggers. With the SVG logo, it takes either three of them on one page, or two plus a bunch of text (at least those are the combinations I found). None of the other viewers and validators that I have easy access to report any errors.

Processing the file with qpdf and "--normalize-content=y" (or "--qdf") fixes the problem. But I was unable to glean any useful information from a comparison. I've seen reports that Adobe Preflight gives useful and detailed error reports. So if anyone has that available, it might lead us somewhere.
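
For anyone who wants to script this workaround, here is a minimal sketch that rewrites a PDF through qpdf with the same option. It assumes the qpdf executable is on the PATH; the helper names qpdf_normalize_command and normalize_pdf are made up for this example and are not part of fpdf2 or qpdf.

```python
import shutil
import subprocess


def qpdf_normalize_command(src, dst):
    """Build the qpdf command line that rewrites src into dst with normalized content streams."""
    return ["qpdf", "--normalize-content=y", src, dst]


def normalize_pdf(src, dst):
    """Run qpdf if it is installed; return True when the rewritten file was produced."""
    if shutil.which("qpdf") is None:
        return False  # qpdf is not available on this machine
    # qpdf exits with 0 on success and with 3 when it succeeded with warnings
    result = subprocess.run(qpdf_normalize_command(src, dst), check=False)
    return result.returncode in (0, 3)


print(qpdf_normalize_command("acro-svg.pdf", "acro-svg-normalized.pdf"))
```

The output file name is arbitrary. Comparing the two files afterwards is still a manual step.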

Minimal code

from fpdf import FPDF

img_file = "fpdf2/test/svg/svg_sources/SVG_logo.svg"
pdf = FPDF()
pdf.add_page()
pdf.image(img_file, w=30, h=30)
pdf.image(img_file, w=30, h=30)
pdf.image(img_file, w=30, h=30)
pdf.output("acro-svg.pdf")

(for some reason, github doesn't want me to include PDF files here...)

Environment

Lucas-C commented 11 months ago

Thank you for the detailed report @gmischler!

I made some tests this morning:

gmischler commented 11 months ago

It's probably not something in the SVG data itself, but in how it interacts with compression. Adding the same SVG several times causes a lot of repetition in the content stream (the copies end up identical except for the placement/scaling transform), resulting in a very high compression ratio. Apparently we're not handling that situation in exactly the way Acrobat Reader expects.
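
The effect of repetition on the compression ratio can be illustrated with zlib alone. This is only a sketch: the block below is a made-up stand-in for the drawing commands of one rendered SVG, and the copies are exactly identical, whereas in a real page they would differ by their transform.

```python
import zlib

# Hypothetical stand-in for the drawing commands of one rendered SVG
block = "".join(
    f"{(i * 7) % 97}.{i % 10} {(i * 13) % 89}.{i % 7} l\n" for i in range(200)
).encode("ascii")


def ratio(data):
    """Compressed size as a fraction of the uncompressed size (lower means more compression)."""
    return len(zlib.compress(data)) / len(data)


single = ratio(block)
triple = ratio(block * 3)
print(f"one copy:     {single:.2%}")
print(f"three copies: {triple:.2%}")
```

The second and third copies fit inside the deflate window, so they are encoded almost entirely as back-references and the ratio for three copies comes out noticeably lower.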

I've found that some other software sometimes adds a "Length1" value to content streams. By the specs this is only meant (and mandatory) for compressed font data, where it gives the uncompressed size of the data. I experimented with adding that to the content stream of my example file, but didn't see any change in behaviour. Given that it is off-spec, that isn't really a surprise, but it was worth a shot.

Acrobat reader seems to issue (or not) those warnings depending on arbitrary criteria (including the Windows version, according to some reports). So it may well be that there's something in our use of compression it generally doesn't like, but only complains about when the compression rate is particularly high.

Lucas-C commented 10 months ago

In fpdf2, PDF pages are compressed using /FlateDecode implemented with zlib.compress(): https://github.com/py-pdf/fpdf2/blob/2.7.6/fpdf/syntax.py#L200

Have you tried displaying zlib.ZLIB_VERSION & zlib.ZLIB_RUNTIME_VERSION? Maybe this issue could be related to the version of the underlying zlib library used?
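
A minimal way to display both values, using only the standard library:

```python
import zlib

# Version of zlib that the Python zlib module was compiled against
print(f"{zlib.ZLIB_VERSION=}")
# Version of zlib that is actually loaded at runtime
print(f"{zlib.ZLIB_RUNTIME_VERSION=}")
```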

I'd be curious to know if this problem could happen with other PDF readers... Adobe Acrobat Reader being closed-source, it won't be easy to figure out what the root problem is...

Lucas-C commented 10 months ago

I have been digging a little deeper into the resulting zlib compressed streams, but could not find much...

import zlib
from fpdf import FPDF
from pypdf import PdfReader

for svg_file in ("test/svg/svg_sources/arcs01.svg", "test/svg/svg_sources/arcs02.svg"):
  print(svg_file)

  pdf = FPDF()
  pdf.add_page()
  pdf.image(svg_file, w=30, h=30)
  pdf.image(svg_file, w=30, h=30)
  pdf.image(svg_file, w=30, h=30)
  pdf.output("issue_960.pdf")

  reader = PdfReader("issue_960.pdf")
  compressed_stream = reader.pages[0]["/Contents"]._data

  # cf. https://www.rfc-editor.org/rfc/rfc1950
  cmf, flg = compressed_stream[0], compressed_stream[1]
  print(f"* cmf=0x{cmf:X} flg=0x{flg:X}")  # 0x78 0x9C => zlib: Default Compression

  decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS)
  decompressed_data = decompressor.decompress(compressed_stream)
  print(f"* length of decompressed data: {len(decompressed_data)} bytes")
  print(f"* compression ratio: {100*len(compressed_stream)/len(decompressed_data):.2f}%")
  print(f"* end of the compressed data stream reached? {decompressor.eof=}")
  print(f"* {decompressor.unconsumed_tail=}")
  print(f"* {decompressor.unused_data=}")
  print()

Output:

test/svg/svg_sources/arcs01.svg
* cmf=0x78 flg=0x9C
* length of decompressed data: 2585 bytes
* compression ratio: 17.45%
* end of the compressed data stream reached? decompressor.eof=True
* decompressor.unconsumed_tail=b''
* decompressor.unused_data=b''

test/svg/svg_sources/arcs02.svg
* cmf=0x78 flg=0x9C
* length of decompressed data: 7808 bytes
* compression ratio: 4.85%
* end of the compressed data stream reached? decompressor.eof=True
* decompressor.unconsumed_tail=b''
* decompressor.unused_data=b''
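
For reference, the two header bytes printed above can be decoded field by field following RFC 1950. This is a small sketch; parse_zlib_header is a helper name made up for this example.

```python
def parse_zlib_header(cmf, flg):
    """Split the two zlib header bytes into their RFC 1950 fields."""
    return {
        "method": cmf & 0x0F,  # CM: 8 means deflate
        "window_bits": (cmf >> 4) + 8,  # CINFO + 8: log2 of the window size
        "preset_dict": bool(flg & 0x20),  # FDICT: a preset dictionary follows
        "level_hint": flg >> 6,  # FLEVEL: 0 = fastest .. 3 = maximum, 2 = default
        "checksum_ok": (cmf * 256 + flg) % 31 == 0,  # FCHECK makes this a multiple of 31
    }


header = parse_zlib_header(0x78, 0x9C)
print(header)
```

Both files share the same header (deflate, 32 KiB window, no preset dictionary, default level hint), so the header alone does not distinguish the working file from the failing one.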

The compression ratio of the smallest "problematic" SVG file (test/svg/svg_sources/arcs02.svg) is lower than test/svg/svg_sources/arcs01.svg which does not cause any problem, so it's not simply a matter of this ratio being "too high".

You are right @gmischler, this problem really seems correlated with a high compression ratio being used:

Compression ratio for test/svg/svg_sources/Ghostscript_escher.svg (OK): 29.12%
Compression ratio for test/svg/svg_sources/Ghostscript_colorcircle.svg (OK): 33.11%
Compression ratio for test/svg/svg_sources/cubic01.svg (OK): 9.62%
Compression ratio for test/svg/svg_sources/quad01.svg (OK): 11.95%
Compression ratio for test/svg/svg_sources/arcs01.svg (OK): 17.45%

Compression ratio for test/svg/svg_sources/cubic02.svg (KO): 7.66%
Compression ratio for test/svg/svg_sources/SVG_logo.svg (KO): 6.07%
Compression ratio for test/svg/svg_sources/arcs02.svg (KO): 4.85%

Lucas-C commented 10 months ago

I suspect that Adobe Acrobat Reader's decompression function is implemented a bit like this, for "safety" reasons:

import zlib

def acrobat_decompress(compressed_data, growth_max=12):
    max_length = len(compressed_data) * growth_max
    decompressor = zlib.decompressobj()
    decompressed_data = decompressor.decompress(compressed_data, max_length=max_length)
    if not decompressor.eof:
        raise RuntimeError(f"Uncompressed content is at least {growth_max} times bigger than compressed data")
    return decompressed_data

Of course, len(compressed_data) * 12 is just a guess, who knows what the actual implementation sets as the limit...
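
As a quick sanity check, the guessed factor of 12 can be compared with the ratios listed earlier in this thread. This is only arithmetic on those eight numbers, under the assumption that the limit is a pure size ratio; it says nothing about what Acrobat actually does.

```python
GROWTH_MAX = 12  # same guess as in acrobat_decompress()
threshold = 100 / GROWTH_MAX  # compressed/uncompressed in percent, about 8.33%

# Ratios (in percent) reported earlier in this thread
ok_ratios = {
    "Ghostscript_escher.svg": 29.12,
    "Ghostscript_colorcircle.svg": 33.11,
    "cubic01.svg": 9.62,
    "quad01.svg": 11.95,
    "arcs01.svg": 17.45,
}
ko_ratios = {
    "cubic02.svg": 7.66,
    "SVG_logo.svg": 6.07,
    "arcs02.svg": 4.85,
}

ok_above = all(r > threshold for r in ok_ratios.values())
ko_below = all(r < threshold for r in ko_ratios.values())
print(f"threshold: {threshold:.2f}%")
print(f"all OK files above the threshold: {ok_above}")
print(f"all KO files below the threshold: {ko_below}")
```

Any limit between 7.66% and 9.62% would fit these eight data points equally well, so this does not pin down the value.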

Lucas-C commented 10 months ago

I made some extra tests with several source SVG files:

So it's not just a maximum ratio that is taken into consideration by Acrobat...

Lucas-C commented 10 months ago

Maybe fpdf2 should produce a warning when a content stream is compressed with a compression ratio lower than 10%?
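
A rough sketch of what such a check could look like, using only zlib and the standard warnings module. The function compress_with_warning and its min_ratio parameter are hypothetical names for illustration; nothing like this exists in fpdf2 as far as this thread shows.

```python
import warnings
import zlib


def compress_with_warning(data, min_ratio=0.10):
    """Compress data with zlib and warn when the result is unusually small.

    Hypothetical helper for illustration only; not part of fpdf2.
    """
    compressed = zlib.compress(data)
    if data and len(compressed) / len(data) < min_ratio:
        warnings.warn(
            f"Content stream compressed to {len(compressed) / len(data):.1%} of its size; "
            "Acrobat Reader may report an error on this page",
            UserWarning,
            stacklevel=2,
        )
    return compressed


# Highly repetitive content triggers the warning
compress_with_warning(b"0 0 m 1 1 l S\n" * 500)
```

The 10% default simply mirrors the figure suggested above.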

gmischler commented 10 months ago

Zlib comes with Python. My 3.10 installation uses 1.2.11, but I doubt that this makes any difference in the output.

A warning from fpdf2 seems a bit pointless as long as we don't know what the problem is. What is the user supposed to do with it?

Do all the affected files contain SVG data? I've tried to reproduce the error with other repetitive content subject to high compression, with no success. So it could still be some subtlety in the graphics commands, which acrobat only complains about under certain arbitrary circumstances.

It would really be helpful if someone with Acrobat Pro could run those files through the preflight function. If the problem is real (and not just a viewer bug), that would give us the information directly from the horse's mouth.

GerardoAllende commented 5 months ago

When I use Acrobat, I get the same error when printing a PDF. The only requirement is that there is a "path" in the code.

Minimal test code:

from fpdf import FPDF

pdf = FPDF()
pdf.add_page()

with pdf.new_path() as path:
    path.move_to(1, 1)
    path.line_to(9, 9)
    path.close()

pdf.output("test.pdf")

Then print test.pdf using Acrobat Reader. The error should appear right after printing.

The problem persists when pdf.compress = False

Lucas-C commented 5 months ago

> When I use Acrobat, I get the same error when printing a PDF. The only requirement is that there is a "path" in the code.

I think this is a different problem, so I moved your comment into a dedicated issue 🙂

GerardoAllende commented 5 months ago

Different problem, same workaround -> #1144 also fixes this one. Just comment out these lines: https://github.com/py-pdf/fpdf2/edit/master/fpdf/drawing.py#L1448-L1454 Results: acro-svg-workaround.pdf acro-svg-err.pdf

AurelianTimu commented 4 months ago

I was having exactly the same issue when using SVGs. It was not 100% reproducible and happened rarely. I tried the fix in #1145 locally, and so far in my testing I haven't seen the issue again.

Is there an ETA to land 2.7.9 on pypi?

Lucas-C commented 4 months ago

> Is there an ETA to land 2.7.9 on pypi?

If @gmischler & @andersonhc agree, I think we could perform a new release this month! 🙂