pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

subset_fonts error exit without exception/warning #3470

Closed ragebear00 closed 1 week ago

ragebear00 commented 1 month ago

Description of the bug

In the new PyMuPDF 1.24.3, if an error occurs inside doc.subset_fonts(), the process ends without any warning or error code. In PyMuPDF 1.23.26, the same doc.subset_fonts() call raises an exception.

How to reproduce the bug

In PyMuPDF 1.23.26:

Traceback (most recent call last):
  File "C:_a\PDF_Searchable_v1.py", line 346, in pdfSearhable4
    doc.subset_fonts()
  File "C:\Users\6\AppData\Local\Programs\Python\Python310\lib\site-packages\fitz\utils.py", line 5631, in subset_fonts
    width_table, def_width = get_old_widths(font_xref)
  File "C:\Users\6\AppData\Local\Programs\Python\Python310\lib\site-packages\fitz\utils.py", line 5350, in get_old_widths
    df_xref = int(df[1][1:-1].replace("0 R", ""))
ValueError: invalid literal for int() with base 10: '<</BaseFont/CIDFont+F1/CIDSystemInfo<</Ordering 97 /Registry 98 /Supplement 0>>/CIDToGIDMap/Identity/FontDescriptor<</Ascent 952/CapHeight 631/Descent -268/Flags 6/FontBBox 99 /FontFile2 100 /FontNam

PyMuPDF version

1.24.3

Operating system

Windows

Python version

3.10

JorjMcKie commented 1 month ago

This post cannot be accepted without a reproducing file. To work around an urgent situation, please use the argument fallback=True.

ragebear00 commented 1 month ago

Running doc.subset_fonts() on the attached file (1 - Copy.pdf) raises an error in an earlier version.

With fallback=True, doc.subset_fonts() raises the same error.

In the new version (without fallback), no error is raised, but the file written by doc.save() after doc.subset_fonts() has scrambled words.
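For the case where the call does raise (the older version, or fallback=True as described above), a caller can guard it so the rest of the job continues. This is only a minimal sketch: the helper name subset_fonts_or_skip is hypothetical and not part of PyMuPDF, and it does nothing for the silent scrambling seen in 1.24.3 without fallback.

```python
def subset_fonts_or_skip(doc, **kwargs):
    """Call doc.subset_fonts(**kwargs); return True on success, False if it raised.

    `doc` is assumed to be an opened PyMuPDF document. Keyword arguments such
    as fallback=True are passed through unchanged.
    """
    try:
        doc.subset_fonts(**kwargs)
    except Exception as exc:  # e.g. the ValueError from get_old_widths
        print(f"subset_fonts failed, continuing without subsetting: {exc!r}")
        return False
    return True
```

If the helper returns False, the document may already have been partially modified, so reopening the original file before saving is the cautious choice.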

cbm755 commented 1 month ago

I can reproduce the previous comment:

In [2]: fitz.version
Out[2]: ('1.23.3', '1.23.2', '20230831000001')

In [3]: d = fitz.open("1.-.Copy.pdf")

In [4]: d.subset_fonts()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 d.subset_fonts()

File /usr/lib64/python3.12/site-packages/fitz/utils.py:5448, in subset_fonts(doc, verbose)
   5445 # walk through the original font xrefs and replace each by the subset def
   5446 for font_xref in xref_set:
   5447     # we need the original '/W' and '/DW' width values
-> 5448     width_table, def_width = get_old_widths(font_xref)
   5449     # ... and replace original font definition at xref with it
   5450     doc.update_object(font_xref, font_str)

File /usr/lib64/python3.12/site-packages/fitz/utils.py:5175, in subset_fonts.<locals>.get_old_widths(xref)
   5173 if df[0] != "array":  # only handle xref specifications
   5174     return None, None
-> 5175 df_xref = int(df[1][1:-1].replace("0 R", ""))
   5176 widths = doc.xref_get_key(df_xref, "W")
   5177 if widths[0] != "array":  # no widths key found

ValueError: invalid literal for int() with base 10: '<</BaseFont/CIDFont+F1/CIDSystemInfo<</Ordering 13 /Registry 14 /Supplement 0>>/CIDToGIDMap/Identity/FontDescriptor<</Ascent 952/CapHeight 631/Descent -268/Flags 6/FontBBox 15 /FontFile2 16 /FontName

But with 1.24.3, I get no error, and upon save I see scrambled words: [image]
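Reading the traceback, the failing line seems to assume that the array value holds a single indirect reference such as "[13 0 R]", while this file apparently stores the descendant font as an inline dictionary. The sketch below only illustrates that parsing assumption with plain strings. The function name descendant_font_xref is hypothetical, and this is not the fix that went into MuPDF.

```python
import re

# Matches an array holding exactly one indirect reference, e.g. "[13 0 R]".
_SINGLE_REF = re.compile(r"^\[\s*(\d+)\s+\d+\s+R\s*\]$")


def descendant_font_xref(array_value):
    """Return the xref number from '[N G R]', or None for anything else."""
    match = _SINGLE_REF.match(array_value.strip())
    return int(match.group(1)) if match else None


def original_expression(array_value):
    """The expression from the traceback, applied to the same string."""
    return int(array_value[1:-1].replace("0 R", ""))


reference = "[13 0 R]"
inline_dict = "[<</BaseFont/CIDFont+F1/CIDToGIDMap/Identity>>]"

print(descendant_font_xref(reference))    # 13
print(descendant_font_xref(inline_dict))  # None
print(original_expression(reference))     # 13
try:
    original_expression(inline_dict)
except ValueError as exc:
    print("ValueError:", exc)
```

The replace("0 R", "") step would also explain why the dictionary in the error message shows entries like "/Ordering 13 /Registry 14" with the "0 R" part missing.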

JorjMcKie commented 1 month ago

The MuPDF team has developed a fix which I am currently testing.

JorjMcKie commented 1 month ago

Update: fix developed.

cbm755 commented 1 month ago

I have a possibly related issue where 1.24.3 leaves some stray characters on the page, which go away if I stop using subset_fonts. I haven't narrowed it down to an MWE yet, but one difference is that I DO NOT get an error with older PyMuPDF, so it might not be quite the same issue... More to follow.

Downstream issue: https://gitlab.com/plom/plom/-/issues/3374

julian-smith-artifex-com commented 1 week ago

Fixed in 1.24.6.
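Code that has to run against several PyMuPDF versions could check the installed version before calling subset_fonts. A minimal sketch, assuming purely numeric dotted version strings; both helper names are hypothetical:

```python
def parse_version(text):
    """Turn '1.24.6' into (1, 24, 6). Assumes numeric, dot-separated parts."""
    return tuple(int(part) for part in text.split("."))


def has_subset_fonts_fix(version_text, fixed_in="1.24.6"):
    """True if version_text is at or after the release named in this thread."""
    return parse_version(version_text) >= parse_version(fixed_in)


print(has_subset_fonts_fix("1.24.3"))   # False
print(has_subset_fonts_fix("1.24.6"))   # True
print(has_subset_fonts_fix("1.23.26"))  # False
```

With PyMuPDF installed, the string to pass would be the first element of fitz.version, as shown in the session earlier in this thread.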