pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
5.46k stars 513 forks

Invalid size of TextPage and bbox with newest version 1.21.0 #2048

Closed jn-chrn closed 1 year ago

jn-chrn commented 1 year ago

Describe the bug

Reading some text from PDF files using textpage.extractDICT() returns invalid dimensions with version 1.21.0

To Reproduce

To reproduce, please use this piece of code:

import fitz

document: fitz.Document = fitz.open("crop.pdf")
page = list(document.pages())[0]

page_rect: fitz.Rect = page.rect
text_page = page.get_textpage()
texts_as_dict = text_page.extractDICT()

# The file's size is about 47.4 x 14.0
assert abs(page_rect.width - 47.4) < 0.1
assert abs(page_rect.height - 14.0) < 0.1

# WRONG HERE ALREADY:
# The returned size of the page is '4294967168.0 x 4294967168.0'
assert abs(texts_as_dict["width"] - 47.4) < 0.1
assert abs(texts_as_dict["height"] - 14.0) < 0.1

first_span = texts_as_dict["blocks"][0]["lines"][0]["spans"][0]
bbox = first_span["bbox"]

# The bbox returned with version 1.19.6 is:
# '(29.58..., 2.87..., 35.07..., 10.60...)'
assert bbox[2] < 50  # ERROR: returned value '1044369984.0'
assert bbox[3] < 50  # ERROR: returned value '13269935104.0'

Attached PDF: crop.pdf

Expected behavior

With PyMuPDF version 1.19.6, the extracted bbox was very small. With the newest version, it became far too large (by a factor of roughly 1e8).

Your configuration

print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.10.6 (main, Oct  7 2022, 20:19:58) [GCC 11.2.0] 
 linux 

PyMuPDF 1.21.0: Python bindings for the MuPDF 1.21.0 library.
Version date: 2022-11-08 00:00:01.
Built for Python 3.10 on linux (64-bit).

PyMuPDF was installed using pip install pymupdf.

julian-smith-artifex-com commented 1 year ago

Thanks for this report and the reproducer.

I've just pushed a change so that get_textpage() (and therefore extractDICT()) defaults to setting the rect to the page's rect, unless a clip rect is explicitly passed in.

This fixes the failure of your test programme, and will be in the next release.

(Note that your test programme fails later on because texts_as_dict["blocks"][0]["lines"][0]["spans"] is empty.)
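The reported width and height of 4294967168.0 are consistent with a TextPage built over an unbounded rectangle instead of the page's rect. A minimal arithmetic sketch, assuming the bounds are those of MuPDF's "infinite" rectangle (the two constants are quoted from memory of MuPDF's headers and are not stated in this thread):

```python
# Assumed bounds of MuPDF's "infinite" rectangle
# (FZ_MIN_INF_RECT / FZ_MAX_INF_RECT); not taken from this thread.
INF_MIN = -0x80000000  # -2147483648
INF_MAX = 0x7FFFFF80   #  2147483520

# Width (and height) of a rectangle spanning the full assumed range.
infinite_width = float(INF_MAX - INF_MIN)
print(infinite_width)
```

If that assumption holds, the printed value is exactly the '4294967168.0' seen in the report.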

jn-chrn commented 1 year ago

Thank you for the fast fix!

> Note that your test programme fails later on because texts_as_dict["blocks"][0]["lines"][0]["spans"] is empty.

I tried again and did not get this issue: there are 3 elements in the list of spans when I try locally. The bboxes of all these spans are also very large.

jn-chrn commented 1 year ago

Just to make it clear again, there are two issues:

JorjMcKie commented 1 year ago

@jn-chrn admittedly, this PDF has some very, very unusual specifications and fonts:

  1. the MediaBox does not start at (0,0) but at (1063.9544, 1001.37216). The CropBox is identical to the MediaBox.
  2. the relevant fonts are Type3 with invalid font bboxes, fitz.Rect(0,0,0,0). And the critical values for character geometry computations, font.ascender / font.descender are unusable, namely equal to the max. C float value - which is the direct reason for computing infinite bboxes.

PyMuPDF's get_text("dict",...) method computes span / line / block boundary boxes as the rectangle unions of the single characters contained therein (which is inevitable for technical reasons). So this explains those infinite rectangles.
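To illustrate why a single bad character is enough, here is a minimal pure-Python sketch of such a rectangle union. The `union` helper and the character boxes are hypothetical and only mimic the situation described above; this is not PyMuPDF's actual code.

```python
from functools import reduce


def union(a, b):
    """Smallest rectangle (x0, y0, x1, y1) containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


# Hypothetical character boxes. The last one mimics a glyph whose height
# was derived from an unusable ascender / descender.
char_bboxes = [
    (29.6, 2.9, 32.0, 10.6),
    (32.0, 2.9, 35.1, 10.6),
    (35.1, -3.4e38, 36.0, 3.4e38),
]

# The span bbox is the union of all its character boxes.
span_bbox = reduce(union, char_bboxes)
print(span_bbox)
```

The two well-behaved boxes alone would give a tight span bbox, but the union inherits the extreme coordinates of the third one, and the same happens again for the enclosing line and block.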

The PyMuPDF-specific logic to validate character bboxes can be switched off via fitz.TOOLS.unset_quad_corrections(True) in which case the original MuPDF computations will prevail. In this case, this remedy won't work either: The bboxes are no longer infinite, but still crazy enough.

Anyway, doing get_text(<any-option>, clip=page.rect) will deliver no text at all.

JorjMcKie commented 1 year ago

@jn-chrn - just encountered a spot in the code, where character bbox calculation will go wrong if font ascender / descender take on max C float values - which is the case here. I am making progress and will be right back once the situation is clarified.

JorjMcKie commented 1 year ago

As mentioned before, it's the fault of those peculiar Type3 fonts. Because they deliver nonsense values for data that are required for bbox computation, some ersatz assumptions must be made. The best result I have achieved so far looks like this for your case: [image]. The block/line/span bbox (black border) has these values (the blue boxes are single characters):

'bbox': (22.474653244018555,
           3.4806418418884277,
           34.903072357177734,
           8.929698944091797),

To achieve this, the script must use fitz.Tools().set_small_glyph_heights(True) to enforce corrective bbox / character quad computations ...

jn-chrn commented 1 year ago

Regarding the PDF file itself being unusual, it was created from a much larger file using mutool poster, so it may have some remains of the original file.


An important note: no large bbox was there with 1.19.6! But with the latest version (1.21.0), we got many of them.

The following code counts the bboxes with a width greater than 10^6:

import fitz

document: fitz.Document = fitz.open(
    "crop.pdf"
)
page = list(document.pages())[0]

page_rect: fitz.Rect = page.rect
text_page = page.get_textpage()
texts_as_dict = text_page.extractDICT()

counter = 0
for block in texts_as_dict["blocks"]:
    for line in block["lines"]:
        direction = line["dir"]
        for span in line["spans"]:
            quad: fitz.Quad = fitz.recover_quad(line_dir=direction, span=span)
            if quad.width > 1e6:
                counter += 1

print(counter)

So this small PDF has many bboxes that are very large with the latest version, but none with the older version. This issue only started to occur after 1.19.6.
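A variant of this check that works directly on the span bboxes of the extracted dictionary, without recover_quad, could look like the sketch below. The helper name `oversized_spans` and the sample data are hypothetical. The sample only imitates the shape of the extractDICT() output and reuses numbers quoted in this thread.

```python
def oversized_spans(text_dict, limit):
    """Return the spans whose bbox is wider or taller than `limit`."""
    found = []
    for block in text_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line.get("spans", []):
                x0, y0, x1, y1 = span["bbox"]
                if x1 - x0 > limit or y1 - y0 > limit:
                    found.append(span)
    return found


# Hypothetical data shaped like the output of extractDICT().
sample = {
    "width": 47.4,
    "height": 14.0,
    "blocks": [{"lines": [{"dir": (1.0, 0.0), "spans": [
        {"text": "km", "bbox": (22.5, 3.5, 34.9, 8.9)},
        {"text": "1.6", "bbox": (29.6, 2.9, 1044369984.0, 13269935104.0)},
    ]}]}],
}

bad = oversized_spans(sample, limit=max(sample["width"], sample["height"]))
print([span["text"] for span in bad])
```

With a real document, the limit is better taken from page.rect, because the "width" and "height" of the dictionary were themselves wrong in 1.21.0.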

JorjMcKie commented 1 year ago

> Regarding the PDF file itself being unusual, it was created from a much larger file using mutool poster, so it may have some remains of the original file.

Don't take my comment personally 😉. You are right, that page has obviously been "cut out" from a much larger one. There is some problem within the code creating the TextPage (in MuPDF). In the most current version, the Type3 font is no longer interpreted correctly. This leads to those crazy large bboxes and character widths. I have developed corrective code in PyMuPDF, which delivers reasonable results when following this coding pattern:

import fitz
import sys

vsn = f"-{sys.version_info[0]}-{sys.version_info[1]}"

# following ensures using PyMuPDF corrections:
fitz.TOOLS.set_small_glyph_heights(True)

doc = fitz.open("crop.pdf")
page = doc[0]
page.clean_contents()  # make sure page.draw_rect() lands in right place

blocks = page.get_text(
    "dict",
    clip=page.rect,  # only look at visible page
    flags=fitz.TEXTFLAGS_TEXT,  # only look at text
)["blocks"]
for b in blocks:
    page.draw_rect(b["bbox"], width=0.2, color=fitz.pdfcolor["green"])
    for l in b["lines"]:
        for s in l["spans"]:
            print(s["text"])
doc.ez_save(f"zdict{vsn}.pdf")

Output:

py testdict.py
km

1.6

[image]

Internally, I also had to change the decision whether a character should be regarded as inside the "clip", from "bbox is completely inside clip" to "character origin is inside clip". Here, "origin" is the bottom-left point of a character (glyph), where drawing of it starts.
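The difference between the two rules can be sketched in plain Python. Both predicates and the sample glyph are hypothetical illustrations, and the inclusive comparisons are an assumption. The actual PyMuPDF implementation is not shown in this thread.

```python
def bbox_inside(clip, bbox):
    """Old rule: the whole character bbox must lie inside the clip."""
    return (clip[0] <= bbox[0] and clip[1] <= bbox[1]
            and bbox[2] <= clip[2] and bbox[3] <= clip[3])


def origin_inside(clip, origin):
    """New rule: only the character origin must lie inside the clip."""
    x, y = origin
    return clip[0] <= x <= clip[2] and clip[1] <= y <= clip[3]


clip = (0.0, 0.0, 47.4, 14.0)               # page size from the report
char_bbox = (29.6, -3.4e38, 32.0, 3.4e38)   # hypothetical degenerate glyph box
char_origin = (29.6, 8.9)                   # hypothetical origin of that glyph

print(bbox_inside(clip, char_bbox))      # the old rule rejects the character
print(origin_inside(clip, char_origin))  # the new rule keeps it
```

Under the old rule, a glyph with a degenerate bbox can never be inside any finite clip, which would explain why clip=page.rect delivered no text. The origin is a single point and does not depend on the broken font metrics.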

JorjMcKie commented 1 year ago

I have submitted a related bug in MuPDF's issue system.

jn-chrn commented 1 year ago

Thanks for the insight, and the fast answer (as always)!


> Don't take my comment personally

(I had to defend my poor little stupidly made PDF 😄)

julian-smith-artifex-com commented 1 year ago

Fixed in PyMuPDF-1.21.1.