Invalid size of TextPage and bbox with newest version 1.21.0

jn-chrn commented 1 year ago

Describe the bug

Reading some text from PDF files using textpage.extractDICT() returns invalid dimensions with version 1.21.0

To Reproduce

To reproduce, please use this piece of code which:

- opens the attached PDF
- gets a TextPage from the only page of the document
- computes the size of the page for comparison
- gets the width and height of the TextPage
  - the size of the TextPage is clearly invalid
- gets the bbox of the first span inside the first line of the first block
  - the bbox dimensions are clearly invalid

import fitz

document: fitz.Document = fitz.open("crop.pdf")
page = list(document.pages())[0]

page_rect: fitz.Rect = page.rect
text_page = page.get_textpage()
texts_as_dict = text_page.extractDICT()

# The file's size is about 47.4 x 14.0
assert abs(page_rect.width - 47.4) < 0.1
assert abs(page_rect.height - 14.0) < 0.1

# WRONG HERE ALREADY:
# The returned size of the page is '4294967168.0 x 4294967168.0'
assert abs(texts_as_dict["width"] - 47.4) < 0.1
assert abs(texts_as_dict["height"] - 14.0) < 0.1

first_span = texts_as_dict["blocks"][0]["lines"][0]["spans"][0]
bbox = first_span["bbox"]

# The size of the bbox return with version 1.19.6 is:
# '(29.58..., 2.87..., 35.07..., 10.60...)'
assert bbox[2] < 50  # ERROR: returned value '1044369984.0'
assert bbox[3] < 50  # ERROR: returned value '13269935104.0'

Attached PDF: crop.pdf

Expected behavior

With PyMuPDF version 1.19.6, the extracted bbox was very small. With the newest version, it became far too large (by a factor of about 1e8).

Your configuration

print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.10.6 (main, Oct  7 2022, 20:19:58) [GCC 11.2.0] 
 linux 

PyMuPDF 1.21.0: Python bindings for the MuPDF 1.21.0 library.
Version date: 2022-11-08 00:00:01.
Built for Python 3.10 on linux (64-bit).

PyMuPDF was installed using pip install pymupdf.

julian-smith-artifex-com commented 1 year ago

Thanks for this report and the reproducer.

I've just pushed a change so that get_textpage() (and therefore extractDICT()) defaults to setting the rect to the page's rect, unless a clip rect is explicitly passed in.

This fixes the failure of your test programme, and will be in the next release.

(Note that your test programme fails later on because texts_as_dict["blocks"][0]["lines"][0]["spans"] is empty.)
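The reported size of 4294967168.0 x 4294967168.0 is consistent with a TextPage whose rectangle was left unbounded. A quick arithmetic check, assuming MuPDF's "infinite" rectangle spans -2147483648 to 2147483520 on each axis (these bounds are an assumption here, not something stated in this thread; `rect_size` is a hypothetical helper):

```python
# Assumed bounds of MuPDF's "infinite" rectangle (not confirmed in this thread).
INF_MIN = -2147483648.0  # -2**31
INF_MAX = 2147483520.0   # 2**31 - 128, the largest float32 value below 2**31


def rect_size(x0, y0, x1, y1):
    """Return (width, height) of a rectangle given by its corner coordinates."""
    return (x1 - x0, y1 - y0)


infinite_size = rect_size(INF_MIN, INF_MIN, INF_MAX, INF_MAX)
print(infinite_size)
```

If that assumption holds, defaulting the TextPage rect to the page's rect would bring "width" and "height" back to roughly 47.4 x 14.0 for this file.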

jn-chrn commented 1 year ago

Thank you for the fast fix!

> Note that your test programme fails later on because texts_as_dict["blocks"][0]["lines"][0]["spans"] is empty.

I tried again and did not get this issue: there are 3 elements in the list of spans when I try locally. The bboxes of all these spans are also very large.

jn-chrn commented 1 year ago

Just to make it clear again, there are two issues:

- at the top level of the dictionary of extracted text (with text_page.extractDICT()), the width and height are invalid
- at the level of "span" elements, the bbox is invalid on some PDF files we have, and is invalid on the first span in the attached file

JorjMcKie commented 1 year ago

@jn-chrn admittedly, this PDF has some very, very unusual specifications and fonts:

- the MediaBox does not start at (0,0) but at (1063.9544, 1001.37216). The CropBox is identical to the MediaBox.
- the relevant fonts are Type3 with invalid font bboxes, fitz.Rect(0,0,0,0). And the critical values for character geometry computations, font.ascender / font.descender are unusable, namely equal to the max. C float value - which is the direct reason for computing infinite bboxes.

PyMuPDF's get_text("dict",...) method computes span / line / block boundary boxes as the rectangle unions of the single characters contained therein (which is inevitable for technical reasons). So this explains those infinite rectangles.
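To illustrate why one bad character is enough, here is a minimal sketch of a union-of-rectangles computation in plain Python. It is not PyMuPDF's actual code: `union`, `span_bbox` and the sample character boxes are hypothetical, and `FLT_MAX` is the standard maximum C float value standing in for the unusable ascender / descender.

```python
from functools import reduce
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

FLT_MAX = 3.4028234663852886e38  # maximum finite C float value


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def span_bbox(char_bboxes: Iterable[Rect]) -> Rect:
    """Span bbox as the union of its character bboxes."""
    return reduce(union, char_bboxes)


# Hypothetical character boxes; the third one has a nonsense height.
good_chars = [(29.6, 2.9, 31.0, 10.6), (31.0, 2.9, 35.1, 10.6)]
bad_chars = good_chars + [(35.1, 2.9, 36.0, FLT_MAX)]

print(span_bbox(good_chars))
print(span_bbox(bad_chars))
```

A single character with a runaway coordinate dominates the union, so the span, line and block boxes built on top of it all inherit the same bad value.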

The PyMuPDF-specific logic to validate character bboxes can be switched off via fitz.TOOLS.unset_quad_corrections(True) in which case the original MuPDF computations will prevail. In this case, this remedy won't work either: The bboxes are no longer infinite, but still crazy enough.

Anyway, doing get_text(<any-option>, clip=page.rect) will deliver no text at all.

JorjMcKie commented 1 year ago

@jn-chrn - just encountered a spot in the code, where character bbox calculation will go wrong if font ascender / descender take on max C float values - which is the case here. I am making progress and will be right back once the situation is clarified.

JorjMcKie commented 1 year ago

As mentioned before, it's the fault of those peculiar Type3 fonts. Because they deliver nonsense values for data that are required for bbox computation, some ersatz assumptions must be made. The best result I have achieved so far looks like this for your case. The block/line/span bbox (black border) has these values (the blue boxes are single characters):

'bbox': (22.474653244018555,
           3.4806418418884277,
           34.903072357177734,
           8.929698944091797),

To achieve this, the script must use fitz.TOOLS.set_small_glyph_heights(True) to enforce corrective bbox / character quad computations ...

jn-chrn commented 1 year ago

Regarding the PDF file itself being unusual, it was created from a much larger file using mutool poster, so it may have some remains of the original file.

An important note: no large bbox was there with 1.19.6! But with the latest version (1.21.0), we got many of them.

The following code counts the bboxes with a width greater than 10^6. It returns:

- a count of 309 bboxes with version 1.21.0
- a count of 0 bboxes with version 1.19.6

import fitz

document: fitz.Document = fitz.open(
    "crop.pdf"
)
page = list(document.pages())[0]

page_rect: fitz.Rect = page.rect
text_page = page.get_textpage()
texts_as_dict = text_page.extractDICT()

counter = 0
for block in texts_as_dict["blocks"]:
    for line in block["lines"]:
        direction = line["dir"]
        for span in line["spans"]:
            quad: fitz.Quad = fitz.recover_quad(line_dir=direction, span=span)
            if quad.width > 1e6:
                counter += 1

print(counter)

So this small PDF has many bboxes which are very large with the latest version, but none with the older version. This issue only started to occur after 1.19.6.
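The same check can be sketched without PyMuPDF, operating on any dictionary shaped like the output of extractDICT(). This is a simplified variant, not the script above: it measures the span "bbox" width directly instead of using fitz.recover_quad, and `count_oversized_spans` plus the sample dictionary are hypothetical.

```python
def count_oversized_spans(text_dict, max_width=1e6):
    """Count spans whose bbox is wider than max_width."""
    count = 0
    for block in text_dict.get("blocks", []):
        # Blocks without a "lines" key are skipped instead of raising KeyError.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                x0, _y0, x1, _y1 = span["bbox"]
                if x1 - x0 > max_width:
                    count += 1
    return count


# Hand-built sample using bbox values quoted in this thread.
sample = {
    "blocks": [
        {
            "lines": [
                {
                    "spans": [
                        {"bbox": (29.58, 2.87, 35.07, 10.60)},
                        {"bbox": (29.58, 2.87, 1044369984.0, 13269935104.0)},
                    ]
                }
            ]
        }
    ]
}

print(count_oversized_spans(sample))
```

A helper like this could serve as a regression check when comparing two PyMuPDF versions on the same file.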

JorjMcKie commented 1 year ago

> Regarding the PDF file itself being unusual, it was created from a much larger file using mutool poster, so it may have some remains of the original file.

Don't take my comment personal 😉. You are right, that page obviously is being "cut out" from a much larger one. There is some problem within the code creating the TextPage (in MuPDF). In the most current version, the Type3 font is no longer interpreted correctly. This leads to those crazy large bboxes and character widths. I have developed corrective code in PyMuPDF, which delivers reasonable results, when following this coding pattern:

import fitz
import sys

vsn = f"-{sys.version_info[0]}-{sys.version_info[1]}"

# following ensures using PyMuPDF corrections:
fitz.TOOLS.set_small_glyph_heights(True)

doc = fitz.open("crop.pdf")
page = doc[0]
page.clean_contents()  # make sure page.draw_rect() lands in right place

blocks = page.get_text(
    "dict",
    clip=page.rect,  # only look at visible page
    flags=fitz.TEXTFLAGS_TEXT,  # only look at text
)["blocks"]
for b in blocks:
    page.draw_rect(b["bbox"], width=0.2, color=fitz.pdfcolor["green"])
    for l in b["lines"]:
        for s in l["spans"]:
            print(s["text"])
doc.ez_save(f"zdict{vsn}.pdf")

Output:

py testdict.py
km

1.6

Internally, I also had to change the decision whether a character should be regarded as inside the "clip" from "bbox is completely inside clip" to "character origin is inside clip", where "origin" is the bottom left point of a character (glyph), i.e. where drawing of it starts.
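The difference between the two clip decisions can be sketched in plain Python. This is only an illustration, not the PyMuPDF implementation: both function names are hypothetical, the boundary handling (inclusive comparisons) is an assumption, and the sample character is made up from values quoted in this thread.

```python
def bbox_inside_clip(bbox, clip):
    """Old rule: the whole character bbox must lie inside the clip."""
    return (
        clip[0] <= bbox[0]
        and clip[1] <= bbox[1]
        and bbox[2] <= clip[2]
        and bbox[3] <= clip[3]
    )


def origin_inside_clip(origin, clip):
    """New rule: only the character origin must lie inside the clip."""
    x, y = origin
    return clip[0] <= x <= clip[2] and clip[1] <= y <= clip[3]


clip = (0.0, 0.0, 47.4, 14.0)  # approximate page rect of crop.pdf

# Hypothetical character: sensible origin, but a runaway right edge.
char_bbox = (22.47, 3.48, 1044369984.0, 8.93)
char_origin = (22.47, 8.93)

print(bbox_inside_clip(char_bbox, clip))
print(origin_inside_clip(char_origin, clip))
```

Under the old rule a character with an oversized bbox is dropped even though it is drawn on the page, which matches the earlier observation that clip=page.rect delivered no text.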

JorjMcKie commented 1 year ago

I have submitted a related bug in MuPDF's issue system.

jn-chrn commented 1 year ago

Thanks for the insight, and the fast answer (as always)!

> Don't take my comment personal

(I had to defend my poor little stupidly made PDF 😄)

julian-smith-artifex-com commented 1 year ago

Fixed in PyMuPDF-1.21.1.
