Closed jn-chrn closed 1 year ago
Thanks for this report and the reproducer.
I've just pushed a change so that get_textpage() (and therefore extractDICT()) defaults to setting the rect to the page's rect, unless a clip rect is explicitly passed in.
This fixes the failure of your test programme, and will be in the next release.
(Note that your test programme fails later on because texts_as_dict["blocks"][0]["lines"][0]["spans"] is empty.)
Thank you for the fast fix!
Note that your test programme fails later on because texts_as_dict["blocks"][0]["lines"][0]["spans"] is empty.
I tried again and did not get this issue: there are 3 elements in the list of spans when I try locally. However, the bboxes of all these spans are also very large.
Just to make it clear again, there are two issues:
(text_page.extractDICT()), the width and height are invalid
@jn-chrn admittedly, this PDF has some very, very unusual specifications and fonts:
fitz.Rect(0,0,0,0). And the critical values for character geometry computations, font.ascender / font.descender, are unusable, namely equal to the max. C float value, which is the direct reason for computing infinite bboxes.
PyMuPDF's get_text("dict",...) method computes span / line / block boundary boxes as the rectangle unions of the single characters contained therein (which is inevitable for technical reasons). So this explains those infinite rectangles.
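To illustrate why a single bad character is enough, here is a minimal sketch of the union rule described above. It uses plain `(x0, y0, x1, y1)` tuples instead of `fitz.Rect`, and the character boxes are made-up numbers, not values taken from crop.pdf:

```python
FLT_MAX = 3.4028234663852886e38  # largest finite C float


def union(rects):
    """Return the smallest (x0, y0, x1, y1) rectangle containing all rects."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


# Two well-behaved characters plus one whose vertical extent was derived
# from an unusable ascender / descender (hypothetical values).
chars = [
    (22.5, 3.5, 27.0, 8.9),
    (27.0, 3.5, 31.0, 8.9),
    (31.0, -FLT_MAX, 34.9, FLT_MAX),
]

good_span = union(chars[:2])  # stays small
bad_span = union(chars)       # inherits the huge extent of the third char
print(good_span)
print(bad_span[3] - bad_span[1] > 1e6)
```

Because the line bbox is the union of its spans and the block bbox the union of its lines, the oversized extent propagates all the way up.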
The PyMuPDF-specific logic to validate character bboxes can be switched off via fitz.TOOLS.unset_quad_corrections(True), in which case the original MuPDF computations will prevail.
In this case, this remedy won't work either: The bboxes are no longer infinite, but still crazy enough.
Anyway, doing get_text(<any-option>, clip=page.rect) will deliver no text at all.
@jn-chrn - just encountered a spot in the code, where character bbox calculation will go wrong if font ascender / descender take on max C float values - which is the case here. I am making progress and will be right back once the situation is clarified.
As mentioned before, it's the fault of those peculiar Type3 fonts. Because they deliver nonsense values for data that are required for bbox computation, some ersatz assumptions must be made. The best result I have achieved so far looks like this for your case: The block/line/span bbox (black border) has these values (the blue boxes are single characters):
'bbox': (22.474653244018555,
3.4806418418884277,
34.903072357177734,
8.929698944091797),
To achieve this, the script must use fitz.TOOLS.set_small_glyph_heights(True) to enforce corrective bbox / character quad computations ...
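As a rough illustration of what such "ersatz assumptions" could look like, the sketch below replaces unusable ascender / descender values before computing a character bbox. This is not PyMuPDF's actual correction code: the fallback values `0.8` / `-0.2`, the sanity limit, and the function names are all assumptions made for this example.

```python
FLT_MAX = 3.4028234663852886e38  # largest finite C float

# Hypothetical fallback metrics (fractions of the font size).
FALLBACK_ASCENDER = 0.8
FALLBACK_DESCENDER = -0.2


def usable(value, limit=1e6):
    """A metric is usable if it is finite and of plausible magnitude."""
    return abs(value) < limit  # False for inf, NaN and FLT_MAX-like values


def effective_metrics(ascender, descender):
    """Return the font's own metrics if plausible, else the fallbacks."""
    if usable(ascender) and usable(descender) and ascender > descender:
        return ascender, descender
    return FALLBACK_ASCENDER, FALLBACK_DESCENDER


def char_bbox(origin, advance, fontsize, ascender, descender):
    """Bbox of a horizontally written glyph; y grows downwards."""
    asc, desc = effective_metrics(ascender, descender)
    x, y = origin
    return (x, y - asc * fontsize, x + advance, y - desc * fontsize)


# A glyph from a font reporting max-float metrics still gets a finite box.
box = char_bbox((10.0, 20.0), 5.0, 10.0, FLT_MAX, -FLT_MAX)
print(box)
```

The point of the sketch is only that the height of the box comes from assumed metrics once the font's own values are rejected, which is why the resulting boxes are reasonable rather than exact.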
Regarding the PDF file itself being unusual, it was created from a much larger file using mutool poster, so it may have some remains of the original file.
An important note: no large bbox was there with 1.19.6! But with the latest version (1.21.0), we got many of them.
The following code counts the bboxes with a width greater than 10^6:
import fitz

document: fitz.Document = fitz.open("crop.pdf")
page = list(document.pages())[0]
page_rect: fitz.Rect = page.rect
text_page = page.get_textpage()
texts_as_dict = text_page.extractDICT()

# count the spans whose recovered quad is wider than 10^6
counter = 0
for block in texts_as_dict["blocks"]:
    for line in block["lines"]:
        direction = line["dir"]
        for span in line["spans"]:
            quad: fitz.Quad = fitz.recover_quad(line_dir=direction, span=span)
            if quad.width > 1e6:
                counter += 1
print(counter)
So this small PDF has many very large bboxes with the latest version, but none with the older version. This issue only started to occur after 1.19.6.
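For comparing versions without recomputing quads, a variant of the check above can work directly on the dictionary, using each span's "bbox" entry instead of `fitz.recover_quad()`. The sketch below runs on a hand-made dictionary shaped like the output of extractDICT(); the function name, the limit, and the sample values are assumptions for illustration only.

```python
def oversized_spans(text_dict, limit=1e6):
    """Return all spans whose bbox is wider or taller than `limit`."""
    found = []
    for block in text_dict["blocks"]:
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line["spans"]:
                x0, y0, x1, y1 = span["bbox"]
                if x1 - x0 > limit or y1 - y0 > limit:
                    found.append(span)
    return found


# Hand-made sample mimicking the "dict" layout (made-up numbers).
sample = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "dir": (1.0, 0.0),
                    "spans": [
                        {"text": "km", "bbox": (22.47, 3.48, 34.90, 8.93)},
                        {"text": "1.6", "bbox": (40.0, 3.48, 4.0e8, 8.93)},
                    ],
                }
            ],
        },
        {"type": 1},  # an image block
    ]
}

big = oversized_spans(sample)
print(len(big), [s["text"] for s in big])
```

With a real document, `text_dict` would be the result of `text_page.extractDICT()`, so the same function can be run under both 1.19.6 and 1.21.0 to compare counts.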
Regarding the PDF file itself being unusual, it was created from a much larger file using mutool poster, so it may have some remains of the original file.
Don't take my comment personal 😉.
You are right, that page obviously is being "cut out" from a much larger one.
There is some problem within the code creating the TextPage (in MuPDF). In the most current version, the Type3 font is no longer interpreted correctly.
This leads to those crazy large bboxes and character widths. I have developed corrective code in PyMuPDF, which delivers reasonable results, when following this coding pattern:
import fitz
import sys

vsn = f"-{sys.version_info[0]}-{sys.version_info[1]}"

# following ensures using PyMuPDF corrections:
fitz.TOOLS.set_small_glyph_heights(True)

doc = fitz.open("crop.pdf")
page = doc[0]
page.clean_contents()  # make sure page.draw_rect() lands in right place
blocks = page.get_text(
    "dict",
    clip=page.rect,  # only look at visible page
    flags=fitz.TEXTFLAGS_TEXT,  # only look at text
)["blocks"]
for b in blocks:
    page.draw_rect(b["bbox"], width=0.2, color=fitz.pdfcolor["green"])
    for l in b["lines"]:
        for s in l["spans"]:
            print(s["text"])
doc.ez_save(f"zdict{vsn}.pdf")
Output:
py testdict.py
km
1.6
Internally, I also had to change the decision whether a character should be regarded as inside the "clip", from "bbox is completely inside clip" to "character origin is inside clip".
Here, "origin" is the bottom-left point of a character (glyph), where drawing of it starts.
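The difference between the two rules can be sketched with plain tuples. The helper names are hypothetical, and treating points on the clip border as inside is an assumption of this example, not a statement about the actual implementation:

```python
def bbox_inside(bbox, clip):
    """Old rule: the whole character bbox must lie within the clip."""
    return (
        clip[0] <= bbox[0]
        and clip[1] <= bbox[1]
        and bbox[2] <= clip[2]
        and bbox[3] <= clip[3]
    )


def origin_inside(origin, clip):
    """New rule: only the character origin must lie within the clip."""
    x, y = origin
    return clip[0] <= x <= clip[2] and clip[1] <= y <= clip[3]


clip = (0.0, 0.0, 100.0, 50.0)       # stands in for page.rect
char_bbox = (22.5, -1e30, 27.0, 1e30)  # absurdly tall box (made-up numbers)
char_origin = (22.5, 8.9)              # bottom-left point of the glyph

print(bbox_inside(char_bbox, clip))      # False: character would be dropped
print(origin_inside(char_origin, clip))  # True: character is kept
```

Under the old rule, any character with an oversized bbox falls outside every finite clip, which matches the earlier observation that clip=page.rect delivered no text at all.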
I have submitted a related bug in MuPDF's issue system.
Thanks for the insight, and the fast answer (as always)!
Don't take my comment personal
(I had to defend my poor little stupidly made PDF :smile: )
Fixed in PyMuPDF-1.21.1.
Describe the bug
Reading some text from PDF files using textpage.extractDICT() returns invalid dimensions with version 1.21.0.
To Reproduce
To reproduce, please use this piece of code which:
TextPage from the only page of the document
TextPage
Attached PDF: crop.pdf
Expected behavior
With PyMuPDF version 1.19.6, the size of the extracted bbox was very small. With the newest version, its size became far too large (by a factor of 1e8).
Your configuration
PyMuPDF was installed using pip install pymupdf.