Closed benisraelnir closed 10 months ago
Sorry for the tardy response! Confirming: The PDF indeed contains a loop in the definition of its structure tree. So the diagnosis (a recent fix in MuPDF) is correct and the exception is justifiable.
Whether this problem should be downgraded to a warning is open to interpretation, though. We are discussing this.
As a workaround, put the text extraction in a try/except clause. Ignoring the structure tree altogether also helps, and text extraction may then succeed. So you could do this:
text = []
for page in doc:
    try:
        text.append(page.get_text())
    except RuntimeError:  # make a temporary PDF with the problem page
        temp = fitz.open()
        temp.insert_pdf(doc, from_page=page.number, to_page=page.number)
        text.append(temp[0].get_text())
        temp.close()
This will work in your case.
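If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch of the workaround described here, not an official PyMuPDF API: the function name `extract_text_tolerant` and its `open_pdf` parameter are made up for illustration, and it assumes (as in this case) that copying the page into a fresh PDF leaves the broken structure tree behind.

```python
def extract_text_tolerant(doc, open_pdf=None):
    """Return one text string per page, retrying failed pages in isolation.

    doc      -- an open PyMuPDF document
    open_pdf -- hypothetical parameter: a callable returning a new empty PDF.
                Defaults to fitz.open; it exists only so the helper can be
                tried with stand-in objects.
    """
    if open_pdf is None:
        import fitz  # PyMuPDF, imported lazily so the helper loads without it
        open_pdf = fitz.open

    texts = []
    for page in doc:
        try:
            texts.append(page.get_text())
        except RuntimeError:
            # Copy only the failing page into a temporary PDF and extract
            # from there (assumption: the cyclic structure tree is not
            # carried over, as observed in this issue).
            temp = open_pdf()
            try:
                temp.insert_pdf(doc, from_page=page.number, to_page=page.number)
                texts.append(temp[0].get_text())
            finally:
                temp.close()  # always release the temporary document
    return texts
```

Typical use would be `texts = extract_text_tolerant(fitz.open("input.pdf"))`, where the file name is a placeholder. The `finally` clause closes the temporary document even if the second extraction attempt fails as well; in that case the exception still propagates to the caller.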
Update from the MuPDF developers: In a future MuPDF version, a cyclic Structure Tree will be disabled or ignored when processing the PDF's contents. In effect, this leads to the same result as my workaround above.
With latest PyMuPDF and MuPDF the test case runs ok, with a warning "MuPDF error: cycle in structure tree".
tests/test_2548.py:test_2548() has been extended to check for the new behaviour in PyMuPDF-1.23.7, so marking this as fixed in next release.
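For callers who want to notice the condition now that it is only a warning, one possible approach is to inspect the text returned by `fitz.TOOLS.mupdf_warnings()`. The sketch below is an assumption about how that could be done, not part of PyMuPDF: the helper name `structure_tree_cycle_reported` is invented, and it relies only on the phrase "cycle in structure tree" that appears in the warning quoted in this thread. The wording may differ in other MuPDF versions.

```python
# Phrase seen in the MuPDF warning quoted in this thread (may change between versions).
CYCLE_MARKER = "cycle in structure tree"


def structure_tree_cycle_reported(warnings_text):
    """Return True if the collected MuPDF warnings mention a structure-tree cycle.

    warnings_text -- the string returned by fitz.TOOLS.mupdf_warnings(),
                     or None / "" if nothing was collected.
    """
    return CYCLE_MARKER in (warnings_text or "")


# Intended use with PyMuPDF (not executed here):
#   fitz.TOOLS.mupdf_warnings(reset=True)   # clear earlier warnings
#   text = page.get_text()
#   if structure_tree_cycle_reported(fitz.TOOLS.mupdf_warnings()):
#       ...  # e.g. log that the structure tree was ignored for this page
```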
Hi! When building pymupdf 1.23.6 against mupdf 1.23.7 I get a failing test:
=================================== FAILURES ===================================
__________________________________ test_2548 ___________________________________
    def test_2548():
        """Text extraction should fail because of PDF structure cycle.
        Old MuPDF version did not detect the loop.
        """
        print(f'test_2548(): {fitz.mupdf_version_tuple=}')
        if fitz.mupdf_version_tuple < (1, 23, 4):
            print(f'test_2548(): Not testing #2548 because infinite hang before mupdf-1.23.4.')
            return
        fitz.TOOLS.mupdf_warnings(reset=True)
        doc = fitz.open(f'{root}/tests/resources/test_2548.pdf')
        e = False
        for page in doc:
            try:
                _ = page.get_text()
            except Exception as ee:
                print(f'test_2548: {ee=}')
                if hasattr(fitz, 'mupdf'):
                    # Rebased.
                    expected = "RuntimeError('code=2: cycle in structure tree')"
                else:
                    # Classic.
                    expected = "RuntimeError('cycle in structure tree')"
                assert repr(ee) == expected, f'Expected {expected=} but got {repr(ee)=}.'
                e = True
        wt = fitz.TOOLS.mupdf_warnings()
        print(f'test_2548(): {wt=}')
        if fitz.mupdf_version_tuple < (1, 24, 0):
>           assert e
E           assert False
tests/test_2548.py:35: AssertionError
----------------------------- Captured stdout call -----------------------------
test_2548(): fitz.mupdf_version_tuple=(1, 23, 7)
test_2548(): wt='structure tree broken, assume tree is missing: cycle in structure tree'
=========================== short test summary info ============================
FAILED tests/test_2548.py::test_2548 - assert False
================= 1 failed, 157 passed, 1 deselected in 4.90s ==================
Can you point me to where this test is fixed? For rebuild purposes I will have to disable it for now.
Releases of PyMuPDF are only tested with a specific MuPDF, and are not tested or updated to work with later MuPDF releases.
MuPDF often changes behaviour between its releases, so some test failures with later MuPDF releases are to be expected. In particular, PyMuPDF-1.23.6 was only tested with MuPDF-1.23.5.
If you want to use MuPDF-1.23.7, you'll have to wait for our next release, PyMuPDF-1.23.7, which I'm hoping to make today or tomorrow.
[Or you could try the latest PyMuPDF from git, which usually (but not always) works with the latest MuPDF from git (master and the current release branch).]
Fixed in 1.23.7.
Hi. In the newest version, 1.23.5, I am getting this error when reading specific PDFs (a link to one of them is in the code below).
To Reproduce
This is my code:
configuration