benisraelnir commented 11 months ago

Hi. In the newest version, 1.23.5, I am getting this error when reading specific PDFs (a link to one of them is in the code below).

c:\Workspace\ikido-data-science\venv\Lib\site-packages\fitz\fitz.py in ?(self, clip, flags, matrix)
   6002     def _get_textpage(self, clip=None, flags=0, matrix=None):
-> 6003         val = _fitz.Page__get_textpage(self, clip, flags, matrix)
   6004         val.thisown = True
   6005 
   6006         return val

RuntimeError: cycle in structure tree

To Reproduce

This is my code:

import io
import requests
import fitz
pdf_link = 'https://app.ikido.tech/api/datasheet/b738958aeedfcc7efee127e5fea0a6b483e4022ac562c16473ab89af7ef0cd9444f6c1c884a398362c56863ce9ea6cbedc9a005f44facaa990c567a9d08ddd95/AVXC-S-A0014478402-1.pdf'
request = requests.get(pdf_link, timeout=20)
filestream = io.BytesIO(request.content)
text = []
with fitz.open(stream=filestream, filetype="pdf") as doc: 
    doc[0].get_text()

configuration

Both on Windows and Linux.

JorjMcKie commented 11 months ago

Sorry for the tardy response! Confirming: The PDF indeed contains a loop in the definition of its structure tree. So the diagnosis (a recent fix in MuPDF) is correct and the exception is justifiable.

Whether it would make sense to downgrade this problem to a warning is open to interpretation, though. We are discussing this.

As a workaround, put text extraction in a try/except clause. In fact, ignoring the structure tree altogether will also help, and text extraction might then succeed. Therefore, you could do this:

text = []
for page in doc:
    try:
        text.append(page.get_text())
    except RuntimeError:  # make a temporary PDF with the problem page
        temp = fitz.open()
        temp.insert_pdf(doc, from_page=page.number, to_page=page.number)
        text.append(temp[0].get_text())
        temp.close()

This will work in your case.
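
The same fallback can be wrapped in a reusable helper. The following is a minimal sketch, not part of PyMuPDF: the name `extract_text_tolerant` and its `open_empty` parameter are hypothetical, introduced here so that the fallback path can be exercised without a real PDF. It uses only the calls already shown (`fitz.open()`, `insert_pdf`, `get_text`) and closes the temporary document even if the second extraction fails.

```python
def extract_text_tolerant(doc, open_empty=None):
    """Return one text string per page, retrying failed pages via a temporary PDF.

    `open_empty` is a hypothetical hook: a callable returning a new, empty
    document. It defaults to fitz.open (PyMuPDF), imported lazily.
    """
    if open_empty is None:
        import fitz  # PyMuPDF
        open_empty = fitz.open

    text = []
    for page in doc:
        try:
            text.append(page.get_text())
        except RuntimeError:
            # Copy only the problem page into a fresh PDF and retry there.
            temp = open_empty()
            try:
                temp.insert_pdf(doc, from_page=page.number, to_page=page.number)
                text.append(temp[0].get_text())
            finally:
                temp.close()
    return text
```

With an open document this would be called as `pages = extract_text_tolerant(doc)`. If the retry on the temporary PDF also raises, the exception propagates to the caller.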

JorjMcKie commented 11 months ago

Update from the MuPDF developers: In a future MuPDF version, a cyclic Structure Tree will be disabled or ignored when processing the PDF's contents, in effect leading to the same result as my workaround above.

julian-smith-artifex-com commented 10 months ago

With the latest PyMuPDF and MuPDF, the test case runs OK, with a warning "MuPDF error: cycle in structure tree".

#2548 tests the same issue.

julian-smith-artifex-com commented 10 months ago

tests/test_2548.py:test_2548() has been extended to check for the new behaviour in PyMuPDF-1.23.7, so marking this as fixed in next release.
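
Because newer versions report the cycle as a warning instead of raising, calling code that wants to know whether the structure tree was ignored could inspect the warning text after extraction. A minimal sketch: `mentions_structure_cycle` is a hypothetical helper, and since the exact warning wording differs between versions it only looks for the common substring. In PyMuPDF the text would come from `fitz.TOOLS.mupdf_warnings()`, the call used in tests/test_2548.py.

```python
# Substring shared by the warning messages quoted in this thread.
CYCLE_MARKER = "cycle in structure tree"


def mentions_structure_cycle(warnings_text):
    """Return True if the collected warning text mentions a structure-tree cycle."""
    return CYCLE_MARKER in (warnings_text or "")


# Warning strings quoted in this thread, plus the empty case:
print(mentions_structure_cycle("MuPDF error: cycle in structure tree"))  # True
print(mentions_structure_cycle(
    "structure tree broken, assume tree is missing: cycle in structure tree"))  # True
print(mentions_structure_cycle(""))  # False
```

In real use, one would call `fitz.TOOLS.mupdf_warnings(reset=True)` before extraction and pass the result of `fitz.TOOLS.mupdf_warnings()` to the helper afterwards.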

dvzrv commented 10 months ago

Hi! When building pymupdf 1.23.6 against mupdf 1.23.7 I get a failing test:

=================================== FAILURES ===================================
__________________________________ test_2548 ___________________________________

    def test_2548():
        """Text extraction should fail because of PDF structure cycle.

        Old MuPDF version did not detect the loop.
        """
        print(f'test_2548(): {fitz.mupdf_version_tuple=}')
        if fitz.mupdf_version_tuple < (1, 23, 4):
            print(f'test_2548(): Not testing #2548 because infinite hang before mupdf-1.23.4.')
            return
        fitz.TOOLS.mupdf_warnings(reset=True)
        doc = fitz.open(f'{root}/tests/resources/test_2548.pdf')
        e = False
        for page in doc:
            try:
                _ = page.get_text()
            except Exception as ee:
                print(f'test_2548: {ee=}')
                if hasattr(fitz, 'mupdf'):
                    # Rebased.
                    expected = "RuntimeError('code=2: cycle in structure tree')"
                else:
                    # Classic.
                    expected = "RuntimeError('cycle in structure tree')"
                assert repr(ee) == expected, f'Expected {expected=} but got {repr(ee)=}.'
                e = True
        wt = fitz.TOOLS.mupdf_warnings()
        print(f'test_2548(): {wt=}')
        if fitz.mupdf_version_tuple < (1, 24, 0):
>           assert e
E           assert False

tests/test_2548.py:35: AssertionError
----------------------------- Captured stdout call -----------------------------
test_2548(): fitz.mupdf_version_tuple=(1, 23, 7)
test_2548(): wt='structure tree broken, assume tree is missing: cycle in structure tree'
=========================== short test summary info ============================
FAILED tests/test_2548.py::test_2548 - assert False
================= 1 failed, 157 passed, 1 deselected in 4.90s ==================

Can you point me to where this test is fixed? For rebuild purposes I will have to disable it for now.

julian-smith-artifex-com commented 10 months ago

Releases of PyMuPDF are only tested with a specific MuPDF, and are not tested or updated to work with later MuPDF releases.

MuPDF often changes behaviour between its releases, so some test failures with later MuPDF releases are to be expected. In particular, PyMuPDF-1.23.6 was only tested with MuPDF-1.23.5.

If you want to use MuPDF-1.23.7, you'll have to wait for our next release, PyMuPDF-1.23.7, which I'm hoping to make today or tomorrow.

[Or you could try the latest PyMuPDF from git, which usually (but not always) works with the latest MuPDF from git (master and current release branches).]
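
For packagers, the pairing described above could be made explicit before running the test suite. This is a sketch under stated assumptions: `TESTED_WITH` and `check_pairing` are hypothetical names, and the table contains only the one pairing stated in this thread (PyMuPDF-1.23.6 with MuPDF-1.23.5). Any further entries would have to be taken from the release notes.

```python
# Hypothetical table: PyMuPDF release -> MuPDF release it was tested with.
# Only the pairing mentioned in this thread is listed.
TESTED_WITH = {(1, 23, 6): (1, 23, 5)}


def check_pairing(pymupdf_version, mupdf_version):
    """Classify a version pairing as 'tested', 'untested', or 'unknown'."""
    expected = TESTED_WITH.get(pymupdf_version)
    if expected is None:
        return "unknown"  # no information about this PyMuPDF release
    return "tested" if expected == mupdf_version else "untested"
```

For the build reported above, `check_pairing((1, 23, 6), (1, 23, 7))` returns `"untested"`, so test failures there are not surprising. The MuPDF side of the comparison can be read from `fitz.mupdf_version_tuple`.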

julian-smith-artifex-com commented 10 months ago

Fixed in 1.23.7.

RuntimeError: cycle in structure tree #2749
