pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Incorrect utf32 text extraction (high & low surrogates are split) #2608

Closed nikitar closed 1 year ago

nikitar commented 1 year ago

When extracting text (e.g. with page.get_text_blocks), some utf32 characters (e.g. 𝜎 - U+1D70E) seem to confuse the extraction logic. In that case, the extracted text is 𝜋\udf0e, which is considered invalid text by some software (DOMParser in my case).

I notice that 𝜎 and 𝜋 share the same high surrogate, and \udf0e is the correct low surrogate for 𝜎. I don't know enough about PDF or Unicode to investigate the file itself, but I'm attaching it here (page 5, the final paragraph under the 3.3 H.E.S.S. heading; the entire line is "any variability above 2.2 𝜎. For the total data set of 1.8 h, 95% confi-").

There is a similar issue in the final line of the same paragraph (𝐸th = 120 GeV) and more throughout the document.
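The surrogate observation above can be checked with a few lines of Python. This is only a sketch of the standard UTF-16 surrogate arithmetic; the helper name `utf16_surrogates` is made up for illustration and is not part of PyMuPDF.

```python
def utf16_surrogates(ch: str) -> tuple[int, int]:
    """Return the (high, low) UTF-16 surrogate pair for one non-BMP character."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("character is in the BMP and has no surrogate pair")
    offset = cp - 0x10000
    # The high surrogate carries the top 10 bits, the low surrogate the bottom 10 bits.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


sigma = utf16_surrogates("\U0001D70E")  # 𝜎, the expected character
pi = utf16_surrogates("\U0001D70B")     # 𝜋, the character that was extracted instead

print([hex(u) for u in sigma])  # ['0xd835', '0xdf0e']
print([hex(u) for u in pi])     # ['0xd835', '0xdf0b']
```

Both characters share the high surrogate 0xd835, and the stray \udf0e in the extracted text is exactly the low surrogate of 𝜎.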

I am able to access the same text correctly with Apple's Preview and with Google's Chrome/PDFium.

2201.00069.pdf

To Reproduce (mandatory)

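    import fitz  # PyMuPDF
    PDF_PATH = "2201.00069.pdf"  # the file attached above

    flags = fitz.TEXT_DEHYPHENATE | fitz.TEXT_MEDIABOX_CLIP
    with fitz.open(PDF_PATH) as doc:
        page = doc[4]  # page 5 (zero-based index)
        blocks = page.get_text_blocks(flags=flags)
        print(blocks[10])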

Your configuration (mandatory)

3.11.3 (v3.11.3:f3909b8bc8, Apr  4 2023, 20:12:10) [Clang 13.0.0 (clang-1300.0.29.30)] 
 darwin 

PyMuPDF 1.22.5: Python bindings for the MuPDF 1.22.2 library.
Version date: 2023-06-21 00:00:01.
Built for Python 3.11 on darwin (64-bit).

nikitar commented 1 year ago

Note that the resulting string also cannot be passed to Python's own str.encode, e.g.

    "variability above 2.2 𝜋\udf0e. For the total".encode("utf8")

produces

    UnicodeEncodeError: 'utf-8' codec can't encode character '\udf0e' in position 23: surrogates not allowed

It seems that it's uniformly considered invalid.
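For anyone stuck on an affected version, a possible stopgap is to detect and replace lone surrogates before handing the text to other software. The sketch below is an assumption about how one might do that, not anything PyMuPDF provides: the helper names are hypothetical, and it does not recover the original 𝜎, it only makes the string encodable again.

```python
def find_lone_surrogates(text: str) -> list[int]:
    """Return the indices of code points in the surrogate range U+D800..U+DFFF.

    A Python str stores a valid pair as one code point, so any surrogate
    that shows up on its own is unpaired.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def replace_lone_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Replace every lone surrogate with U+FFFD (or another replacement string)."""
    return "".join(
        replacement if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )


broken = "variability above 2.2 \U0001D70B\udf0e. For the total"
print(find_lone_surrogates(broken))  # [23]

cleaned = replace_lone_surrogates(broken)
encoded = cleaned.encode("utf8")  # no longer raises UnicodeEncodeError
```

The reported index matches the position 23 in the error message above.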

julian-smith-artifex-com commented 1 year ago

Thanks for the detailed report.

It seems to be a bug in MuPDF which is being looked at now, so will be fixed in PyMuPDF's next release.

julian-smith-artifex-com commented 1 year ago

Fixed in 1.23.6.