Closed nikitar closed 1 year ago
Note that the string produced also cannot be passed to Python's own encode, e.g.
"variability above 2.2 π\udf0e. For the total".encode("utf8")
produces
UnicodeEncodeError: 'utf-8' codec can't encode character '\udf0e' in position 23: surrogates not allowed
It seems that it's uniformly considered invalid.
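For anyone stuck on a release that still has the problem, here is a minimal sketch of detecting and scrubbing the stray surrogate before handing the text to a strict consumer. It is not part of the original report: the helper names find_surrogates and scrub_surrogates are made up for illustration, and replacing with U+FFFD is only one possible policy (it loses the intended character rather than recovering it).

```python
def find_surrogates(text: str) -> list:
    """Return the indices of surrogate code points (U+D800..U+DFFF) in text."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def scrub_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Replace every surrogate code point with a placeholder (assumed policy)."""
    return "".join(
        replacement if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )


# The string from the comment above, with the stray low surrogate at index 23.
extracted = "variability above 2.2 \u03c0\udf0e. For the total"

print(find_surrogates(extracted))  # [23]

cleaned = scrub_surrogates(extracted)
encoded = cleaned.encode("utf8")  # no longer raises UnicodeEncodeError
```

The reported index matches the position named in the UnicodeEncodeError message.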
Thanks for the detailed report.
This appears to be a bug in MuPDF, which is being looked at now, so it will be fixed in PyMuPDF's next release.
Fixed in 1.23.6.
When extracting text (e.g. with page.get_text_blocks), some utf32 characters (e.g. 𝜎 - U+1D70E) seem to confuse the extraction logic. In that case, the extracted text is π\udf0e, which is considered invalid text by some software (DOMParser in my case).

I notice that 𝜎 and the character extracted in its place share the same high surrogate, and \udf0e is the correct low surrogate. I don't know enough about PDF or Unicode to investigate the file itself, but I'm attaching it here (page 5, the final paragraph under the "3.3 H.E.S.S." heading; the entire line is "any variability above 2.2 𝜎. For the total data set of 1.8 h, 95% confi-").

There is a similar issue in the final line of the same paragraph (𝐸th = 120 GeV) and more throughout the document.

I am able to access the same text correctly with Apple's Preview and with Google's Chrome/PDFium.
2201.00069.pdf
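To make the surrogate observation concrete, here is a small sketch of the standard UTF-16 surrogate arithmetic applied to U+1D70E. It is an illustration only, not taken from MuPDF or PyMuPDF; the function names to_surrogates and from_surrogates are hypothetical.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Recombine a UTF-16 surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1D70E)
print(hex(high), hex(low))              # 0xd835 0xdf0e
print(hex(from_surrogates(high, low)))  # 0x1d70e
```

Every code point from U+1D400 to U+1D7FF maps to the high surrogate 0xD835, so a decoder that mishandles the pair can plausibly land on a neighbouring mathematical character and leave \udf0e behind as a separate, invalid code point.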
To Reproduce (mandatory)
Your configuration (mandatory)