pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
4.76k stars 463 forks

Missing literals #3669

Closed Bardo-Konrad closed 2 weeks ago

Bardo-Konrad commented 3 weeks ago

Description of the bug

In some documents, get_text outputs the wrong characters inside words. For instance, the text in the PDF reads "Dort machten die Handelsschiffe auf der Überfahrt", but I get "Dort machten die Handelsschiye auf der Überfahrt". It happens with "ff" and probably other character sequences. When copying from the document in a PDF reader like SumatraPDF, I also get "Dort machten die Handelsschiye auf der Überfahrt".

PyMuPDF version

1.23.x or earlier

Operating system

Windows

Python version

3.11

JorjMcKie commented 3 weeks ago

You included neither a reproducing file nor a code snippet, so this post does not yet qualify as a bug and we are forced to guess: your file may use ligatures in the text. "ff" is one of the 6 standard ligatures in Latin text, which means that one Unicode code point (and one glyph) is used to represent multiple characters. By default, ligatures are passed through in text extraction; depending on your output device, they should still look OK. To confirm, you can try a modified combination of text extraction flag bits, e.g. flags=0. This will dissolve ligatures into their component characters. See the documentation for details.
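
A minimal sketch of the suggestion above, not a confirmed fix for this particular file. It assumes the older `import fitz` module name (to match the reported 1.23.x version) and a hypothetical file name `input.pdf`. The helper names `page_texts` and `expand_ligatures` are made up for illustration and are not part of PyMuPDF. Note that `flags=0` also switches off any other default flag bits, not only ligature preservation. The second helper shows what "dissolving" a ligature means at the Unicode level, using only the standard library.

```python
# Latin ligature code points U+FB00..U+FB06 and their component letters.
_LIGATURE_TABLE = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
})


def expand_ligatures(text: str) -> str:
    """Replace each single-code-point ligature with its component letters."""
    return text.translate(_LIGATURE_TABLE)


def page_texts(path: str, flags: int = 0) -> list:
    """Return the plain text of every page, extracted with the given flags.

    flags=0 clears all text extraction flag bits, so ligatures are not
    preserved in the output.
    """
    import fitz  # PyMuPDF, imported lazily so the rest runs without it

    doc = fitz.open(path)
    try:
        return [page.get_text("text", flags=flags) for page in doc]
    finally:
        doc.close()


if __name__ == "__main__":
    # Standard-library illustration: one code point becomes two letters.
    sample = "Handelsschi\ufb00e"
    print(len(sample), len(expand_ligatures(sample)))
    print(expand_ligatures(sample))
```

With PyMuPDF installed, `page_texts("input.pdf")` would return one string per page. If the extracted text still contains ligature code points, `expand_ligatures` can be applied to the result as a post-processing step.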

JorjMcKie commented 2 weeks ago

Closing this for lack of response over an extended period. In a future release we will change the default text flags for searches so that ligatures are no longer preserved.

Bardo-Konrad commented 2 weeks ago

Thank you for your effort. I applied the change you suggested without replying, so you were not notified. I apologize if it felt like your reply was in vain.