Closed by Bardo-Konrad 2 weeks ago
You did not include a reproducing file or any code snippet, so this post does not yet qualify as a bug report and we are forced to guess. Your file may use ligatures in the text. "ff" is one of the 6 standard ligatures in Latin text, which means that one Unicode code point (and one glyph) represents multiple characters. By default, ligatures are passed through unchanged in text extraction; depending on your output device, they should still look OK. To confirm, try a modified combination of text extraction flag bits, e.g. flags=0. This dissolves ligatures into their component characters. For details, see the documentation.
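To illustrate what "one Unicode code point for multiple characters" means, here is a minimal sketch using only the Python standard library. It assumes the extracted text really contains the ligature code point U+FB00; the sample string is built by hand for illustration, and `dissolve_ligatures` is a hypothetical helper, not part of PyMuPDF and not how the library handles ligatures internally.

```python
import unicodedata

# U+FB00 (LATIN SMALL LIGATURE FF) is a single code point standing for "ff".
LIGATURE_FF = "\ufb00"


def dissolve_ligatures(text: str) -> str:
    """Expand Latin ligature code points (U+FB00..U+FB06) into plain letters.

    Hypothetical post-processing helper: it only approximates what
    dissolving ligatures during extraction would produce.
    """
    return "".join(
        # NFKC applies the compatibility decomposition, e.g. U+FB00 -> "ff".
        unicodedata.normalize("NFKC", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


# Hand-built stand-in for extracted text that kept the ligature.
extracted = "Handelsschi" + LIGATURE_FF + "e"

print(repr(extracted), len(extracted))
print(repr(dissolve_ligatures(extracted)), len(dissolve_ligatures(extracted)))
```

If the extracted text contains an unrelated letter rather than a ligature code point, a helper like this leaves it unchanged, which is one way to tell the two situations apart.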
Closing this for lack of response over an extended period. In a future release, we will change the default text flags for searches so that ligatures are no longer preserved.
Thank you for your effort. I silently applied the change you suggested, so you were not notified. I apologize if that made you feel your reply was in vain.
Description of the bug
In some documents, get_text outputs the wrong characters in words. For instance, the text in the PDF reads "Dort machten die Handelsschiffe auf der Überfahrt", but I get "Dort machten die Handelsschiye auf der Überfahrt". It happens with "ff" and probably other letter combinations. When copying from the document in a PDF reader like SumatraPDF, I also get "Dort machten die Handelsschiye auf der Überfahrt".
PyMuPDF version
1.23.x or earlier
Operating system
Windows
Python version
3.11