Closed hrhktkbzyy closed 2 months ago
This is not a bug; it goes back to properties (deficiencies) of the font(s) used. If a glyph contains no back-reference to the Unicode code point that originated it, then there is no way to determine that code point. This is what is happening in every case where a � appears.
In addition, PyMuPDF's default extraction flags use the glyph number instead of the Unicode when the Unicode's value is 0xFFFD (which is what delivers that �). So you can try the extraction using flags=0 and see what happens instead.
But as you report: when other extractors also deliver garbage, then we are simply out of luck!
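The flags=0 suggestion above can be tried with something like the following. This is only a minimal sketch: the helper names (`count_replacements`, `compare_flags`, `report`) are made up for illustration, and it assumes a PyMuPDF version where `import pymupdf` works (older releases use `import fitz`). It extracts each page once with the default flags and once with flags=0, then counts the U+FFFD characters in each result.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the code point that renders as the replacement character


def count_replacements(text):
    """Return how many U+FFFD characters the extracted text contains."""
    return text.count(REPLACEMENT_CHAR)


def compare_flags(page):
    """Extract a page's text with default flags and with flags=0.

    `page` is expected to be a PyMuPDF Page; any object with a
    compatible get_text() method works as well.
    Returns {label: (text, number_of_U+FFFD_characters)}.
    """
    default_text = page.get_text("text")          # library default flags
    plain_text = page.get_text("text", flags=0)   # all extraction flags off
    return {
        "default": (default_text, count_replacements(default_text)),
        "flags=0": (plain_text, count_replacements(plain_text)),
    }


def report(pdf_path):
    """Print a per-page comparison for the PDF at pdf_path (hypothetical helper)."""
    import pymupdf  # imported lazily so the helpers above work without it

    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            for label, (text, bad) in compare_flags(page).items():
                print(f"page {page.number} [{label}]: "
                      f"{len(text)} chars, {bad} U+FFFD")
```

Calling `report("king arthur.pdf")` on one of the attached files would show whether the two settings differ. If both variants are unreadable, the font simply does not carry the information needed, and no flag will recover the text.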
Description of the bug
When extracting text from the attached PDF, the output contains garbled characters. Additionally, when I tried copying & pasting the content from other PDF viewers, similar issues occurred.
I'm unsure if this is related to encoding settings or if there is a way to correct this behavior. Any guidance or potential fixes would be appreciated.
How to reproduce the bug
PDF is as attached
king arthur.pdf The Phantom of the Opera.pdf
Code Sample:
Output:
The extracted content includes cid values such as:
PyMuPDF version
1.24.9
Operating system
MacOS
Python version
3.12