Closed by saturosfz 4 months ago
This comes from different default extraction flags. pymupdf4llm does not set the flag bit pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE,
which makes extraction use the glyph number as the Unicode value when the font has no Unicode number for that glyph.
I will revert this in version 0.0.9. I checked that your file can be processed with this flag bit set.
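Until the fix is released, the effect of the flag can be checked directly with PyMuPDF. The following is only a rough sketch, not what pymupdf4llm does internally: the function names `extract_text` and `replacement_ratio` are made up for illustration, and it assumes a PyMuPDF version that can be imported as `pymupdf`. The bytes \xef\xbf\xbd are the UTF-8 encoding of U+FFFD, the Unicode replacement character, so counting that character gives a quick measure of how much text could not be mapped.

```python
REPLACEMENT_CHAR = "\ufffd"  # UTF-8 encoded, this is b"\xef\xbf\xbd"


def replacement_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are U+FFFD."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)


def extract_text(path: str, cid_fallback: bool = True) -> str:
    """Extract plain text from all pages, with or without the CID fallback flag.

    Hypothetical helper for illustration; it calls PyMuPDF directly.
    """
    # Imported lazily so replacement_ratio() can be used without PyMuPDF.
    import pymupdf

    flags = pymupdf.TEXTFLAGS_TEXT  # PyMuPDF's default flags for plain text
    if cid_fallback:
        # Use the glyph number when the font has no Unicode value.
        flags |= pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE
    else:
        # Clear the bit to reproduce the behaviour described in this issue.
        flags &= ~pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE

    with pymupdf.open(path) as doc:
        return "\n".join(page.get_text("text", flags=flags) for page in doc)
```

Calling `replacement_ratio(extract_text("36.pdf", cid_fallback=False))` and then again with `cid_fallback=True` should show whether the flag makes a difference for a given file. Note that the glyph numbers returned with the flag set are not guaranteed to be meaningful characters.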
Fixed in v0.0.9.
36.pdf: the output just repeats \xef\xbf\xbd