Unexpected results in pymupdf4llm but pymupdf works

pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF

https://pymupdf.readthedocs.io/en/latest/pymupdf4llm

GNU Affero General Public License v3.0


Unexpected results in pymupdf4llm but pymupdf works #71

Closed. saturosfz closed this issue 4 months ago.

saturosfz commented 4 months ago

Attached file: 36.pdf. The pymupdf4llm output just repeats \xef\xbf\xbd.
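For readers unfamiliar with the byte sequence in the report: \xef\xbf\xbd is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which typically appears when a glyph could not be mapped to a real code point. The following is a minimal sketch for spotting this symptom in extracted text. The helper name `replacement_ratio` is hypothetical and not part of PyMuPDF or pymupdf4llm.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode replacement character


def replacement_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are U+FFFD.

    Hypothetical helper for illustration only.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


# The three bytes from the report decode to a single U+FFFD character.
decoded = b"\xef\xbf\xbd".decode("utf-8")
```

A ratio close to 1.0 for a page suggests that almost nothing on it was mapped to usable Unicode, which matches the behavior described in this issue.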

JorjMcKie commented 4 months ago

This comes from using different default extraction flags. In pymupdf4llm we do not set the flag bit pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE, which causes the glyph number to be used as the Unicode value if the Unicode number is not present in the font.

In version 0.0.9 I will revert this. I checked that your file can be processed with this flag bit set.
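As an illustration of the flag discussed here, the sketch below extracts plain text with PyMuPDF directly and sets TEXT_CID_FOR_UNKNOWN_UNICODE explicitly. It is not the pymupdf4llm implementation. It assumes a PyMuPDF version that supports `import pymupdf` (older releases use `import fitz`), and the function name `extract_with_cid_fallback` is hypothetical.

```python
def extract_with_cid_fallback(path: str) -> str:
    """Extract plain text from every page with TEXT_CID_FOR_UNKNOWN_UNICODE set.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so the helper can be defined without PyMuPDF installed.
    import pymupdf

    # Start from the default flags for plain-text extraction and add the CID
    # fallback bit. OR-ing is harmless if the defaults already include it.
    flags = pymupdf.TEXTFLAGS_TEXT | pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE

    with pymupdf.open(path) as doc:
        return "\n".join(page.get_text("text", flags=flags) for page in doc)
```

Calling `extract_with_cid_fallback("36.pdf")` would run this on the file from the report. Comparing its result with the pymupdf4llm output is one way to check whether the missing flag explains a difference.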

JorjMcKie commented 4 months ago

Fixed in v0.0.9.