pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF

https://pymupdf.readthedocs.io/en/latest/pymupdf4llm

GNU Affero General Public License v3.0

Pymupdf4llm returns garbage values during parsing a simple page. #118

Closed AhsanAli1116 closed 1 month ago

AhsanAli1116 commented 1 month ago

I have a PDF of around 300 pages. When I extract it using pymupdf4llm, it returns text that looks like encoded bytes, and some of the information is missing.

JorjMcKie commented 1 month ago

Nothing can be done if you do not include the example file!

AhsanAli1116 commented 1 month ago

For your reference, the function I am using is pymupdf4llm.to_markdown.

JorjMcKie commented 1 month ago

One font embedded in your file does not contain a (valid) backtranslation table (CMAP) for the glyphs displayed. Therefore, no meaningful text can be extracted. You also do not use the most current package version, otherwise you would have seen lots of � characters in your output.
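A quick way to see which pages are affected is to measure how much of the extracted text consists of the replacement character U+FFFD (�). The following is a minimal sketch, not part of pymupdf4llm: the helper names `replacement_ratio` and `find_unreadable_pages` and the 0.2 threshold are illustrative assumptions, and it presumes a recent PyMuPDF version that emits � for glyphs it cannot map back to Unicode.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, rendered as � in extracted text


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def find_unreadable_pages(pdf_path: str, threshold: float = 0.2) -> list[int]:
    """Return 0-based numbers of pages whose U+FFFD share exceeds threshold.

    The threshold is an arbitrary illustrative value, not a library default.
    """
    # Imported lazily so the pure helper above works without PyMuPDF installed.
    import pymupdf

    bad_pages = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            if replacement_ratio(page.get_text()) > threshold:
                bad_pages.append(page.number)
    return bad_pages
```

Calling `find_unreadable_pages("input.pdf")` (the file name is a placeholder) would list the pages where the font without a usable CMAP dominates, which helps to judge how much of a 300-page document is really affected.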

AhsanAli1116 commented 1 month ago

thanks

AhsanAli1116 commented 1 month ago

> One font embedded in your file does not contain a (valid) backtranslation table (CMAP) for the glyphs displayed. Therefore, no meaningful text can be extracted. You also do not use the most current package version, otherwise you would have seen lots of � characters in your output.

Is there any workaround for this?

JorjMcKie commented 1 month ago

No, there is not. You could of course OCR the file, but this would also mean losing other valuable information like vector graphics. Also, many of the characters on the page are completely OK.
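Since many characters extract correctly, one possible compromise is to OCR only the pages that come out mostly unreadable and keep normal extraction everywhere else. The sketch below is an assumption-laden illustration, not a recommended fix from the maintainers: `needs_ocr`, `extract_text_with_ocr_fallback`, the 0.2 threshold and the 300 dpi setting are made up for this example. It uses PyMuPDF's `Page.get_textpage_ocr`, which requires a working Tesseract installation, and it returns plain text per page rather than the Markdown produced by `pymupdf4llm.to_markdown`.

```python
def needs_ocr(text: str, threshold: float = 0.2) -> bool:
    """Decide whether extracted text is dominated by U+FFFD characters.

    Pages with no text at all return False here; handling scanned pages
    is outside the scope of this sketch.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return False
    bad = sum(ch == "\ufffd" for ch in visible)
    return bad / len(visible) > threshold


def extract_text_with_ocr_fallback(pdf_path: str, language: str = "eng",
                                   dpi: int = 300) -> list[str]:
    """Return one text string per page, using OCR only where needed."""
    # Imported lazily so needs_ocr() can be used without PyMuPDF installed.
    import pymupdf

    texts = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # full=True OCRs the whole rendered page, not only its images.
                textpage = page.get_textpage_ocr(
                    language=language, dpi=dpi, full=True
                )
                text = page.get_text(textpage=textpage)
            texts.append(text)
    return texts
```

The trade-off described above still applies to every page that goes through OCR: its text comes from a rendered image, so layout hints and vector graphics information are not preserved for that page.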