Garbled code on Chinese reports

pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF

https://pymupdf.readthedocs.io/en/latest/pymupdf4llm

GNU Affero General Public License v3.0


Garbled code on Chinese reports #92

Closed by IronK77 1 month ago

IronK77 commented 2 months ago

I mainly work on converting Chinese reports at the moment. I would say that in 95% of cases there is no problem with recognition, but I have found one report that comes out with garbled encodings.

https://pdf.dfcfw.com/pdf/H2_AN202407251638299878_1.pdf

The package does well on a small part of the document (headers, for example), but for most of the content it simply fails.

JorjMcKie commented 1 month ago

Sorry, I have no permission to access this file. Please provide in a different way.

IronK77 commented 1 month ago

Maybe you can try this link: https://pdfupload.io/docs/7f9b0343

JorjMcKie commented 1 month ago

Thanks for the new link. Looking now.

JorjMcKie commented 1 month ago

This is not a bug, but a badly written PDF. No text extraction package can read the text successfully from e.g. page 3: not Adobe, not PDF XChange, etc. There is no way to do anything about this. It has nothing to do with PyMuPDF / PyMuPDF4LLM.
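For readers who hit the same situation in a batch of reports, a rough way to flag such pages before feeding them to a RAG pipeline is to measure how much of the extracted text consists of characters with no usable Unicode mapping. This is only a sketch under assumptions: the helper names `garbled_ratio` and `suspicious_pages` and the 0.2 threshold are made up for illustration, and the heuristic only catches U+FFFD replacement characters and private-use code points. If a PDF's fonts map glyphs to wrong but valid characters, this check will not notice.

```python
def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look undecodable.

    Counts U+FFFD (replacement character) and Private Use Area
    code points (U+E000..U+F8FF) as suspicious.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(1 for c in chars if c == "\ufffd" or "\ue000" <= c <= "\uf8ff")
    return bad / len(chars)


def suspicious_pages(page_texts, threshold=0.2):
    """Return 1-based page numbers whose garbled ratio exceeds the threshold.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if garbled_ratio(text) > threshold
    ]


# Possible usage with PyMuPDF (file name is a placeholder; older
# PyMuPDF versions use "import fitz" instead of "import pymupdf"):
#
#   import pymupdf
#   doc = pymupdf.open("report.pdf")
#   print(suspicious_pages(page.get_text() for page in doc))
```

Pages flagged this way can be skipped or routed to a different processing path instead of silently polluting the index.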

IronK77 commented 1 month ago

Okay, no worries. I asked because I tried some closed-source applications and they produced correct text in a .docx (though with clear structural issues, such as misplaced plain text and text boxes).
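The thread does not say how those applications recovered the text. One plausible explanation, offered here purely as an assumption, is that they ran OCR on the rendered page rather than reading the embedded text, which would also fit the structural problems described. PyMuPDF exposes OCR through `Page.get_textpage_ocr`, so a fallback along those lines might look like the sketch below. The function name `text_with_ocr_fallback` and the default `looks_garbled` predicate are hypothetical, and the OCR call needs a Tesseract installation with the `chi_sim` language data.

```python
def text_with_ocr_fallback(page, looks_garbled=None, language="chi_sim", dpi=300):
    """Return the page's embedded text, or OCR text if it looks garbled.

    `page` is expected to behave like a PyMuPDF Page. `looks_garbled`
    is any predicate on the extracted string; the default below is a
    crude illustrative check, not a reliable detector.
    """
    if looks_garbled is None:
        looks_garbled = lambda text: "\ufffd" in text

    native = page.get_text()
    if not looks_garbled(native):
        return native

    # Render the whole page and run Tesseract OCR on the image
    # (full=True ignores the embedded text layer entirely).
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    return page.get_text(textpage=textpage)


# Possible usage (file name is a placeholder):
#
#   import pymupdf
#   doc = pymupdf.open("report.pdf")
#   print(text_with_ocr_fallback(doc[2]))  # page 3, zero-based index 2
```

OCR output loses the original layout information and can misread characters, so results should be spot-checked against the rendered page.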