IronK77 closed 1 month ago
Sorry, I don't have permission to access this file. Please provide it in a different way.
Maybe you can try this link: https://pdfupload.io/docs/7f9b0343
Thanks for the new link. Looking now.
This is not a bug, but a badly written PDF. No text extraction package can read the text successfully, e.g. from page 3: not Adobe, not PDF XChange, etc. Nothing can be done about this, and it has nothing to do with PyMuPDF / PyMuPDF4LLM.
Okay, no worries. I asked because I tried some closed-source applications and they produced correct text in a docx (though there are clear structural issues, like misplaced plain text and text boxes).
I mainly focus on converting Chinese reports at the moment. I would say in 95% of cases there is no problem with recognition, but I have found one with garbled encodings.
https://pdf.dfcfw.com/pdf/H2_AN202407251638299878_1.pdf
The package does well on a small part of the document (headers, for example), but for most of the content it just crashed.
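For anyone who wants to check which pages of such a file are affected before converting it, here is a rough sketch. It is not part of PyMuPDF4LLM: `garbled_ratio`, `report_garbled_pages`, and the 0.2 threshold are hypothetical names and values chosen for illustration. The assumption is that characters without a usable Unicode mapping show up in the extracted text as U+FFFD, private-use, unassigned, or control code points.

```python
import unicodedata


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that look unmapped or garbled.

    Heuristic (an assumption, not a PyMuPDF feature): count U+FFFD plus
    private-use (Co), unassigned (Cn), and control (Cc) code points.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in ("Co", "Cn", "Cc")
    )
    return bad / len(chars)


def report_garbled_pages(path: str, threshold: float = 0.2):
    """Return (1-based page number, ratio) for pages above the threshold."""
    # Imported lazily so garbled_ratio works without PyMuPDF installed.
    import pymupdf  # older PyMuPDF releases are imported as "fitz"

    flagged = []
    with pymupdf.open(path) as doc:
        for page in doc:
            ratio = garbled_ratio(page.get_text())
            if ratio > threshold:
                flagged.append((page.number + 1, ratio))
    return flagged
```

Calling `report_garbled_pages("report.pdf")` on a local copy (the file name is a placeholder) lists the suspicious pages. Note that this only catches characters that fail to map at all; a font whose encoding maps to valid but wrong characters will pass unnoticed.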