AhsanAli1116 closed 1 month ago
Nothing can be done if you do not include the example file!
For your reference, I am using the function below: pymupdf4llm.to_markdown
One font embedded in your file does not contain a (valid) back-translation table (CMAP) for the glyphs displayed. Therefore, no meaningful text can be extracted for those glyphs. You are also not using the most current package version; otherwise you would have seen lots of � characters in your output.
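To find out which pages are affected, one option is to count the � (U+FFFD) replacement characters in the text extracted for each page. The following is only a minimal sketch in plain Python: the helper names `replacement_ratio` and `suspicious_pages` and the 20% threshold are hypothetical choices for illustration, not part of pymupdf4llm, and it assumes you already have one extracted string per page.

```python
REPLACEMENT_CHAR = "\ufffd"  # what a current package version emits for unmappable glyphs


def replacement_ratio(text: str) -> float:
    """Share of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    bad = sum(1 for ch in visible if ch == REPLACEMENT_CHAR)
    return bad / len(visible)


def suspicious_pages(page_texts: list[str], threshold: float = 0.2) -> list[int]:
    """Return 0-based indices of pages whose U+FFFD share exceeds the threshold.

    The threshold is an arbitrary illustrative value; tune it for your files.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if replacement_ratio(text) > threshold
    ]
```

Pages that come back from `suspicious_pages` are the ones where a font without a usable CMAP is likely in play; the remaining pages can be used as extracted.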
thanks
Is there any workaround for this?
No, there is not. You could of course OCR the file, but this would also mean losing other valuable information, such as vector graphics. Many of the characters on the page are also completely OK.
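If OCR is acceptable for the damaged pages only, a per-page fallback keeps the normal extraction everywhere else. This is a rough sketch and not a recommended fix: it assumes a recent PyMuPDF that can be imported as `pymupdf` (older versions use `import fitz`), a working Tesseract installation for `Page.get_textpage_ocr`, and it works on plain text rather than on the Markdown that pymupdf4llm produces. The function names are hypothetical, and using the presence of a single � as the trigger is an illustrative choice.

```python
def page_text_with_ocr_fallback(page, language: str = "eng", dpi: int = 300) -> str:
    """Extract plain text from a page, re-reading it via OCR if U+FFFD shows up."""
    text = page.get_text("text")
    if "\ufffd" not in text:
        return text  # normal extraction looks usable, keep it
    # full=True asks PyMuPDF to OCR the whole rendered page (requires Tesseract).
    # This also replaces the characters on the page that extracted correctly.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    return page.get_text("text", textpage=textpage)


def document_text_with_ocr_fallback(path: str) -> list[str]:
    """Return one text string per page of the PDF at `path`."""
    import pymupdf  # imported lazily so the helper above has no hard dependency

    with pymupdf.open(path) as doc:
        return [page_text_with_ocr_fallback(page) for page in doc]
```

As noted above, text recovered this way comes from a rendered image, so anything OCR cannot see as text, such as vector graphics, is not recovered, and the result is only as good as the OCR quality.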
I have a PDF of around 300 pages. When I extract it using pymupdf4llm, it returns text that looks like encoded bytes, and the information is missing.