Closed hugotong6425 closed 2 years ago
Your PDF has a missing or wrong "ToUnicode" mapping for its glyphs.
A PDF has separate representations for how to draw glyphs (e.g. the font) and for what they mean in terms of Unicode. You see the glyphs, and if you copy-paste the text you get the Unicode.
However, your PDF has a missing or wrong Unicode mapping. If you copy-paste from the PDF directly you will get gibberish. pdfminer.six does the same thing, only automated, and gets the same result (gibberish).
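To gauge how much of a document is affected, one option is to scan the extracted text for the `(cid:N)` placeholders that pdfminer.six writes when a glyph has no usable Unicode mapping. The sketch below is only an illustration: the helper names `find_unmapped_cids` and `unmapped_fraction` are made up for this example, and it only catches *missing* mappings. A *wrong* mapping produces ordinary-looking gibberish that this check cannot detect.

```python
import re

# pdfminer.six emits "(cid:N)" for glyphs without a Unicode mapping.
# IGNORECASE is a precaution in case the text was post-processed.
CID_PATTERN = re.compile(r"\(cid:(\d+)\)", re.IGNORECASE)


def find_unmapped_cids(text):
    """Return the sorted, de-duplicated CID numbers found in extracted text."""
    return sorted({int(n) for n in CID_PATTERN.findall(text)})


def unmapped_fraction(text):
    """Rough share of non-whitespace characters that are "(cid:N)" placeholders."""
    placeholders = CID_PATTERN.findall(text)
    # Remove the placeholders, then count what is left as mapped characters.
    remainder = CID_PATTERN.sub("", text)
    mapped = sum(1 for ch in remainder if not ch.isspace())
    total = mapped + len(placeholders)
    return len(placeholders) / total if total else 0.0
```

The input text could come from `pdfminer.high_level.extract_text("test.pdf")`. A high fraction suggests that the font's ToUnicode map is missing rather than merely incomplete.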
Not much we can do (sadly) to improve the results.
I am very sorry to keep asking questions in your closed issue, but I have a related question. I have run into the same problem, and I'd like to use OCR to solve it. However, I do not want to OCR the whole PDF file; I only want to OCR the glyphs of the CID characters. Is there any way or SDK in pdfminer to get the glyphs? You can find the detailed description of my question here: https://stackoverflow.com/questions/74715436/how-to-extract-text-with-custom-cid-in-pdf?noredirect=1#comment131868765_74715436. Thanks for your reply.
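A possible starting point, offered as a sketch and not as a confirmed answer from the maintainers: pdfminer.six's layout objects carry a bounding box for every character, so the affected glyphs can at least be located on the page even though their text is unknown. The code below walks the layout tree for `LTChar` objects and keeps those whose text is a `(cid:N)` placeholder. The helper names `select_unmapped` and `iter_chars` are hypothetical; rendering the page to an image and running OCR on the cropped boxes would need separate tools and is not shown.

```python
import re

# An LTChar whose glyph has no Unicode mapping reports its text as "(cid:N)".
CID_TEXT = re.compile(r"^\(cid:\d+\)$")


def select_unmapped(chars):
    """Keep (page_no, text, bbox) records whose text is a "(cid:N)" placeholder."""
    return [c for c in chars if CID_TEXT.match(c[1])]


def iter_chars(pdf_path):
    """Yield (page_no, text, bbox) for every LTChar in the PDF."""
    # Imported lazily so select_unmapped() can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar

    def walk(obj):
        # LTChar is a leaf; containers (pages, text boxes, lines) are iterable.
        if isinstance(obj, LTChar):
            yield obj
        elif hasattr(obj, "__iter__"):
            for child in obj:
                yield from walk(child)

    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        for ch in walk(page):
            yield page_no, ch.get_text(), ch.bbox
```

For example, `select_unmapped(iter_chars("test.pdf"))` would list the page number and box of each unmapped glyph. Each `bbox` is `(x0, y0, x1, y1)` in PDF points with the origin at the bottom-left of the page, so the y-axis has to be flipped and scaled before cropping a rendered image.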
Description
When I try to extract text from the PDF, some Chinese characters are output as (CID:xxx) instead of the actual characters. PDF: test.pdf
Steps to reproduce the bug:
A screenshot of part of the output from the above code:
Notes
This problem looks similar to https://github.com/pdfminer/pdfminer.six/issues/566, but when I process A0095607-010169.pdf, the file mentioned in that issue, all Chinese characters are extracted correctly.
Please let me know if more information is required to solve this problem! Thanks!