Chinese characters wrongly extracted as (cid: xxx)

hugotong6425 commented 2 years ago

Description

When I try to extract text from the PDF, some Chinese characters come out as (cid:xxx). PDF: test.pdf

Steps to reproduce the bug:

mkdir pdfminer/cmap
python tools/conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt
python tools/conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer/cmap Adobe-GB1 cmaprsrc/cid2code_Adobe_GB1.txt
python tools/conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer/cmap Adobe-Japan1 cmaprsrc/cid2code_Adobe_Japan1.txt
python tools/conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer/cmap Adobe-Korea1 cmaprsrc/cid2code_Adobe_Korea1.txt
python setup.py install

python tools/pdf2txt.py samples/test.pdf -o output.html

A screenshot of part of the output from the above code: output_1

Notes

This problem looks similar to https://github.com/pdfminer/pdfminer.six/issues/566, but when I process A0095607-010169.pdf, the file mentioned in that issue, all Chinese characters are extracted correctly.

Please let me know if more information is required to solve this problem! Thanks!

pietermarsman commented 2 years ago

Your PDF has a missing or wrong "ToUnicode" mapping for its glyphs.

A PDF has separate representations for how to draw glyphs (e.g. the font) and for what those glyphs mean in terms of Unicode. You see the glyphs, and if you copy-paste the text you get the Unicode.

However, your PDF has a missing or wrong Unicode mapping. If you copy-paste from the PDF directly you will get gibberish. Pdfminer.six does the same thing, only automated, and gets the same result (gibberish).

Not much we can do (sadly) to improve the results.
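
For anyone who wants to gauge how badly a document is affected, here is a minimal sketch. It assumes only what this thread shows, namely that unmapped glyphs appear in the extracted text as (cid:N) placeholders. The names `find_cids` and `cid_ratio` are hypothetical helpers, not part of pdfminer.six.

```python
import re

# Placeholder that shows up in extracted text for a glyph without a usable
# Unicode mapping, e.g. "(cid:123)". Assumption: the format seen in this issue.
CID_PATTERN = re.compile(r"\(cid:\s*(\d+)\)", re.IGNORECASE)


def find_cids(text):
    """Return the CID numbers of all placeholders in `text`, in order."""
    return [int(m.group(1)) for m in CID_PATTERN.finditer(text)]


def cid_ratio(text):
    """Fraction of non-whitespace characters that are unmapped placeholders.

    Each placeholder counts as one character.
    """
    placeholders = len(find_cids(text))
    # Drop the placeholders, then the whitespace, to count readable characters.
    readable = len(re.sub(r"\s", "", CID_PATTERN.sub("", text)))
    total = placeholders + readable
    return placeholders / total if total else 0.0
```

A high ratio suggests the mapping is missing for most of the font, in which case a fallback such as OCR is likely the only option.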

Ichiruchan commented 1 year ago

I am sorry to keep asking in a closed issue, but I have a related question. I have run into the same problem and would like to use OCR to solve it. However, I do not want to OCR the whole PDF file; I only want to OCR the glyphs of the cid characters. Is there any way or SDK in pdfminer to get a glyph? You can find a detailed description of the question here: https://stackoverflow.com/questions/74715436/how-to-extract-text-with-custom-cid-in-pdf?noredirect=1#comment131868765_74715436. Thanks for your reply.
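
One possible direction, sketched under assumptions rather than offered as a confirmed answer: pdfminer.six layout objects of type `LTChar` carry both the extracted text and a bounding box, so the characters whose text is a (cid:N) placeholder can be located and only those regions handed to an OCR tool. The helper names below (`select_unmapped`, `iter_chars`, `unmapped_boxes`) are hypothetical, and the rendering and OCR steps are left out.

```python
import re

# Text that a single character carries when it has no Unicode mapping.
CID_TEXT = re.compile(r"^\(cid:(\d+)\)$")


def select_unmapped(chars):
    """Given (text, bbox) pairs, return [(cid, bbox)] for placeholder characters."""
    hits = []
    for text, bbox in chars:
        match = CID_TEXT.match(text)
        if match:
            hits.append((int(match.group(1)), bbox))
    return hits


def iter_chars(layout_obj):
    """Recursively yield (text, bbox) for every LTChar under a layout object."""
    from pdfminer.layout import LTChar  # lazy import; requires pdfminer.six

    if isinstance(layout_obj, LTChar):
        yield layout_obj.get_text(), layout_obj.bbox
    elif hasattr(layout_obj, "__iter__"):
        # Containers (pages, text boxes, lines, figures) are iterable.
        for child in layout_obj:
            yield from iter_chars(child)


def unmapped_boxes(pdf_path):
    """Map 1-based page number -> [(cid, bbox)] of glyphs that need OCR."""
    from pdfminer.high_level import extract_pages

    result = {}
    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        hits = select_unmapped(iter_chars(page))
        if hits:
            result[page_no] = hits
    return result
```

The bounding boxes are in PDF coordinates (points, origin at the bottom-left of the page), so they would need to be converted before cropping a rendered page image for OCR.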

pdfminer / pdfminer.six

Chinese characters wrongly extracted as (cid: xxx) #771
