pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License
5.63k stars, 906 forks

Getting (cid:20)(cid:21)(cid:18)(cid:19)(cid:23)(cid:18)(cid:21) ... instead of text #746

Open awolad-atlasprimer opened 2 years ago

awolad-atlasprimer commented 2 years ago

I'm not getting text from the attached PDF. Instead, extraction returns a lot of CIDs. I'm using pdfminer.six==20220319.

My code:

from pdfminer.high_level import extract_text

pdf = 'Gartner_Reprint.pdf'
# page_numbers is zero-indexed, so 13 selects the 14th page
text = extract_text(pdf, page_numbers=[13])

print(text)

Gartner_Reprint.pdf

pietermarsman commented 2 years ago

Your PDF is missing a "ToUnicode" mapping for its glyphs.

A PDF has separate representations for how to draw glyphs (e.g. the font) and for what they mean in terms of Unicode. You see the glyphs, and if you copy-paste the text you get the Unicode.
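One way to check whether a page's fonts carry that mapping is to look for a /ToUnicode entry in each font dictionary. The following is a minimal sketch, not part of pdfminer.six itself: the names `fonts_missing_tounicode` and `page_fonts` are made up for illustration, and it assumes the fonts are listed directly in the page's own resources. A font without /ToUnicode may still decode correctly through its encoding, so treat the result as a hint rather than a verdict.

```python
from typing import Dict, List, Mapping


def fonts_missing_tounicode(fonts: Mapping[str, Mapping]) -> List[str]:
    """Return the names of fonts whose dictionary has no ToUnicode entry."""
    return sorted(name for name, spec in fonts.items() if "ToUnicode" not in spec)


def page_fonts(path: str, page_index: int) -> Dict[str, dict]:
    """Load the resolved font dictionaries of one zero-indexed page."""
    # Imported lazily so the pure helper above works without pdfminer.six.
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdftypes import resolve1

    with open(path, "rb") as fp:
        for index, page in enumerate(PDFPage.get_pages(fp)):
            if index == page_index:
                # The Font entry and each font may be indirect references.
                fonts = resolve1((page.resources or {}).get("Font")) or {}
                return {name: resolve1(ref) for name, ref in fonts.items()}
    raise IndexError(f"no page with index {page_index}")
```

For the file in this issue, the call would look like `fonts_missing_tounicode(page_fonts('Gartner_Reprint.pdf', 13))`.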

However, your PDF is missing the Unicode mapping. If you copy-paste from the PDF directly you will get gibberish. pdfminer.six does the same thing, only automated, and gets the same result (gibberish).
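Since the gibberish shows up as `(cid:N)` placeholders in the extracted text, a script can at least detect affected pages and route them elsewhere (for example to OCR). This is a small sketch under that assumption; `cid_ratio` is an illustrative helper name, not a pdfminer.six function, and any threshold applied to its result is a choice left to the caller.

```python
import re

# Matches the placeholder emitted for a glyph with no Unicode mapping.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def cid_ratio(text: str) -> float:
    """Share of extracted symbols that are (cid:N) placeholders.

    Each placeholder counts as one symbol; every other non-whitespace
    character counts as one readable symbol.
    """
    tokens = CID_TOKEN.findall(text)
    remainder = CID_TOKEN.sub("", text)
    readable = sum(1 for ch in remainder if not ch.isspace())
    total = len(tokens) + readable
    return len(tokens) / total if total else 0.0
```

A page whose ratio is close to 1.0 is almost entirely unmapped glyphs, while a ratio near 0.0 means the text came through normally.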

Not much we can do (sadly) to improve the results.

KunalGehlot commented 1 year ago

I found something exciting in this issue.

I faced the same issue when I tried to copy the text directly using the Mac Preview app and Adobe Acrobat Pro. But when I opened the PDF in Acrobat Pro and went to Edit PDF, I could copy all the text correctly.

So there might be a way to handle this issue. This StackOverflow answer explains really well what might be going on.

I'll investigate it further.

deveshcse commented 1 month ago

I am using Camelot to extract tables from PDFs. It also extracts similar unknown CIDs and numbers, even though I can copy and paste the text without any problem. I am working on a Windows laptop.