Open awolad-atlasprimer opened 2 years ago
Your PDF is missing a "ToUnicode" mapping for its glyphs.
A PDF has separate representations for how to draw glyphs (e.g. the font) and for what they mean in terms of Unicode. You see the glyphs, and if you copy-paste the text you get the Unicode.
However, your PDF is missing the Unicode mapping. If you copy-paste from the PDF directly you will get gibberish. pdfminer.six does the same thing, only automated, and gets the same result (gibberish).
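To check how much of an extraction is affected, one rough option is to count the `(cid:N)` placeholders that pdfminer.six emits for glyphs it cannot map to Unicode. The sketch below is only an illustration: the helper names `cid_share` and `looks_unmapped` and the 0.5 threshold are assumptions, not part of pdfminer.six.

```python
import re

# pdfminer.six writes "(cid:N)" when it cannot map a glyph to Unicode.
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_share(text: str) -> float:
    """Fraction of extracted glyphs that came out as (cid:N) placeholders.

    Each placeholder counts as one glyph; every other non-whitespace
    character counts as one glyph as well.
    """
    tokens = CID_TOKEN.findall(text)
    remainder = CID_TOKEN.sub("", text)
    readable = sum(1 for ch in remainder if not ch.isspace())
    total = len(tokens) + readable
    return len(tokens) / total if total else 0.0


def looks_unmapped(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical heuristic: True when placeholders dominate the output."""
    return cid_share(text) >= threshold


# Example input; in practice the text could come from
# pdfminer.high_level.extract_text("Gartner_Reprint.pdf").
sample = "(cid:12)(cid:7) ab"
print(cid_share(sample), looks_unmapped(sample))
```

A high share suggests the fonts lack a usable ToUnicode mapping, in which case OCR on rendered pages may be the more realistic route than tuning the text extraction.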
Not much we can do (sadly) to improve the results.
I found something interesting about this issue.
I hit the same problem when I tried to copy the text directly in the macOS Preview app and in Adobe Acrobat Pro. But when I opened the PDF in Acrobat Pro and went to Edit PDF, I could copy all of the text correctly.
So there might be a way to handle this issue. This StackOverflow answer explains quite well what might be going on.
I'll investigate it further.
I am using Camelot to extract tables from PDFs. It also extracts similar unknown CIDs and numbers, even though I can copy-paste the text without any problem. I'm working on a Windows laptop.
I'm not getting any text from the attached PDF. Instead, it's returning a lot of CIDs. I'm using pdfminer.six==20220319.
My code:
Gartner_Reprint.pdf