jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Strange characters when extracting text #274

Closed: dasapa closed this issue 4 years ago

dasapa commented 4 years ago

Hello,

I have extracted data from several invoices and everything has gone perfectly, but for one of them the extracted text contains characters like the following:

**_++IIKKB=B?=J?KJHK=HJ=@ +IKJH=!="&&  +IKJH= !=$#'!    +@A=ECK=+=@A=CK= -LK?E?IK -LK?E?IK 6EFKI6EFCAAH= !=$#" '& 6EFKI6EFAIFA?BE? !=$"&$ %%&

_**

Any suggestions on how this can be fixed?

Thank you

samkit-jain commented 4 years ago

Hi @dasapa, thanks for your interest in the library. Could you please share the PDF so that we can investigate further? Also, what happens when you use pdftotext? Is the text captured correctly?

dasapa commented 4 years ago

Hi Samkit,

Thanks for the answer. There are two PDF files at the WeTransfer link below. What is pdftotext? Is it another library?

https://wetransfer.com/downloads/932b33efbce9be5fb51972d908bf5d7c20200927190237/e3355946c178b7b3430d4d07d60807be20200927190304/ee1009

samkit-jain commented 4 years ago

Thanks for sharing the PDFs, @dasapa. pdftotext is a command-line utility that extracts text from PDFs.
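
For anyone who wants to run the same comparison from Python, here is a minimal sketch that calls the `pdftotext` command-line tool and returns its output. It assumes `pdftotext` (shipped with Poppler or Xpdf) is installed and on the PATH; the helper names `pdftotext_command` and `run_pdftotext` are made up for this example and are not part of pdfplumber.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to keep the physical layout of the page
        cmd.append("-layout")
    # "-" as the output file makes pdftotext write to standard output
    cmd += [pdf_path, "-"]
    return cmd


def run_pdftotext(pdf_path, layout=True):
    """Return the text pdftotext extracts from pdf_path."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

If the output of `run_pdftotext("invoice.pdf")` (the file name is a placeholder) is just as garbled as what pdfplumber returns, the problem is most likely in the PDF itself rather than in either tool.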

If you try to copy and paste text from the PDF, gibberish comes out. This happens when the Unicode mapping (which character each glyph represents) is missing from the PDF or is incomplete. PDF creators sometimes do this intentionally to prevent people from copying text out of the document. One workaround could be to run OCR on the PDF.
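
A rough sketch of the OCR workaround, in case it helps: render each page to an image with pdfplumber and pass the image to an OCR engine. Using `pytesseract` (which needs the Tesseract binary installed) is an assumption here, since any OCR engine would do, and `ocr_pages` / `ocr_pdf` are hypothetical helper names. The third-party imports sit inside `ocr_pdf` so the rest can be tried without them.

```python
def ocr_pages(pages, image_to_string, resolution=300):
    """OCR each page: render it to an image, then recognise the text.

    pages: pdfplumber Page objects (anything with a to_image() method).
    image_to_string: a callable that turns an image into text,
        for example pytesseract.image_to_string (an assumption).
    """
    texts = []
    for page in pages:
        # Page.to_image() returns a PageImage; .original is the rendered PIL image
        image = page.to_image(resolution=resolution).original
        texts.append(image_to_string(image))
    return texts


def ocr_pdf(path, resolution=300):
    """Return one OCR'd string per page of the PDF at path."""
    import pdfplumber   # third-party
    import pytesseract  # third-party; also requires the Tesseract binary

    with pdfplumber.open(path) as pdf:
        return ocr_pages(pdf.pages, pytesseract.image_to_string, resolution)
```

A higher `resolution` usually gives the OCR engine more to work with at the cost of speed; 300 is just a starting point, not a recommendation from the library.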

SAIVENKATARAJU commented 2 years ago

I am having the same issue. In my case, most of the lines are replaced with question marks, so most of the text is lost.
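
One way to spot affected pages automatically is to measure how much of the extracted text consists of placeholder characters. The sketch below counts `?`, the Unicode replacement character, and `(cid:N)` tokens (which pdfminer-based extractors emit for glyphs they cannot map). The function name `unmapped_ratio` and any threshold you pick are assumptions, and this heuristic will not catch gibberish made of ordinary letters and symbols like the sample in the original post.

```python
import re

# "(cid:123)"-style tokens stand in for glyphs with no Unicode mapping
_CID_TOKEN = re.compile(r"\(cid:\d+\)")


def unmapped_ratio(text):
    """Share of visible characters that look like unmapped-glyph placeholders."""
    cid_hits = len(_CID_TOKEN.findall(text))
    remainder = _CID_TOKEN.sub("", text)
    visible = [c for c in remainder if not c.isspace()]
    placeholders = sum(c in "?\ufffd" for c in visible)
    # Each (cid:N) token counts as one visible, unmapped character
    total = len(visible) + cid_hits
    if total == 0:
        return 0.0
    return (placeholders + cid_hits) / total
```

A page whose ratio is well above what normal prose would produce (genuine question marks are rare) is a reasonable candidate for OCR.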

afierro23 commented 2 years ago

@SAIVENKATARAJU, were you able to figure out why you were getting the question marks? I am having a similar issue with some PDFs.

Thanks!

SAIVENKATARAJU commented 2 years ago

@afierro23 In my case, I was not able to fix it, because the PDF creators themselves use custom encoding techniques to prevent copying. We should go with OCR if the text is required.