Closed by dasapa 4 years ago
Hi @dasapa, thanks for your interest in the library. Could you please share the PDF so that we can investigate further? What happens when you use `pdftotext`? Is the text captured correctly?
Hi Samkit,
Thanks for the answer. There are two PDF files in the WeTransfer link. What is pdftotext? Is it another library?
Thanks for sharing the PDFs, @dasapa. `pdftotext` is a utility to extract text from PDFs.
If you try to copy-paste text from this PDF, gibberish comes out. PDF creators sometimes do this intentionally to prevent copying: the Unicode mapping (which character each glyph represents) is left out of the PDF or is incomplete, so any text extractor sees the glyphs but cannot tell which characters they stand for. One workaround could be to run OCR on the PDF.
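To check whether a given PDF is affected before trying anything else, here is a minimal sketch. It assumes the `pdftotext` command-line tool is installed and on the `PATH`. The helper names (`pdftotext_output`, `plausible_ratio`, `looks_garbled`) and the 0.5 threshold are hypothetical and not part of any library; the heuristic is only a rough guess at "does this look like real text" and will need tuning for your documents.

```python
import re
import subprocess

# A token counts as "plausible" if it is a plain word (optionally followed by
# one punctuation mark) or a number such as 12, 99.50 or 1,234.56.
_PLAUSIBLE = re.compile(r"^(?:[^\W\d_]+[.,;:!?]?|\d+(?:[.,]\d+)*)$")


def plausible_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like words or numbers."""
    tokens = text.split()
    if not tokens:
        # No text at all (for example a scanned PDF) is treated as "nothing usable".
        return 0.0
    good = sum(1 for token in tokens if _PLAUSIBLE.match(token))
    return good / len(tokens)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Assumed heuristic: flag text when fewer than `threshold` of its tokens look plausible."""
    return plausible_ratio(text) < threshold


def pdftotext_output(pdf_path: str) -> str:
    """Run `pdftotext -layout <pdf> -` and return what it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout
```

Usage would be something like `looks_garbled(pdftotext_output("invoice.pdf"))`, where `invoice.pdf` is a placeholder path. If this returns `True`, the PDF itself likely lacks a usable character mapping, and the problem is not specific to one extraction library.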
I am having the same issue. In my case, most of the lines are replaced with question marks, so most of the content is lost.
@SAIVENKATARAJU were you able to figure out why you were getting the question marks? I am having a similar issue with some PDFs.
Thanks!
@afierro23 In my case, I was not able to, because the PDF creators themselves use custom encoding techniques to prevent copying. We should go with OCR if required.
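For anyone who wants to try the OCR route, here is a minimal sketch. It assumes the third-party packages `pdf2image` (which needs Poppler) and `pytesseract` (which needs the Tesseract binary) are installed. The function name `ocr_pdf` and its `rasterize` / `recognize` hooks are hypothetical, added only so the two steps can be swapped or tested separately. Note that this yields plain text per page, not structured tables.

```python
from typing import Any, Callable, Iterable, List, Optional


def ocr_pdf(
    pdf_path: str,
    dpi: int = 300,
    lang: str = "eng",
    rasterize: Optional[Callable[[str], Iterable[Any]]] = None,
    recognize: Optional[Callable[[Any], str]] = None,
) -> List[str]:
    """Render each page to an image, run OCR on it, and return one string per page."""
    if rasterize is None:
        # Imported lazily so the function can be used with a custom rasterizer.
        from pdf2image import convert_from_path

        def rasterize(path: str) -> Iterable[Any]:
            return convert_from_path(path, dpi=dpi)

    if recognize is None:
        import pytesseract

        def recognize(image: Any) -> str:
            return pytesseract.image_to_string(image, lang=lang)

    # OCR reads the rendered pixels, so the broken glyph-to-character
    # mapping inside the PDF no longer matters.
    return [recognize(page) for page in rasterize(pdf_path)]
```

Calling `ocr_pdf("invoice.pdf")` (a placeholder path) would use the default Poppler and Tesseract backends. A higher `dpi` usually helps with small invoice fonts at the cost of speed, and `lang` should match the document language.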
Hello,
I have extracted data from several invoices and everything went perfectly, but for one of them the extracted text shows characters like the following:
**_++IIKKB=B?=J?KJHK=HJ=@ +IKJH=!="&& +IKJH= !=$#'! +@A=ECK=+=@A=CK= -LK?E?IK -LK?E?IK 6EFKI6EFCAAH= !=$#" '& 6EFKI6EFAIFA?BE? !=$"&$ %%&_**
Any suggestions on whether this can be fixed?
Thank you