Closed vesko-vujovic closed 4 years ago
Hi @vesko-vujovic, thanks for your interest in the library. Could you please share a sample PDF (with any sensitive information redacted) along with some reproducible code detailing what should have happened and what did happen?
Assuming your question can be interpreted as "I have some French characters in my PDF but only English ones are getting extracted," have you tried changing the locale on your machine?
@samkit-jain sorry for not sharing a PDF, I really can't share these PDFs here.
In general, I have invoices in Romanian that contain the word Factură. After running pdfplumber's extract_words on such a document, I get Factur.
So changing the locale on my local machine would solve this issue?
Thanks for a quick reply.
No problem @vesko-vujovic, I think it should. You can set the LC_ALL environment variable to C.UTF-8 and try again. To do so on Linux, run export LC_ALL=C.UTF-8. You can also try running export PYTHONIOENCODING=UTF-8 as well.
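To confirm that the exported variables actually reached the Python process, a small check like the one below may help. It is only a sketch using the standard library: the function name encoding_report and the sample word are illustrative choices, and it reports what the interpreter sees for printing and I/O, which does not by itself prove anything about how pdfplumber decodes the PDF.

```python
import locale
import os
import sys


def encoding_report(sample="Factură"):
    """Collect the locale and encoding settings the interpreter actually sees.

    `encoding_report` is a hypothetical helper, not part of pdfplumber.
    """
    stdout_encoding = getattr(sys.stdout, "encoding", None) or "unknown"
    try:
        # Check whether the sample word survives the terminal's encoding.
        sample.encode(stdout_encoding)
        can_print = True
    except (UnicodeEncodeError, LookupError):
        can_print = False
    return {
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": stdout_encoding,
        "can_print_sample": can_print,
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

If can_print_sample is False, the missing letter may be lost when the result is printed rather than during extraction. If it is True and the letter is still missing, the locale is less likely to be the cause.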
@samkit-jain I've done everything that you suggested, but it still has problems, and only with the letter ă, both lowercase and uppercase.
Hi @vesko-vujovic, unfortunately, without the PDF, there's not much that I can do. I tried reproducing your issue by creating a PDF like ttt.pdf. When I run pdfplumber on it, it works fine. When you use this PDF, is the special character extracted or not? If it is, then the issue most likely lies with the PDF you are dealing with. If it is not, then it could be a configuration issue on your machine.
$ python
Python 3.8.2 (default, Feb 26 2020, 02:56:10)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdfplumber
>>> pdf = pdfplumber.open("ttt.pdf")
>>> pdf.pages[0].extract_text()
'Factură'
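The comparison between a known-good PDF and a problematic one can be scripted. The sketch below is an assumption-laden illustration rather than a pdfplumber feature: missing_characters and extract_all_text are hypothetical helper names, and the NFC normalization step covers only one possible cause (the letter being stored as "a" plus a combining breve instead of the precomposed "ă").

```python
import unicodedata


def missing_characters(expected, extracted):
    """Return characters of `expected` that do not appear in `extracted`.

    Both strings are normalized to NFC first, so "a" followed by a combining
    breve (U+0306) is treated the same as the precomposed "ă" (U+0103).
    """
    expected_nfc = unicodedata.normalize("NFC", expected)
    extracted_nfc = unicodedata.normalize("NFC", extracted)
    return [ch for ch in expected_nfc if ch not in extracted_nfc]


def extract_all_text(path):
    """Concatenate the text of every page; requires pdfplumber to be installed."""
    import pdfplumber  # imported lazily so the helper above has no dependency

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages without text.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


print(missing_characters("Factură", "Factur"))         # ['ă']
print(missing_characters("Factură", "Factura\u0306"))  # []
```

For example, missing_characters("Factură", extract_all_text("ttt.pdf")) should return an empty list on a machine where the test PDF extracts correctly. Running the same check against the problematic invoice then shows whether the letter is absent from the extracted text itself.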
Hi @samkit-jain, thank you for your effort. Apparently something is wrong with the PDF. I will close this issue, if you agree.
Happy to help, and thanks for confirming, @vesko-vujovic. Closing the issue now.
Hi, I'm using pdfplumber to extract words and the bbox for those words. It is doing a great job, except that it drops some letters from words that contain Romanian special characters.
Is there a different encoding or language pack that I can use in my case?