jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Is there a different language pack for special characters #252

Closed vesko-vujovic closed 4 years ago

vesko-vujovic commented 4 years ago

Hi, I'm using pdfplumber to extract words and the bbox for each of them. It does a great job, except that it drops some letters from words that contain Romanian special characters.

Is there a different encoding or language pack that I can use in my case?

samkit-jain commented 4 years ago

Hi @vesko-vujovic, thanks for your interest in the library. Could you please share a sample PDF (with any sensitive information redacted) along with some reproducible code, detailing what should have happened and what did happen?

Assuming your question can be interpreted as "I have some French characters in my PDF but only English ones are getting extracted": have you tried changing the locale on your machine?

vesko-vujovic commented 4 years ago

@samkit-jain Sorry for not sharing a PDF; I really can't share these PDFs here.

In general, I have invoices in Romanian that contain the word Factură. After running pdfplumber's extract_words on such a document, I get Factur.

So changing the locale on my local machine would solve this issue?

Thanks for a quick reply.
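
When the PDF itself cannot be shared, one way to still give a precise report is to compare the expected word with what `extract_words` returned and list the characters that went missing, with their code points. The following is only a minimal sketch, not part of pdfplumber: `load_words` and `dropped_characters` are hypothetical helper names, `invoice.pdf` is a placeholder path, and the comparison assumes the extracted text is the expected word with some characters removed.

```python
import unicodedata


def load_words(path, page_index=0):
    """Return pdfplumber's word dicts for one page (requires pdfplumber)."""
    import pdfplumber  # imported lazily so the helper below works without it

    with pdfplumber.open(path) as pdf:
        return pdf.pages[page_index].extract_words()


def dropped_characters(expected, extracted):
    """List the characters of `expected` that are absent from `extracted`.

    Walks both strings in order; every expected character that does not
    match the next extracted character is reported as dropped.
    """
    missing = []
    position = 0
    for char in expected:
        if position < len(extracted) and extracted[position] == char:
            position += 1
        else:
            name = unicodedata.name(char, "UNKNOWN")
            missing.append((char, f"U+{ord(char):04X}", name))
    return missing


# Example with a placeholder path:
#   words = load_words("invoice.pdf")
#   texts = [word["text"] for word in words]
#
# dropped_characters("Factură", "Factur")
#   -> [('ă', 'U+0103', 'LATIN SMALL LETTER A WITH BREVE')]
```

Posting the code points rather than the document makes it clear exactly which characters are lost, without exposing the invoice contents.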

samkit-jain commented 4 years ago

No problem @vesko-vujovic, I think it should. You can set the LC_ALL environment variable to C.UTF-8 and try again. To do so on Linux, run export LC_ALL=C.UTF-8. You can also try running export PYTHONIOENCODING=UTF-8.
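
To check that these environment variables actually reached the Python process, the effective settings can be printed from Python itself. This is a minimal sketch using only the standard library; `encoding_report` is a hypothetical helper name. Keep in mind that these settings control how Python encodes text when it is printed or written, which is a separate step from extracting it from the PDF.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the settings that influence how Python encodes printed text."""
    return {
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print(f"{name}: {value}")
```

If `stdout_encoding` is not a UTF-8 variant, a character such as ă may fail to print even when it was extracted correctly, so comparing `repr()` of the extracted string is a safer check than looking at terminal output.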

vesko-vujovic commented 4 years ago

@samkit-jain I've done everything you suggested, but the problem remains, and only with the letter ă, both lowercase and uppercase.
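
When only a single letter is affected, it can help to look below the word level at `page.chars`, where each character dict carries its `text` and `fontname`. The sketch below groups unusual characters by font, so it is easy to see whether ă ever shows up and in which font. `suspicious_chars_by_font` is a hypothetical helper name and `invoice.pdf` a placeholder path; the `(cid:` check rests on the assumption that pdfminer.six, which pdfplumber builds on, reports glyphs it cannot map to Unicode in that form.

```python
from collections import defaultdict


def suspicious_chars_by_font(chars):
    """Group non-ASCII and unmapped characters by the font they came from.

    `chars` is a list of dicts shaped like pdfplumber's `page.chars`,
    of which only the "text" and "fontname" keys are used here.
    """
    found = defaultdict(set)
    for char in chars:
        text = char.get("text", "")
        font = char.get("fontname", "unknown")
        # Flag "(cid:N)" placeholders and anything outside ASCII.
        if text.startswith("(cid:") or not text.isascii():
            found[font].add(text)
    return {font: sorted(texts) for font, texts in found.items()}


# Example with a placeholder path (requires pdfplumber):
#   import pdfplumber
#   with pdfplumber.open("invoice.pdf") as pdf:
#       print(suspicious_chars_by_font(pdf.pages[0].chars))
```

If other Romanian letters such as ș or ț appear in the result but ă never does, that would point at how the PDF encodes that particular glyph rather than at the machine's configuration.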

samkit-jain commented 4 years ago

Hi @vesko-vujovic, unfortunately, without the PDF there's not much that I can do. I tried reproducing your issue by creating a PDF like ttt.pdf. When I run pdfplumber on it, it works fine. When you use this PDF, is the special character extracted or not? If it is, then the issue most likely lies with the PDF you are dealing with. If it is not, then it could be a configuration issue on your machine.

$ python
Python 3.8.2 (default, Feb 26 2020, 02:56:10) 
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pdfplumber
>>> pdf = pdfplumber.open("ttt.pdf")
>>> pdf.pages[0].extract_text()
'Factură'

vesko-vujovic commented 4 years ago

Hi @samkit-jain, thank you for your effort. Apparently something is wrong with the PDF. I will close this issue, if you agree?

samkit-jain commented 4 years ago

Happy to help, and thanks for confirming, @vesko-vujovic. Closing the issue now.