Closed jchristn closed 11 months ago
Hi @jchristn, and thanks for sharing the code and PDF. Your first block of code runs error-free on my Mac, emitting the expected text. My best guess is that this is coming from pdfminer.six, this library's main dependency. Have you tried extracting the text with pdfminer.six alone?
Re. Ghostscript: Seems like pdfplumber can't find the Ghostscript executable; very possibly a bug on my end. What is the path to, or name of, the Ghostscript executable on your machine?
Hi @jsvine thanks for your follow-up. I have not tried pdfminer.six, but will do it ASAP.
Ghostscript seems to be installed in C:\Program Files\gs\gs10.01.2\bin. Would I need to add this to my path?
It seems to work with pdfminer.six.
Code:
import sys
from pdfminer.high_level import extract_text
filename = sys.argv[1]
# print("Filename: " + filename)
text = extract_text(filename)
print(text)
Document:
Result:
C:\Code\Python\pdfminer.six>py pdf.py 2.pdf
Lorem ipsum
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc ac faucibus odio.
Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut
varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum
condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus
convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis,
vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec
luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis
ipsum, ac accumsan nunc vehicula
[truncated to preserve sanity]
Thank you, @jchristn. Both of your notes are very helpful. I'll investigate and update you here.
Hi @jchristn, two updates for you:
On the main error you're encountering: I should have looked more closely at the error message earlier. Since it's being triggered by the print(text) step, rather than the extraction step, this seems to be caused by something with the way your installation of Python (or maybe terminal?) is handling utf-8-encoded text. (It appears it may be set to output cp1252-encoded text instead?) Not sure why it doesn't error on the pdfminer.six version; perhaps it's automatically stripping out the offending \uf0b7 character.
On the Ghostscript issue, I just added a feature (currently only available on the develop branch) that lets you pass a custom gs_path=... argument when repairing. This is moot, effectively, for your issue (since your PDF isn't in need of repair), but could be helpful for others.
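The encoding failure described in the first update can be reproduced without any PDF. The following is a minimal sketch, assuming only that the output stream is using cp1252 as the traceback suggests; the variable names are illustrative:

```python
# The private-use character reported in the traceback.
char = "\uf0b7"

# cp1252 has no mapping for U+F0B7, so encoding fails the same way print() does
# when the output stream is cp1252.
try:
    char.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can encode any Unicode code point, including private-use ones.
utf8_bytes = char.encode("utf-8")
```

This suggests the extraction step is fine and only the final encoding of the text for output is at fault.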
I'm running in Windows Terminal. Recommendation on what/how I could test? Shall I try the code and PDF above in Ubuntu?
Cheers, Joel
Running the code with the same PDF in Ubuntu sounds like a great test. Thanks!
Seems to work in Ubuntu 22.04:
joel@joelworkstation:~/code/python$ ls
2.pdf pdf.py
joel@joelworkstation:~/code/python$ python3 pdf.py 2.pdf
Lorem ipsum
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc ac faucibus odio.
Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut
varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum
condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus
convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis,
vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec
luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis
ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi
sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus
…
joel@joelworkstation:~/code/python$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
Any idea how to get this working on Windows 11?
I don't have a Windows machine to test this on, but does the suggestion here work for you?: https://stackoverflow.com/questions/14284269/why-doesnt-python-recognize-my-utf-8-encoded-source-file/14284404#14284404
Well, that did the trick :)
C:\Code\Python\pdfplumber>chcp 65001
Active code page: 65001
Guess this one can be closed!
So I'm trying to encapsulate this into a C# program that invokes shell commands to run the Python script. Looks like there may be a fix if the .open call can have the encoding specified.
https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters
As an FYI, running this through a shell wrapper works on Ubuntu but not on Windows, even if the command is:
>chcp 65001 && py pdf.py sample\pdf\2.pdf
Active code page: 65001
Lorem ipsum
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc ac faucibus odio.
Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut
varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum
condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus
convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis,
vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec
luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis
ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi
sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus
sit amet mauris tempus fringilla.
Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus.
Maecenas non lorem quis tellus placerat varius.
Nulla facilisi.
But through the shell wrapper, no dice.
For reference (just to be clear, I don't expect you to support this library, which is a library I've published via MIT license):
string command = "";
if (OperatingSystem.IsWindows())
{
    // https://stackoverflow.com/questions/14284269/why-doesnt-python-recognize-my-utf-8-encoded-source-file/14284404#14284404
    command += "chcp 65001 && ";
}
Shelli.OutputDataReceived = (s) =>
{
    lastDataReceived = DateTime.UtcNow;
    dataSb.Append(s + Environment.NewLine);
};
Shelli.ErrorDataReceived = (s) =>
{
    lastErrorReceived = DateTime.UtcNow;
    errorSb.Append(s + Environment.NewLine);
};
command += "py pdf.py " + _Filename;
int returnCode = Shelli.Go(command);
Results in:
[PdfParser] non-zero return code received from pdf.py: 1
Traceback (most recent call last):
File "C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0\pdf.py", line 10, in <module>
print(text)
File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 908: character maps to <undefined>
Not sure if this is applicable: https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters but I can't find anything else that might help.
Problem also exists when doing
C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0>chcp 65001
Active code page: 65001
C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0>py pdf.py sample\pdf\2.pdf > foo
Traceback (most recent call last):
File "C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0\pdf.py", line 10, in <module>
print(text)
File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 908: character maps to <undefined>
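A possible workaround that does not depend on the console code page is to have the script itself switch its output stream to UTF-8 before printing. This is a minimal sketch, assuming Python 3.7 or later (where text streams provide reconfigure()); force_utf8 is a hypothetical helper name, and the cp1252 stream below is only a stand-in for the redirected stdout seen in the traceback:

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure() (Python 3.7+)."""
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    return stream


# Simulate an output stream that defaults to cp1252.
buffer = io.BytesIO()
fake_stdout = io.TextIOWrapper(buffer, encoding="cp1252")

# Without this call, the print below would raise UnicodeEncodeError.
force_utf8(fake_stdout)
print("\uf0b7", file=fake_stdout)
fake_stdout.flush()

written = buffer.getvalue()
```

In pdf.py the equivalent would be to import sys and call force_utf8(sys.stdout) before print(text). Whether that is preferable to setting the encoding from outside the process depends on who controls the script.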
I may be wrong but, judging by the above, this seems like an issue with the shell wrapper (and how it handles unicode), and independent of pdfplumber. As a test, can you have the shell wrapper try to run this minimal Python program?:
print("\uf0b7")
Do you support passing encoding='utf-8' into your pdfplumber.open method? The problem exists even with print("\uf0b7")
Hi @jsvine I just tried the print statement you recommended.
1) Natively with py, worked great
2) Via the shell wrapper, exception
C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>type printchar.py
print("\uf0b7")
C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>py printchar.py
C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>test
Command [q to quit]: py printchar.py
*** Traceback (most recent call last):
*** File "C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0\printchar.py", line 1, in <module>
*** print("\uf0b7")
*** File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
*** return codecs.charmap_encode(input,self.errors,encoding_table)[0]
*** UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 0: character maps to <undefined>
Return code: 1
It appears that, at least in terms of the shell wrapper, the issue is in assigning a standard encoding:
p.StartInfo.StandardOutputEncoding = Encoding.GetEncoding("utf-16");
p.StartInfo.StandardErrorEncoding = Encoding.GetEncoding("utf-16");
With py printchar.py (the print statement you shared above):
C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>test
Command [q to quit]: py printchar.py
*** ???????4??????????+????????????????????????????????????????????++???????? ??????????????????????????????????????????????????????++??????????????????????????????????????????????????????????????????????????>???????????????
Return code: 1
Command [q to quit]: Command [q to quit]: ^C
C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>py printchar.py
Similarly I tried with "utf-32":
C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>test
Command [q to quit]: py printchar.py
*** ??????????????????????????????????????????????????????????????????????????????????????????????????????????????
Return code: 1
Command [q to quit]: Command [q to quit]: ^C
C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>
C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>py printchar.py
Note that the py command is not returning a return code of 0.
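One plausible reading of the unreadable output above is that the child process is still emitting single-byte text (the cp1252 traceback) while the wrapper decodes it as UTF-16. This is an assumption, not something confirmed in the thread. A minimal sketch of that mismatch, using the first line of the traceback as sample data:

```python
message = "Traceback (most recent call last):"

# What the child process would write if its stderr is cp1252.
raw_bytes = message.encode("cp1252")

# Decoding with the matching codec recovers the text.
round_trip = raw_bytes.decode("cp1252")

# Decoding the same bytes as UTF-16 pairs up unrelated bytes into
# unrelated code points, so the result no longer resembles the message.
garbled = raw_bytes.decode("utf-16-le", errors="replace")
```

A console that cannot render the resulting characters may then show them as question marks.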
Ok, I think I have this figured out. This will be useful for anyone that is calling your Python library from C# Process.Start (or the Shelli library):
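string cmd = "chcp 65001 && SET PYTHONIOENCODING=utf8 && py myfile.py";
Process p = new Process();
// Run the command line through cmd.exe so that chcp, SET, and && are interpreted by the shell.
p.StartInfo.FileName = "cmd.exe";
p.StartInfo.Arguments = "/c " + cmd;
p.StartInfo.UseShellExecute = false;
// Process.Start() on an instance takes no arguments; it uses StartInfo.
p.Start();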
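The PYTHONIOENCODING part of this fix can also be checked from Python itself, without a C# wrapper. This is a minimal sketch, assuming the child is launched with the standard library's subprocess module; the inline one-line program stands in for printchar.py:

```python
import os
import subprocess
import sys

# Copy the current environment and ask the child interpreter to use UTF-8
# for stdin, stdout, and stderr.
env = dict(os.environ, PYTHONIOENCODING="utf8")

result = subprocess.run(
    [sys.executable, "-c", 'print("\\uf0b7")'],
    env=env,
    capture_output=True,
    encoding="utf-8",  # decode the child's output with the same encoding
)
```

The key point is that both sides agree: the child encodes as UTF-8 and the parent decodes as UTF-8.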
Thanks, @jchristn! I'll add: Given that the minimal Python program print("\uf0b7") also triggered the error, this does not seem to be an issue particular to pdfplumber, but rather any program (Python or otherwise) that emits utf-8 encoded text (which would be many).
Hi Jeremy, going to send you an email shortly. Cheers, Joel
Describe the bug
With the following code:
And the following sample file: 2.pdf
I receive the error:
And python returns with status code 1.
Have you tried repairing the PDF?
Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.
I just modified the code and ran with repair=True and received a different error. This error continued even after installing Ghostscript.
Code to reproduce the problem
Without repair=True
With repair=True
PDF file
Please attach any PDFs necessary to reproduce the problem.
If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.
2.pdf
Expected behavior
What did you expect the result should have been?
Return the text from the document.
Actual behavior
What actually happened, instead?
Error message as shown above.
Screenshots
If applicable, add screenshots to help explain your problem.
N/A
Environment