UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 908: character maps to <undefined>

jchristn commented 11 months ago

Describe the bug

With the following code:

import sys
import pdfplumber

filename = sys.argv[1]
# print("Filename: " + filename)

with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

And the following sample file: 2.pdf

I receive the error:

Traceback (most recent call last):
  File "C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0\pdf.py", line 10, in <module>
    print(text)
  File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 908: character maps to <undefined>

And python returns with status code 1.

Have you tried repairing the PDF?

Please try running your code with pdfplumber.open(..., repair=True) before submitting a bug report.

I modified the code to pass repair=True and received a different error. This error persisted even after installing Ghostscript.

Traceback (most recent call last):
  File "C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0\pdf.py", line 7, in <module>
    with pdfplumber.open(filename, repair=True) as pdf:
  File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfplumber\pdf.py", line 78, in open
    stream = _repair(path_or_fp, password=password)
  File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\site-packages\pdfplumber\repair.py", line 15, in _repair
    raise Exception(
Exception: Cannot find Ghostscript, which is required for repairs.
Visit https://www.ghostscript.com/ for installation instructions.

Code to reproduce the problem

Without repair=True

import sys
import pdfplumber

filename = sys.argv[1]
# print("Filename: " + filename)

with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

With repair=True

import sys
import pdfplumber

filename = sys.argv[1]
# print("Filename: " + filename)

with pdfplumber.open(filename, repair=True) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

PDF file

Please attach any PDFs necessary to reproduce the problem.

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

2.pdf

Expected behavior

What did you expect the result should have been?

Return the text from the document.

Actual behavior

What actually happened, instead?

Error message as shown above.

Screenshots

If applicable, add screenshots to help explain your problem.

N/A

Environment

pdfplumber version: 0.10.2
Python version: 3.10.4
OS: Windows 11

Additional context

Add any other context/notes about the problem here.

jsvine commented 11 months ago

Hi @jchristn, and thanks for sharing the code and PDF. Your first block of code runs error-free on my Mac, emitting the expected text. My best guess is that this is coming from pdfminer.six, this library's main dependency. Have you tried extracting the text with pdfminer.six alone?

Re. Ghostscript: Seems like pdfplumber can't find the Ghostscript executable; very possibly a bug on my end. What is the path to, or name of, the Ghostscript executable on your machine?

jchristn commented 11 months ago

Hi @jsvine thanks for your follow-up. I have not tried pdfminer.six, but will do it ASAP.

Ghostscript seems to be installed in C:\Program Files\gs\gs10.01.2\bin. Would I need to add this to my path?

jchristn commented 11 months ago

It seems to work with pdfminer.six.

Code:

import sys
from pdfminer.high_level import extract_text

filename = sys.argv[1]
# print("Filename: " + filename)

text = extract_text(filename)
print(text)

Document:

2.pdf

Result:

C:\Code\Python\pdfminer.six>py pdf.py 2.pdf
Lorem ipsum

Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc ac faucibus odio.

Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut
varius  sem.  Nullam  at  porttitor  arcu,  nec  lacinia  nisi.  Ut  ac  dolor  vitae  odio  interdum
condimentum.  Vivamus  dapibus  sodales  ex,  vitae  malesuada  ipsum  cursus
convallis.  Maecenas  sed  egestas  nulla,  ac  condimentum  orci.  Mauris  diam  felis,
vulputate  ac  suscipit  et,  iaculis  non  est.  Curabitur  semper  arcu  ac  ligula  semper,  nec
luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis
ipsum, ac accumsan nunc vehicula
[truncated to preserve sanity]

jsvine commented 11 months ago

Thank you, @jchristn. Both of your notes are very helpful. I'll investigate and update you here.

jsvine commented 11 months ago

Hi @jchristn, two updates for you:

On the main error you're encountering: I should have looked more closely at the error message earlier. Since it's being triggered by the print(text) step, rather than the extraction step, this seems to be caused by something with the way your installation of Python (or maybe terminal?) is handling utf-8-encoded text. (It appears it may be set to output cp1252-encoded text instead?) Not sure why it doesn't error on the pdfminer.six version; perhaps it's automatically stripping out the offending \uf0b7 character.
On the Ghostscript issue, I just added a feature (currently only available on the develop branch) that lets you pass a custom gs_path=... argument when repairing. This is moot, effectively, for your issue (since your PDF isn't in need of repair), but could be helpful for others.
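If the cause is indeed the output encoding, as the first point above suggests, one option is to switch stdout to UTF-8 inside the script before printing. The following is a minimal sketch, not something confirmed in this thread: `can_encode` and `force_utf8_stdout` are hypothetical helper names, and `reconfigure` is only available on standard text streams in Python 3.7+.

```python
import sys

# The private-use character named in the traceback.
PUA_BULLET = "\uf0b7"


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` without error."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def force_utf8_stdout():
    """Switch sys.stdout to UTF-8 when the stream supports reconfigure().

    Returns True if the stream was reconfigured, False otherwise
    (for example, when stdout has been replaced by a custom object).
    """
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")
        return True
    return False
```

Calling `force_utf8_stdout()` once at the top of the script, before the first `print(text)`, should be enough. `can_encode(PUA_BULLET, "cp1252")` returns False, which matches the 'charmap' error in the traceback.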

jchristn commented 11 months ago

I'm running in Windows Terminal. Recommendation on what/how I could test? Shall I try the code and PDF above in Ubuntu?

Cheers, Joel

jsvine commented 11 months ago

Running the code with the same PDF in Ubuntu sounds like a great test. Thanks!

jchristn commented 11 months ago

Seems to work in Ubuntu 22.04:

joel@joelworkstation:~/code/python$ ls
2.pdf  pdf.py
joel@joelworkstation:~/code/python$ python3 pdf.py 2.pdf
Lorem ipsum
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc ac faucibus odio.
Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut
varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum
condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus
convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis,
vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec
luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis
ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi
sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus
…
joel@joelworkstation:~/code/python$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:        22.04
Codename:       jammy

Any idea how to get this working on Windows 11?

jsvine commented 11 months ago

I don't have a Windows machine to test this on, but does the suggestion here work for you?: https://stackoverflow.com/questions/14284269/why-doesnt-python-recognize-my-utf-8-encoded-source-file/14284404#14284404

jchristn commented 11 months ago

Well, that did the trick :)

C:\Code\Python\pdfplumber>chcp 65001
Active code page: 65001

Guess this one can be closed!

jchristn commented 11 months ago

So I'm trying to encapsulate this into a C# program that invokes shell commands to run the Python script. Looks like there may be a fix if the .open call can have the encoding specified.

https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters

As an FYI, running this through a shell wrapper works on Ubuntu but not on Windows, even if the command is:

>chcp 65001 && py pdf.py sample\pdf\2.pdf
Active code page: 65001
Lorem ipsum
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc ac faucibus odio.
Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut
varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum
condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus
convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis,
vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec
luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis
ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi
sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus
sit amet mauris tempus fringilla.
Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus.
 Maecenas non lorem quis tellus placerat varius.
 Nulla facilisi.

But through the shell wrapper, no dice.

jchristn commented 11 months ago

For reference, here is the wrapper code. (To be clear, I don't expect you to support this library; it's one I've published under the MIT license.)


            string command = "";

            if (OperatingSystem.IsWindows())
            {
                // https://stackoverflow.com/questions/14284269/why-doesnt-python-recognize-my-utf-8-encoded-source-file/14284404#14284404
                command += "chcp 65001 && ";
            }

            Shelli.OutputDataReceived = (s) =>
            {
                lastDataReceived = DateTime.UtcNow;
                dataSb.Append(s + Environment.NewLine);
            };

            Shelli.ErrorDataReceived = (s) =>
            {
                lastErrorReceived = DateTime.UtcNow;
                errorSb.Append(s + Environment.NewLine);
            };

            command += "py pdf.py " + _Filename;

            int returnCode = Shelli.Go(command);

Results in:

[PdfParser] non-zero return code received from pdf.py: 1

Traceback (most recent call last):
  File "C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0\pdf.py", line 10, in <module>
    print(text)
  File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 908: character maps to <undefined>

jchristn commented 11 months ago

Not sure if this is applicable: https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters but I can't find anything else that might help.

Problem also exists when doing

C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0>chcp 65001
Active code page: 65001

C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0>py pdf.py sample\pdf\2.pdf > foo
Traceback (most recent call last):
  File "C:\Code\Misc\PdfParser\src\Test.PdfParser\bin\Debug\net6.0\pdf.py", line 10, in <module>
    print(text)
  File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 908: character maps to <undefined>
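
One possible explanation, offered as an assumption rather than something confirmed in this thread: chcp changes the console code page, but when stdout is redirected to a file or a pipe, Python on Windows generally falls back to the locale's preferred encoding (cp1252 here, judging by the traceback). A way to sidestep the redirect entirely is to have the script write the file itself with an explicit encoding. A minimal sketch, where `save_text` is a hypothetical helper:

```python
from pathlib import Path


def save_text(text, path):
    """Write `text` to `path` as UTF-8, regardless of console or locale settings."""
    # In the script above, `text` would be the string returned by
    # page.extract_text(); any str works here.
    Path(path).write_text(text, encoding="utf-8")
```

With this, the script would call `save_text(text, "foo")` instead of relying on `> foo` in the shell.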

jsvine commented 11 months ago

I may be wrong but, judging by the above, this seems like an issue with the shell wrapper (and how it handles unicode), and independent of pdfplumber. As a test, can you have the shell wrapper try to run this minimal Python program?:

print("\uf0b7")

jchristn commented 11 months ago

Do you support passing encoding='utf-8' into your pdfplumber.open method? The problem exists even with print("\uf0b7")

jchristn commented 11 months ago

Hi @jsvine I just tried the print statement you recommended.

1) Natively with py: worked great
2) Via the shell wrapper: exception

C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>type printchar.py
print("\uf0b7")

C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>py printchar.py


C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>test
Command [q to quit]: py printchar.py
*** Traceback (most recent call last):
***   File "C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0\printchar.py", line 1, in <module>
***     print("\uf0b7")
***   File "C:\Users\joelc\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 19, in encode
***     return codecs.charmap_encode(input,self.errors,encoding_table)[0]
*** UnicodeEncodeError: 'charmap' codec can't encode character '\uf0b7' in position 0: character maps to <undefined>

Return code: 1
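
To compare the two environments (native terminal versus shell wrapper), it may help to have Python report which encodings it has picked. A small diagnostic sketch, where `describe_io_encodings` is a hypothetical helper name; it uses only the standard library:

```python
import locale
import sys


def describe_io_encodings():
    """Collect the encodings Python has chosen for its standard streams."""
    return {
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "stderr_encoding": getattr(sys.stderr, "encoding", None),
        # False when output goes to a pipe or file, as under a wrapper.
        "stdout_is_terminal": sys.stdout.isatty(),
        "locale_preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in describe_io_encodings().items():
        print(f"{name}: {value}")
```

Running this once directly with py and once through the wrapper should show whether stdout_encoding differs between the two.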

jchristn commented 11 months ago

It appears that, at least in terms of the shell wrapper, the issue is in assigning a standard encoding:

            p.StartInfo.StandardOutputEncoding = Encoding.GetEncoding("utf-16");
            p.StartInfo.StandardErrorEncoding = Encoding.GetEncoding("utf-16");

With py printchar.py (the print statement you shared above):

C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>test
Command [q to quit]: py printchar.py
*** ???????4??????????+????????????????????????????????????????????++???????? ??????????????????????????????????????????????????????++??????????????????????????????????????????????????????????????????????????>???????????????

Return code: 1
Command [q to quit]: Command [q to quit]: ^C

C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>py printchar.py


Similarly I tried with "utf-32":

C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>test
Command [q to quit]: py printchar.py
*** ??????????????????????????????????????????????????????????????????????????????????????????????????????????????

Return code: 1
Command [q to quit]: Command [q to quit]: ^C
C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>
C:\Code\Misc\Shelli\src\Test\bin\Debug\net6.0>py printchar.py


Note that the py command is not returning a return code of 0.

jchristn commented 11 months ago

Ok, I think I have this figured out. This will be useful for anyone that is calling your Python library from C# Process.Start (or the Shelli library):

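// Assumption: the command string is handed to cmd.exe via /c so that
// chcp and SET take effect before py starts.
string cmd = "chcp 65001 && SET PYTHONIOENCODING=utf8 && py myfile.py";
Process p = new Process();
p.StartInfo.FileName = "cmd.exe";
p.StartInfo.Arguments = "/c " + cmd;
p.Start();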
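
The same environment-variable approach can be sketched from Python, for anyone launching the script from another Python process rather than from C#. This is an illustration under assumptions, not part of pdfplumber: `run_python_utf8` is a hypothetical helper, and it sets PYTHONIOENCODING only for the child process.

```python
import os
import subprocess
import sys


def run_python_utf8(code):
    """Run `code` in a child Python process whose stdio is forced to UTF-8.

    Returns the child's stdout decoded as UTF-8.
    """
    # Copy the current environment and add PYTHONIOENCODING for the child only.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero return code
    )
    return result.stdout.decode("utf-8")
```

For example, `run_python_utf8('print("\\uf0b7")')` runs the minimal test program from earlier in the thread with its output captured through a pipe, which is the situation where the wrapper failed.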

jsvine commented 11 months ago

Thanks, @jchristn! I'll add: Given that the minimal Python program print("\uf0b7") also triggered the error, this does not seem to be an issue particular to pdfplumber, but rather any program (Python or otherwise) that emits utf-8 encoded text (which would be many).

jchristn commented 11 months ago

Hi Jeremy, going to send you an email shortly. Cheers, Joel

jsvine / pdfplumber