pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

UnicodeEncodeError when piping/Tee-Object on Windows #951

Open jamesdeluk opened 8 months ago

jamesdeluk commented 8 months ago

Running the script normally seems to work, printing out the full file.

However, if I redirect the output to a file or pipe it to Tee-Object:

python .\pdf2txt.py file.pdf > file.txt

or

python .\pdf2txt.py file.pdf | Tee-Object file.txt

I get the following error (Command Prompt and PowerShell):

Traceback (most recent call last):
  File "C:\Users\user\Downloads\pdfminer-env\Scripts\pdf2txt.py", line 317, in <module>
    sys.exit(main())
             ^^^^^^
  File "C:\Users\user\Downloads\pdfminer-env\Scripts\pdf2txt.py", line 311, in main
    outfp = extract_text(**vars(parsed_args))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\Downloads\pdfminer-env\Scripts\pdf2txt.py", line 62, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "C:\Users\user\Downloads\pdfminer-env\Lib\site-packages\pdfminer\high_level.py", line 132, in extract_text_to_fp
    interpreter.process_page(page)
  File "C:\Users\user\Downloads\pdfminer-env\Lib\site-packages\pdfminer\pdfinterp.py", line 998, in process_page
    self.device.end_page(page)
  File "C:\Users\user\Downloads\pdfminer-env\Lib\site-packages\pdfminer\converter.py", line 81, in end_page
    self.receive_layout(self.cur_item)
  File "C:\Users\user\Downloads\pdfminer-env\Lib\site-packages\pdfminer\converter.py", line 352, in receive_layout
    render(ltpage)
  File "C:\Users\user\Downloads\pdfminer-env\Lib\site-packages\pdfminer\converter.py", line 341, in render
    render(child)
  File "C:\Users\user\Downloads\pdfminer-env\Lib\site-packages\pdfminer\converter.py", line 341, in render
    render(child)
  File "C:\Users\user\Downloads\pdfminer-env\Lib\site-packages\pdfminer\converter.py", line 341, in render
    render(child)
  File "C:\Users\user\Downloads\pdfminer-env\Lib\site-packages\pdfminer\converter.py", line 343, in render
    self.write_text(item.get_text())
  File "C:\Users\user\Downloads\pdfminer-env\Lib\site-packages\pdfminer\converter.py", line 335, in write_text
    cast(TextIO, self.outfp).write(text)
  File "C:\Program Files\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\x83' in position 0: character maps to <undefined>
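
The last two frames of the traceback suggest the failure happens when text is encoded for stdout rather than inside the PDF parsing itself. When stdout is redirected or piped on Windows, Python typically falls back to the ANSI code page (cp1252 here, per the traceback), and that codec has no mapping for the character '\x83'. The following is a minimal sketch of that behaviour, not pdfminer code. The helper name `write_as` is hypothetical, and an in-memory stream stands in for the redirected stdout.

```python
import io


def write_as(text, encoding, errors="strict"):
    """Write text through a text stream with the given encoding, return the raw bytes.

    Hypothetical helper: the BytesIO buffer stands in for a redirected stdout.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    stream.detach()  # keep the underlying buffer open after the wrapper goes away
    return raw.getvalue()


if __name__ == "__main__":
    char = "\x83"  # the character named in the traceback

    # UTF-8 can represent every code point, so this succeeds.
    print("utf-8:", write_as(char, "utf-8"))

    # cp1252 in strict mode raises the same error class seen in the report.
    try:
        write_as(char, "cp1252")
    except UnicodeEncodeError as exc:
        print("cp1252 strict:", exc.reason)

    # A lossy error handler avoids the crash but substitutes the character.
    print("cp1252 replace:", write_as(char, "cp1252", errors="replace"))
```

If this is indeed the cause, possible workarounds (untested against this particular PDF) include setting the environment variable PYTHONIOENCODING=utf-8 (or PYTHONUTF8=1) before running the script, or letting pdf2txt.py write the file itself with its -o/--outfile option instead of redirecting stdout.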