jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Read pdf error on linux #635

Closed. FANGOD closed this issue 2 years ago

FANGOD commented 2 years ago

pdfplumber 0.6.0

Traceback (most recent call last):
  File "test-c.py", line 94, in <module>
    raw_text += page.extract_text()
  File "/home/lpc/.local/lib/python3.8/site-packages/pdfplumber/page.py", line 258, in extract_text
    self.chars, x_shift=self.bbox[0], y_shift=self.bbox[1], **kwargs
  File "/home/lpc/.local/lib/python3.8/site-packages/pdfplumber/container.py", line 49, in chars
    return self.objects.get("char", [])
  File "/home/lpc/.local/lib/python3.8/site-packages/pdfplumber/page.py", line 152, in objects
    self._objects = self.parse_objects()
  File "/home/lpc/.local/lib/python3.8/site-packages/pdfplumber/page.py", line 208, in parse_objects
    for obj in self.iter_layout_objects(self.layout._objs):
  File "/home/lpc/.local/lib/python3.8/site-packages/pdfplumber/page.py", line 98, in layout
    interpreter.process_page(self.page_obj)
  File "/home/lpc/.local/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 841, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/lpc/.local/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 854, in render_contents
    self.execute(list_value(streams))
  File "/home/lpc/.local/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 869, in execute
    name = keyword_name(obj).decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x86 in position 0: ordinal not in range(128)

pdf: https://www.bitdefender.com/content/dam/bitdefender/business/whitepapers/pdf/small-Bitdefender-Whitepaper-Virt-CIO-A4-en-EN-screen-compressed.pdf

Works fine on Windows.
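The last frame of the traceback can be reproduced in isolation, without pdfminer: decoding a byte outside the 0-127 range as ASCII raises the same error. A minimal sketch follows; the `token` value is an assumption standing in for the keyword bytes read from the PDF's content stream, since only the leading 0x86 byte is known from the traceback.

```python
# Stand-in for the keyword bytes pdfminer read from the content stream.
# Only the leading 0x86 byte is known from the traceback; the rest of the
# real token is not shown in the report.
token = b"\x86"

try:
    token.decode("ascii")
except UnicodeDecodeError as exc:
    # Same exception type and message as the final line of the traceback
    message = str(exc)
    print(message)
```

This only shows where the exception comes from. It does not explain why the same file parses on Windows but not on Linux.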

jsvine commented 2 years ago

Hi @FANGOD, and thanks for your interest in this library. I can confirm that the PDF also processes fine on Mac. I'm not sure what is causing the issue on Linux but, in any case, the stack trace indicates that the error stems from pdfminer (the lower-level library we use to extract the PDF's structural information and objects) rather than from pdfplumber. For that reason, I'm closing this issue. If you'd like, you can open an issue in the pdfminer repository; I would recommend pasting or attaching a minimal Python script that fully reproduces the problem. Something like:

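import pdfminer
from pdfminer.high_level import extract_text

# Print the installed pdfminer version so it can be included in the report
print(f"pdfminer version {pdfminer.__version__}")
# On the affected Linux setup this call is expected to raise UnicodeDecodeError
extract_text('small-Bitdefender-Whitepaper-Virt-CIO-A4-en-EN-screen-compressed.pdf')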

... assuming that this does reproduce the problem on Linux.
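Since the same file parses on Windows and Mac, it may also help to attach the environment details from each machine so that version differences can be ruled out. A rough sketch, using only the standard library (Python 3.8+); `describe_environment` is a hypothetical helper name, not part of pdfplumber or pdfminer, and the package names assume the usual PyPI distributions `pdfplumber` and `pdfminer.six`.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages=("pdfplumber", "pdfminer.six")):
    """Collect OS, Python, and package versions for a bug report.

    Hypothetical helper for illustration; the default package names are
    assumed to match the installed PyPI distributions.
    """
    info = {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running this on both the working and the failing machine and comparing the output is a reasonable first check before filing the pdfminer issue.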