jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.31k stars 647 forks

`ValueError: bytes must be in range(0, 256)` in page.chars #695

Closed bpugnaire closed 2 years ago

bpugnaire commented 2 years ago

Describe the bug

When accessing page.chars, a ValueError: bytes must be in range(0, 256) is raised. Here is the full exception:

    if len(page_cropped.chars) != 0:
  File "/home/baptistepugnaire/.local/share/virtualenvs/extraction_donnees_pdf-FvW0QW7r/lib/python3.8/site-packages/pdfplumber/container.py", line 50, in chars
    return self.objects.get("char", [])
  File "/home/baptistepugnaire/.local/share/virtualenvs/extraction_donnees_pdf-FvW0QW7r/lib/python3.8/site-packages/pdfplumber/page.py", line 469, in objects
    k: self._crop_fn(v) for k, v in self.parent_page.objects.items()
  File "/home/baptistepugnaire/.local/share/virtualenvs/extraction_donnees_pdf-FvW0QW7r/lib/python3.8/site-packages/pdfplumber/page.py", line 192, in objects
    self._objects: Dict[str, T_obj_list] = self.parse_objects()
  File "/home/baptistepugnaire/.local/share/virtualenvs/extraction_donnees_pdf-FvW0QW7r/lib/python3.8/site-packages/pdfplumber/page.py", line 248, in parse_objects
    for obj in self.iter_layout_objects(self.layout._objs):
  File "/home/baptistepugnaire/.local/share/virtualenvs/extraction_donnees_pdf-FvW0QW7r/lib/python3.8/site-packages/pdfplumber/page.py", line 138, in layout
    interpreter.process_page(self.page_obj)
  File "/home/baptistepugnaire/.local/share/virtualenvs/extraction_donnees_pdf-FvW0QW7r/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 991, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/baptistepugnaire/.local/share/virtualenvs/extraction_donnees_pdf-FvW0QW7r/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 1010, in render_contents
    self.execute(list_value(streams))
  File "/home/baptistepugnaire/.local/share/virtualenvs/extraction_donnees_pdf-FvW0QW7r/lib/python3.8/site-packages/pdfminer/pdfinterp.py", line 1021, in execute
    (_, obj) = parser.nextobject()
  File "/home/baptistepugnaire/.local/share/virtualenvs/extraction_donnees_pdf-FvW0QW7r/lib/python3.8/site-packages/pdfminer/psparser.py", line 607, in nextobject
    (pos, token) = self.nexttoken()
  File "/home/baptistepugnaire/.local/share/virtualenvs/extraction_donnees_pdf-FvW0QW7r/lib/python3.8/site-packages/pdfminer/psparser.py", line 525, in nexttoken
    self.charpos = self._parse1(self.buf, self.charpos)
  File "/home/baptistepugnaire/.local/share/virtualenvs/extraction_donnees_pdf-FvW0QW7r/lib/python3.8/site-packages/pdfminer/psparser.py", line 474, in _parse_string_1
    self._curtoken += bytes((int(self.oct, 8),))
ValueError: bytes must be in range(0, 256)

Code to reproduce the problem

len(page_cropped.chars) != 0

PDF file

Unfortunately due to my own incompetence I didn't log which PDF was faulty so I can't help you with that, I'm sorry.

Expected behavior

No errors when accessing chars

Environment

jsvine commented 2 years ago

Looking at the stack trace, specifically the final part (pdfminer/psparser.py, line 474, in _parse_string_1, where self._curtoken += bytes((int(self.oct, 8),)) raises ValueError: bytes must be in range(0, 256)), this appears to stem from either a bug in pdfminer.six or (generally more common) a malformed PDF. It seems a string in the PDF used an octal escape whose digits parse as octal but whose value is above 255, so it cannot be stored in a single byte.
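
The failing expression from the traceback can be reproduced in isolation, without pdfminer.six or the original file. This is a minimal sketch, assuming the PDF contained an octal escape such as `\400` (a hypothetical value, since the faulty file was never identified). The function names are made up for illustration. The masked variant follows the PDF specification's rule that high-order overflow in an octal escape is ignored; it is shown only as one possible lenient interpretation, not as pdfminer.six's actual behaviour.

```python
def octal_escape_to_byte(digits: str) -> bytes:
    """Mirror the expression from the traceback: bytes((int(oct, 8),))."""
    return bytes((int(digits, 8),))


def octal_escape_to_byte_masked(digits: str) -> bytes:
    """Hypothetical lenient variant: keep only the low 8 bits of the value."""
    return bytes((int(digits, 8) & 0xFF,))


# "101" in octal is 65, which fits in one byte.
print(octal_escape_to_byte("101"))  # b'A'

# "400" in octal is 256, one past the largest byte value.
try:
    octal_escape_to_byte("400")
except ValueError as exc:
    print(exc)  # bytes must be in range(0, 256)

# Masking discards the overflow instead of raising.
print(octal_escape_to_byte_masked("400"))  # b'\x00'
```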

If you find the PDF, you can try repairing it with Ghostscript.
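
One way to do that from Python is to have Ghostscript rewrite the file with its `pdfwrite` device. The sketch below assumes the Ghostscript executable is installed and available as `gs` on the PATH (on Windows it usually has a different name); the helper names `ghostscript_repair_command` and `repair_pdf` are hypothetical. Rewriting the file is not guaranteed to remove this particular error.

```python
import shutil
import subprocess
from pathlib import Path


def ghostscript_repair_command(src, dst, gs_binary="gs"):
    """Build the Ghostscript command line that rewrites src into dst."""
    # -o sets the output file and implies batch, non-interactive mode.
    return [gs_binary, "-o", str(dst), "-sDEVICE=pdfwrite", str(src)]


def repair_pdf(src, dst, gs_binary="gs"):
    """Rewrite a PDF through Ghostscript and return the path of the new file."""
    if shutil.which(gs_binary) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {gs_binary}")
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(ghostscript_repair_command(src, dst, gs_binary), check=True)
    return Path(dst)
```

Calling `repair_pdf("broken.pdf", "repaired.pdf")` (placeholder file names) would write a new copy, which can then be opened with `pdfplumber.open` to check whether `page.chars` still raises.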