jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

extract_words throws unsupported operand type(s) for +: 'PSLiteral' and 'list' #874

Closed: Laubeee closed this issue 1 year ago

Laubeee commented 1 year ago

Describe the bug

I'm getting unsupported operand type(s) for +: 'PSLiteral' and 'list' in page.extract_words(x_tolerance=1, y_tolerance=1, horizontal_ltr=True, vertical_ttb=False, keep_blank_chars=True, use_text_flow=True).

The PDF has 3 pages and the first 2 do not encounter this error. Other PDFs work fine.

Code to reproduce the problem

    try:
        if utils.is_pdf(file_path) or utils.is_image(file_path):
            if utils.is_pdf(file_path):
                with pdfplumber.open(os.path.abspath(file_path)) as pdf:
                    for page_num, page in enumerate(pdf.pages):
                        words = page.extract_words(x_tolerance=1, y_tolerance=1, horizontal_ltr=True, vertical_ttb=False, keep_blank_chars=True, use_text_flow=True)
    except Exception as ex:
        traceback.print_exc()
        print(ex)

PDF file

Unfortunately I cannot share the PDF from the customer (contains sensitive graphical content)

Expected behavior

word extraction without exception

Actual behavior

method threw an exception

Environment

Additional context

Traceback (most recent call last):
  File "C:\Users\<user>\AppData\Local\Temp\ipykernel_18836\387819089.py", line 34, in convert_file_to_images
    words = page.extract_words(x_tolerance=1, y_tolerance=1, horizontal_ltr=True, vertical_ttb=False, keep_blank_chars=True, use_text_flow=True)
  File "C:\Projects\<proj>\venv\lib\site-packages\pdfplumber\page.py", line 356, in extract_words
    return utils.extract_words(self.chars, **kwargs)
  File "C:\Projects\<proj>\venv\lib\site-packages\pdfplumber\container.py", line 50, in chars
    return self.objects.get("char", [])
  File "C:\Projects\<proj>\venv\lib\site-packages\pdfplumber\page.py", line 215, in objects
    self._objects: Dict[str, T_obj_list] = self.parse_objects()
  File "C:\Projects\<proj>\venv\lib\site-packages\pdfplumber\page.py", line 275, in parse_objects
    for obj in self.iter_layout_objects(self.layout._objs):
  File "C:\Projects\<proj>\venv\lib\site-packages\pdfplumber\page.py", line 161, in layout
    interpreter.process_page(self.page_obj)
  File "C:\Projects\<proj>\venv\lib\site-packages\pdfminer\pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "C:\Projects\<proj>\venv\lib\site-packages\pdfminer\pdfinterp.py", line 1016, in render_contents
    self.execute(list_value(streams))
  File "C:\Projects\<proj>\venv\lib\site-packages\pdfminer\pdfinterp.py", line 1042, in execute
    func(*args)
  File "C:\Projects\<proj>\venv\lib\site-packages\pdfminer\pdfinterp.py", line 560, in do_re
    self.curpath.append(("l", x + w, y))
TypeError: unsupported operand type(s) for +: 'PSLiteral' and 'list'

jsvine commented 1 year ago

Hi @Laubeee, and thanks for your interest in this library. Based on the traceback, this seems to be an issue with pdfminer.six (pdfplumber's main dependency), and thus unfortunately unfixable by pdfplumber.

The .extract_words part seems to be a red herring; it's likely just that when you call .extract_words(...), this is the first time that pdfplumber is asking pdfminer.six for the precise data for that particular page. You can check this by swapping out page.extract_words(...) for page.objects.
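A minimal sketch of that check, assuming only that each page exposes an `.objects` attribute as described above. The helper names `find_failing_pages` and `check_file` are hypothetical and not part of pdfplumber. If the same pages fail here as with `.extract_words(...)`, the error comes from page parsing rather than from word extraction.

```python
from typing import Any, Dict, Iterable


def find_failing_pages(pages: Iterable[Any]) -> Dict[int, str]:
    """Access .objects on every page and record which pages raise.

    Returns a mapping of 1-based page number to a short error description.
    """
    failures: Dict[int, str] = {}
    for page_num, page in enumerate(pages, start=1):
        try:
            # Accessing .objects triggers the page parse; no word extraction involved.
            page.objects
        except Exception as ex:  # deliberately broad: we only want to locate the page
            failures[page_num] = f"{type(ex).__name__}: {ex}"
    return failures


def check_file(path: str) -> Dict[int, str]:
    """Hypothetical convenience wrapper; imports pdfplumber only when called."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return find_failing_pages(pdf.pages)
```

Calling `check_file("your.pdf")` (the path is a placeholder) would return an empty dict if every page parses, or the failing page numbers with their error messages otherwise.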

Without seeing the PDF itself, I don't know whether this is something the folks running the pdfminer.six repository will be able to fix, or whether the PDF is just malformed. One way to check: Try repairing the PDF. Does the repaired PDF process OK?

Laubeee commented 1 week ago

Stumbled upon this again today... I see you have added an open(... repair=True) option. This seems to fix it in my case, so thanks! :)
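For anyone landing here, a rough sketch of how the `repair=True` option could be used as a fallback: try a normal open first, and reopen with repair only if extraction raises. The function name and the `opener` parameter are hypothetical, added so the logic can be exercised without a real PDF. The broad `except` is an assumption for illustration and could be narrowed to `TypeError` to match the traceback above.

```python
from typing import Any, Callable, List, Optional


def extract_words_with_fallback(
    path: str,
    opener: Optional[Callable[..., Any]] = None,
    **word_kwargs: Any,
) -> List[Any]:
    """Extract words per page, retrying with repair=True if the first pass fails."""
    if opener is None:
        import pdfplumber  # imported lazily so the sketch loads without it

        opener = pdfplumber.open

    def _extract(pdf: Any) -> List[Any]:
        return [page.extract_words(**word_kwargs) for page in pdf.pages]

    try:
        with opener(path) as pdf:
            return _extract(pdf)
    except Exception:
        # Assumption: any failure is worth one retry on a repaired copy.
        with opener(path, repair=True) as pdf:
            return _extract(pdf)
```

Repairing only on failure leaves well-formed PDFs untouched, which matters given that repair rewrites the input.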

FYI: I also saw that the Superuser thread mentions that default might be better suited than prepress, as per the Ghostscript docs:

Please be aware that the /prepress setting does not indicate the highest quality conversion. Using any of these presets will involve altering the input, and as such may result in a PDF of poorer quality (compared to the input) than simply using the defaults. The 'best' quality (where best means closest to the original input) is obtained by not setting this parameter at all (or by using /default).

Perhaps it's worth parameterizing this? It would certainly need some test cases.
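To make the suggestion concrete, here is a sketch of what a parameterized repair command might look like. This is not pdfplumber's actual implementation. The function name, the `setting` parameter and the `gs` executable name are assumptions for illustration. The `-o`, `-sDEVICE=pdfwrite` and `-dPDFSETTINGS` options are standard Ghostscript switches.

```python
from typing import List, Optional

# Preset names accepted by Ghostscript's -dPDFSETTINGS switch.
VALID_SETTINGS = ("default", "screen", "ebook", "printer", "prepress")


def build_gs_repair_args(
    in_path: str,
    out_path: str,
    setting: Optional[str] = None,
    gs_executable: str = "gs",  # assumption: Ghostscript is on PATH under this name
) -> List[str]:
    """Build a Ghostscript argument list that rewrites a PDF via pdfwrite.

    With setting=None the -dPDFSETTINGS switch is omitted entirely, which the
    quoted docs describe as staying closest to the original input.
    """
    if setting is not None and setting not in VALID_SETTINGS:
        raise ValueError(f"setting must be one of {VALID_SETTINGS}, got {setting!r}")

    args = [gs_executable, "-o", out_path, "-sDEVICE=pdfwrite"]
    if setting is not None:
        args.append(f"-dPDFSETTINGS=/{setting}")
    args.append(in_path)
    return args
```

The resulting list could be passed to `subprocess.run(args, check=True)`. Returning the argument list separately from running it also makes the test cases mentioned above straightforward to write.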