jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
5.99k stars, 618 forks

unsupported operand type(s) for *: 'float' and 'PSLiteral' #1148

Closed sasuke00 closed 1 week ago

sasuke00 commented 4 weeks ago

This is the first code I tried. It processes a single page of the PDF and runs successfully:

import pdfplumber

pdf = pdfplumber.open(pdf_path, repair=True)  # pdf_path: path to the PDF file
page = pdf.pages[1]  # pages are 0-indexed, so this is the second page
words_block = page.extract_words(x_tolerance=25, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=True, use_text_flow=False, extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)

However, when I change the code to loop through all the pages of the PDF (it has more than one page), it raises an error:

with pdfplumber.open(pdf_path, repair=True) as pdf:
    for page in pdf.pages:
        words_block = page.extract_words(x_tolerance=25, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=True, use_text_flow=False, extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)

This is the error message:

Python39\lib\site-packages\pdfminer\utils.py", line 244, in mult_matrix 
    a0 * a1 + c0 * b1,
TypeError: unsupported operand type(s) for *: 'float' and 'PSLiteral'

I traced the error back to the extract_words() call.
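One way to narrow this down is to run the same extraction page by page and record which pages fail, instead of stopping at the first exception. The sketch below is illustrative only: `extract_words_per_page` is a hypothetical helper, not part of pdfplumber, and it assumes only that each page object exposes `extract_words(**kwargs)` as in the snippets above.

```python
def extract_words_per_page(pages, **extract_kwargs):
    """Call extract_words() on each page, collecting failures instead of raising.

    Hypothetical helper, not part of pdfplumber. Returns (words, failures),
    two dicts keyed by 0-based page index.
    """
    words = {}
    failures = {}
    for index, page in enumerate(pages):
        try:
            words[index] = page.extract_words(**extract_kwargs)
        except TypeError as exc:
            # The reported error is a TypeError, so only that type is caught;
            # anything else still propagates.
            failures[index] = exc
    return words, failures
```

Inside the `with pdfplumber.open(...)` block, this could be called as `extract_words_per_page(pdf.pages, x_tolerance=25, y_tolerance=3)`. The keys of `failures` would then point at the pages worth inspecting or sharing.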

jsvine commented 4 weeks ago

Given the stack trace you're sharing, this seems like the error arises from pdfminer.six (the main dependency of pdfplumber). There's probably not much pdfplumber can do to fix this, unfortunately. If you share the PDF, however, I can let you know more definitively.

sasuke00 commented 2 weeks ago

Hi @jsvine, I found that the issue can be solved by setting the Ghostscript path, so the problem probably comes from repair=True. Although the problem is fixed, I would still like to know whether there is an alternative way to solve it that does not rely on Ghostscript.

Problem solved using gs_path:

with pdfplumber.open(pdf_path, repair=True, gs_path=r"C:\Program Files\gs\gs10.02.0\bin\gswin64.exe") as pdf:
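To avoid hard-coding a version-specific install path, the executable could be looked up at runtime. This is a sketch under assumptions: `find_ghostscript` is a hypothetical helper, the executable names and the Windows glob pattern are guesses modeled on the path above, and pdfplumber may already perform a similar PATH lookup on its own when `gs_path` is omitted.

```python
import glob
import shutil

# Assumed executable names and install location; adjust for the local setup.
DEFAULT_NAMES = ("gs", "gswin64c", "gswin32c", "gswin64")
DEFAULT_PATTERNS = (r"C:\Program Files\gs\gs*\bin\gswin64*.exe",)


def find_ghostscript(names=DEFAULT_NAMES, patterns=DEFAULT_PATTERNS):
    """Return a path to a Ghostscript executable, or None if none is found."""
    # First try the names on PATH.
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    # Then try known install locations, whatever the version directory is called.
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        if matches:
            return matches[-1]  # last match in lexicographic order
    return None
```

When the helper returns a path, it could be passed as `gs_path=find_ghostscript()` together with `repair=True`. This still relies on Ghostscript being installed; it only removes the hard-coded path.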

jsvine commented 1 week ago

From what you describe, it seems that there's an error in the PDF itself, which is causing problems for pdfminer.six. The main solutions would be repair=True or modifying pdfminer.six so that it can handle that particular type of error. Closing this issue for now, since there's not much pdfplumber can do, but feel free to continue the discussion.
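To illustrate what handling that particular type of error might involve, here is a standalone sketch. It is not pdfminer.six code. The traceback shows a matrix product failing because one entry is a `PSLiteral` rather than a number. Below, `FakeLiteral` is a hypothetical stand-in for such an entry, `mult_matrix` is the standard composition of two 2D affine matrices written as 6-tuples (assumed, from the traceback line, to resemble what pdfminer.six computes), and `safe_matrix` is an invented helper showing one possible policy: fall back to the identity matrix when an entry is not numeric.

```python
from numbers import Real

IDENTITY = (1, 0, 0, 1, 0, 0)


class FakeLiteral:
    """Hypothetical stand-in for a non-numeric PDF object such as a PSLiteral."""

    def __init__(self, name):
        self.name = name


def mult_matrix(m1, m0):
    """Compose two affine matrices given as (a, b, c, d, e, f) tuples."""
    a1, b1, c1, d1, e1, f1 = m1
    a0, b0, c0, d0, e0, f0 = m0
    return (
        a0 * a1 + c0 * b1,
        b0 * a1 + d0 * b1,
        a0 * c1 + c0 * d1,
        b0 * c1 + d0 * d1,
        a0 * e1 + c0 * f1 + e0,
        b0 * e1 + d0 * f1 + f0,
    )


def safe_matrix(matrix, default=IDENTITY):
    """Return the matrix if it holds six real numbers, otherwise the default."""
    values = tuple(matrix)
    if len(values) == 6 and all(
        isinstance(v, Real) and not isinstance(v, bool) for v in values
    ):
        return values
    return default
```

Multiplying a matrix that contains a `FakeLiteral` raises the same kind of `TypeError` as in the report, while passing it through `safe_matrix` first does not. Whether silently substituting the identity matrix is acceptable depends on the PDF; rendering positions on the affected page could be wrong, which is why repairing the file is the more conservative route.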