Closed sasuke00 closed 1 week ago
Given the stack trace you're sharing, it seems the error arises from pdfminer.six (the main dependency of pdfplumber). There's probably not much pdfplumber can do to fix this, unfortunately. If you share the PDF, however, I can let you know more definitively.
Hi @jsvine, I have found that the issue can be solved by setting the Ghostscript path; the problem probably comes from repair=True. Although the problem has been fixed, I would still like to know whether there is an alternative way to solve it that does not rely on Ghostscript.
Problem solved using gs_path:
with pdfplumber.open(pdf_path, repair=True, gs_path=r"C:\Program Files\gs\gs10.02.0\bin\gswin64.exe") as pdf:
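For anyone who would rather not hard-code the install directory, here is a minimal sketch that looks for a Ghostscript executable on PATH and passes the result to gs_path. The helper names find_ghostscript and open_repaired are hypothetical (not part of pdfplumber), and the candidate executable names are an assumption about a typical install; only pdfplumber.open with repair=True and gs_path comes from the snippet above.

```python
import shutil


def find_ghostscript(candidates=("gs", "gswin64c", "gswin32c")):
    """Return the first Ghostscript executable found on PATH, or None.

    The candidate names are assumptions: "gs" on Linux/macOS and the
    console builds "gswin64c"/"gswin32c" on Windows.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def open_repaired(pdf_path, gs_path=None):
    """Open a PDF with repair=True, using an explicit or discovered gs_path.

    Hypothetical convenience wrapper; pdfplumber is imported lazily so the
    lookup helper above can be used without it installed.
    """
    import pdfplumber

    gs_path = gs_path or find_ghostscript()
    if gs_path is None:
        raise FileNotFoundError(
            "Ghostscript executable not found; pass gs_path explicitly."
        )
    return pdfplumber.open(pdf_path, repair=True, gs_path=gs_path)
```

This only changes how the executable is located; the repair step itself still depends on Ghostscript being installed.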
From what you describe, it seems that there's an error in the PDF itself, which is causing problems for pdfminer.six. The main solutions would be repair=True or modifying pdfminer.six so that it can handle that particular type of error. Closing this issue for now, since there's not much pdfplumber can do, but feel free to continue the discussion.
This is the first code I tried. It processes a single page and handles the PDF successfully:
However, when I change the code to loop through all the pages of the PDF (it has more than one page), it raises an error.
This is the error message:
I traced the error back to the extract_words() method.
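One way to narrow down which page triggers the failure is to call extract_words() on each page inside a try/except and record the pages that raise. This is only a sketch: extract_words_by_page is a hypothetical helper, not the code from the original report, and it assumes the error is raised inside extract_words() rather than while the page list itself is being built.

```python
def extract_words_by_page(pages):
    """Call extract_words() on each page and separate successes from failures.

    `pages` is any iterable of objects with an extract_words() method,
    for example the `pages` attribute of an opened pdfplumber PDF.
    Returns (words_by_page, failures), both keyed by 1-based page number.
    """
    words_by_page = {}
    failures = {}
    for number, page in enumerate(pages, start=1):
        try:
            words_by_page[number] = page.extract_words()
        except Exception as exc:  # keep going so every bad page is reported
            failures[number] = exc
    return words_by_page, failures


def report_failing_pages(pdf_path):
    """Print the page numbers on which extract_words() raised."""
    import pdfplumber  # imported lazily; required only for real PDFs

    with pdfplumber.open(pdf_path) as pdf:
        _, failures = extract_words_by_page(pdf.pages)
    for number, exc in failures.items():
        print(f"page {number}: {type(exc).__name__}: {exc}")
```

Knowing the failing page number makes it easier to share a reduced PDF or to check whether repair=True is needed for the whole file.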