Closed Laubeee closed 1 year ago
Hi @Laubeee, and thanks for your interest in this library. Based on the traceback, this seems to be an issue with pdfminer.six (pdfplumber's main dependency), and thus unfortunately unfixable by pdfplumber.
The .extract_words part seems to be a red herring; it's likely just that when you call .extract_words(...), this is the first time that pdfplumber is asking pdfminer.six for the precise data for that particular page. You can check this by swapping out page.extract_words(...) for page.objects.
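The check described above can be sketched roughly as follows. This is a minimal sketch, not part of pdfplumber: the helper names find_failing_pages and check_pdf are hypothetical, and it assumes that reading page.objects is enough to trigger the same pdfminer.six parsing that .extract_words(...) would.

```python
def find_failing_pages(pages):
    """Return (page_index, exception) pairs for pages whose objects fail to load.

    `pages` is any iterable of page-like objects exposing an `.objects`
    attribute (for example `pdf.pages` from pdfplumber).
    """
    failures = []
    for index, page in enumerate(pages):
        try:
            # Reading .objects asks for the page's parsed layout data,
            # without going through extract_words at all.
            page.objects
        except Exception as exc:  # broad on purpose: we only want to locate the page
            failures.append((index, exc))
    return failures


def check_pdf(path):
    """Hypothetical convenience wrapper; needs pdfplumber installed."""
    import pdfplumber  # imported lazily so the helper above has no dependencies

    with pdfplumber.open(path) as pdf:
        return find_failing_pages(pdf.pages)
```

If the same exception shows up here for the third page only, that would point at the page data (or pdfminer.six's handling of it) rather than at the word-extraction settings.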
Without seeing the PDF itself, I don't know whether this is something the folks running the pdfminer.six repository will be able to fix, or whether the PDF is just malformed. One way to check: Try repairing the PDF. Does the repaired PDF process OK?
Stumbled upon this again today... I see you have added the open(... repair=True) option. This seems to fix it in my case, so thanks! :)
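For readers who hit the same error, one possible way to use the repair option is as a fallback: try the file as-is, and reopen with repair=True only if extraction fails. This is a sketch under assumptions, not a recommended pattern from the library. The helper names are hypothetical, the opener is passed in (for example pdfplumber.open) so the logic does not depend on a particular install, and as far as I understand the repair step relies on Ghostscript being available.

```python
def words_per_page(path, open_pdf, **open_kwargs):
    """Open `path` with `open_pdf` and return one word list per page."""
    with open_pdf(path, **open_kwargs) as pdf:
        return [page.extract_words() for page in pdf.pages]


def words_with_repair_fallback(path, open_pdf):
    """Try the PDF unmodified first; retry with repair=True on any failure."""
    try:
        return words_per_page(path, open_pdf)
    except Exception:
        # Broad catch is an illustration choice: the pdfminer.six error in
        # this thread surfaced as a TypeError, but other failures are possible.
        return words_per_page(path, open_pdf, repair=True)


# Usage (assumes pdfplumber is installed; the file name is a placeholder):
#   import pdfplumber
#   words = words_with_repair_fallback("example.pdf", pdfplumber.open)
```

Trying the original file first avoids altering PDFs that already parse cleanly, since repairing rewrites the input.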
FYI: I also saw that the Superuser thread mentions that default might be better suited than prepress, as per the Ghostscript docs:
Please be aware that the /prepress setting does not indicate the highest quality conversion. Using any of these presets will involve altering the input, and as such may result in a PDF of poorer quality (compared to the input) than simply using the defaults. The 'best' quality (where best means closest to the original input) is obtained by not setting this parameter at all (or by using /default).
Perhaps it's worth parameterizing this? It would certainly need some test cases.
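To make the parameterization idea concrete, here is a rough sketch of building the Ghostscript repair command with a selectable preset. It is not pdfplumber's actual implementation: the function name, its arguments, and the "gs" executable name are assumptions for illustration. The flags themselves (-o, -sDEVICE=pdfwrite, -dPDFSETTINGS) are standard Ghostscript options.

```python
# Presets accepted by Ghostscript's -dPDFSETTINGS option.
VALID_SETTINGS = ("default", "screen", "ebook", "printer", "prepress")


def build_repair_command(infile, outfile, setting="default", gs_path="gs"):
    """Build (but do not run) a Ghostscript command that rewrites a PDF.

    `setting` picks the -dPDFSETTINGS preset; "default" is the choice the
    quoted Ghostscript docs describe as closest to the original input.
    """
    if setting not in VALID_SETTINGS:
        raise ValueError(f"setting must be one of {VALID_SETTINGS}, got {setting!r}")
    return [
        gs_path,
        "-o", str(outfile),           # output file (also implies batch mode)
        "-sDEVICE=pdfwrite",          # re-emit the document as a PDF
        f"-dPDFSETTINGS=/{setting}",  # preset chosen by the caller
        str(infile),
    ]


# Running it would look like this (requires Ghostscript on the PATH):
#   import subprocess
#   subprocess.run(build_repair_command("in.pdf", "out.pdf"), check=True)
```

Keeping command construction separate from execution would also make the test cases mentioned above straightforward, since the argument list can be checked without invoking Ghostscript.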
Describe the bug
I'm getting
unsupported operand type(s) for +: 'PSLiteral' and 'list'
in page.extract_words(x_tolerance=1, y_tolerance=1, horizontal_ltr=True, vertical_ttb=False, keep_blank_chars=True, use_text_flow=True).
The PDF has 3 pages and the first 2 do not encounter this error. Other PDFs work fine.
Code to reproduce the problem
PDF file
Unfortunately I cannot share the PDF from the customer (contains sensitive graphical content)
Expected behavior
word extraction without exception
Actual behavior
method threw an exception
Environment
Additional context