Open ValentinaGalataAA opened 3 months ago
There seems to be a bug in the latest release (https://github.com/pdfminer/pdfminer.six/issues/1004), which also happens to be throwing errors in pdfplumber's test suite. I'll keep an eye out for pdfminer.six's next release, which hopefully fixes the bug.
I fixed the bug :) https://github.com/pdfminer/pdfminer.six/pull/1027 hopefully it gets released soon!
@dhdaines Wonderful, thanks!
@jsvine would you consider upgrading this dependency before the next release of pdfminer.six?
The project I'm working on uses pdfplumber in production, and when parsing the following PDF, https://www.ge.com/sites/default/files/ge2021_sustainability_report.pdf, it raises TypeError: 'PDFObjRef' object is not iterable.
I tested locally that pdfminer.six 20240706 could solve the issue. (I forced pdfplumber 0.10.2 and pdfminer.six 20240706 to coexist in order to verify it. However, I couldn't do that in the project code because poetry is used there.)
Hi @chenxi-briink, can you try upgrading pdfplumber to the latest version, 0.11.3? Using that version, I'm able to parse the PDF you've cited with no problems/errors.
Hi @jsvine,
Sorry, I mistyped the version number in my previous message.
"I forced pdfplumber 0.10.2 and pdfminer.six 20240706" should be: I forced pdfplumber 0.11.3 and pdfminer.six 20240706 to coexist.
Yes, that combination works for me.
However, the issue is that the requirements.txt of pdfplumber depends on pdfminer.six 20231228, and it is the latter that throws this exception:
File ~/foo/bar/.venv/lib/python3.11/site-packages/pdfminer/pdftypes.py:373, in PDFStream.decode(self)
371 raise PDFNotImplementedError("Unsupported filter: %r" % f)
372 # apply predictors
--> 373 if params and "Predictor" in params:
374 pred = int_value(params["Predictor"])
375 if pred == 1:
376 # no predictor
TypeError: argument of type 'PDFObjRef' is not iterable
In my production environment, where poetry is used, I couldn't override the pinned pdfminer.six version 20231228.
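For readers unfamiliar with this kind of error, here is a minimal sketch of the failure mode shown in the traceback: a membership test (`in`) applied to an unresolved indirect reference instead of the dictionary it points to. `FakeObjRef`, `resolve_all`, and `has_predictor` are hypothetical stand-ins written for illustration; they are not pdfminer.six's classes, and this is not necessarily how the upstream fix is implemented.

```python
class FakeObjRef:
    """Hypothetical stand-in for an indirect reference; not pdfminer's class."""

    def __init__(self, target):
        self.target = target

    def resolve(self):
        return self.target


def resolve_all(obj):
    """Follow stand-in references until a concrete object is reached."""
    while isinstance(obj, FakeObjRef):
        obj = obj.resolve()
    return obj


def has_predictor(params):
    """Resolve the reference first, then run the membership test."""
    params = resolve_all(params)
    return bool(params) and "Predictor" in params


params = FakeObjRef({"Predictor": 12})

try:
    "Predictor" in params  # membership test on the unresolved reference
    failed = False
except TypeError:
    # The stand-in defines no __contains__ or __iter__, so `in` raises TypeError.
    failed = True
```

With the reference resolved first, `has_predictor(params)` can inspect the underlying dictionary instead of raising.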
Hi @chenxi-briink and thanks for the clarification. That's strange; I'm running the exact same combination and seeing no error. First, I set up this fresh environment:
python -m venv venv
source venv/bin/activate
pip install pdfplumber==0.11.3
pip freeze | grep pdf
... which outputs:
pdfminer.six==20231228
pdfplumber==0.11.3
pypdfium2==4.30.0
Then I ran this:
import pdfplumber
pdf = pdfplumber.open("./ge2021_sustainability_report.pdf")
for page in pdf.pages:
    assert len(pdf.objects)
... which completed without error.
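The `pip freeze | grep pdf` check above can also be done from inside Python, which may help when comparing environments. This is a small sketch using only the standard library; the helper name `installed_versions` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    wanted = ["pdfplumber", "pdfminer.six", "pypdfium2"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name}=={ver}" if ver else f"{name} is not installed")
```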
Hi @jsvine,
Gee, by trying to replicate what you posted, I realised that the file I have is a modified version of the publicly available one I shared with you. With this modified file, the exception occurs when following the same steps you shared. (Sorry that I didn't double-check; I didn't expect there to be a modified version.)
I uploaded this file to a publicly accessible GDrive folder. It's basically a shortened version of the original GE 2021 Sustainability Report. A PDF viewer can render it without problems.
Thanks for providing the updated PDF, @chenxi-briink. Using that one, I can indeed replicate the error.
In this case, however, I don't plan on upgrading the dependency until at least the next pdfminer.six release. Although doing so might fix your situation, it will likely break others (as confirmed by pdfplumber's test suite). @dhdaines's fix in https://github.com/pdfminer/pdfminer.six/pull/1027 handles your PDF well; perhaps you can use his fork in the meantime?
As context: pdfminer.six is a pinned dependency in pdfplumber because changes to that library can break this one. I realize it can cause issues when someone wants to use a different specific version of pdfminer.six, but that tradeoff is preferable to all new installations of pdfplumber breaking.
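To illustrate what an exact pin implies, here is a deliberately simplified sketch: an `==` pin accepts one version and rejects every other, which is why a newer pdfminer.six cannot be installed alongside it without forcing. The helpers `parse_pin` and `satisfies_pin` are hypothetical, and the plain string comparison is an assumption made for brevity; real resolvers such as pip or poetry normalize and compare versions more carefully.

```python
def parse_pin(requirement):
    """Split an exact pin such as 'name==version' into (name, version)."""
    name, sep, pinned = requirement.partition("==")
    if not sep or not name.strip() or not pinned.strip():
        raise ValueError(f"not an exact pin: {requirement!r}")
    return name.strip(), pinned.strip()


def satisfies_pin(requirement, candidate_version):
    """Return True only if the candidate version is exactly the pinned one."""
    _, pinned = parse_pin(requirement)
    return candidate_version == pinned


# The pin reported earlier in this thread for pdfplumber 0.11.3.
PIN = "pdfminer.six==20231228"
```

Under these assumptions, `satisfies_pin(PIN, "20240706")` is false, matching the conflict described above.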
Hi @jsvine, I totally understand the rationale for not upgrading. Thanks for explaining and pointing me to @dhdaines's fork; I might find some time to give it a try.
Please update the version of pdfminer-six to 20240706.