Update version of `pdfminer-six` to `20240706`

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License

Update version of `pdfminer-six` to `20240706` #1166

Open ValentinaGalataAA opened 3 months ago

ValentinaGalataAA commented 3 months ago

Please update the version of pdfminer-six to 20240706.

jsvine commented 3 months ago

There seems to be a bug in the latest release — https://github.com/pdfminer/pdfminer.six/issues/1004 — which also happens to be throwing errors in pdfplumber's test suite. I'll keep an eye out for pdfminer.six's next release, which hopefully fixes the bug.

dhdaines commented 3 months ago

I fixed the bug :) https://github.com/pdfminer/pdfminer.six/pull/1027 hopefully it gets released soon!

jsvine commented 3 months ago

@dhdaines Wonderful, thanks!

chenxi-briink commented 2 months ago

@jsvine would you consider upgrading this dependency before the next release of pdfminer.six?

pdfminer.six has a release cycle of about 5-6 months, so it could mean another 5 months until the next release, which is a bit too long imo.
The current version throws similar errors too, which is what I encountered (please see below).

The project I'm working on uses pdfplumber in production, and when parsing the following PDF https://www.ge.com/sites/default/files/ge2021_sustainability_report.pdf, it raises TypeError: 'PDFObjRef' object is not iterable

I tested locally that pdfminer.six 20240706 could solve the issue. (I forced pdfplumber 0.10.2 and pdfminer.six 20240706 to coexist in order to verify it. However I couldn't do that in the project code because poetry is used there)

jsvine commented 2 months ago

Hi @chenxi-briink, can you try upgrading pdfplumber to the latest version, 0.11.3? Using that version, I'm able to parse the PDF you've cited with no problems/errors.

chenxi-briink commented 2 months ago

Hi @jsvine,

Sorry, I mistyped the version number in my previous message.

I forced pdfplumber 0.10.2 and pdfminer.six 20240706

should read: I forced pdfplumber 0.11.3 and pdfminer.six 20240706 to coexist.

Yes, that combination works for me.

The issue, however, is that pdfplumber's requirements.txt pins pdfminer.six 20231228, and it is that version that throws this exception:

File ~/foo/bar/.venv/lib/python3.11/site-packages/pdfminer/pdftypes.py:373, in PDFStream.decode(self)
    371     raise PDFNotImplementedError("Unsupported filter: %r" % f)
    372 # apply predictors
--> 373 if params and "Predictor" in params:
    374     pred = int_value(params["Predictor"])
    375     if pred == 1:
    376         # no predictor

TypeError: argument of type 'PDFObjRef' is not iterable

In my production environment, which uses Poetry, I can't override the pinned pdfminer.six version 20231228.
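For readers unfamiliar with this kind of failure, here is a minimal sketch of why a membership test on an unresolved reference raises `TypeError`, and how resolving the reference first avoids it. `FakeObjRef` and both helper functions are hypothetical stand-ins written for illustration; they are not pdfminer.six's classes, and this is not necessarily how the upstream fix is implemented.

```python
class FakeObjRef:
    """Hypothetical stand-in for an unresolved indirect reference.

    It defines neither __contains__ nor __iter__, so `x in ref` fails.
    """

    def __init__(self, target):
        self.target = target

    def resolve(self):
        return self.target


def has_predictor_naive(params):
    # Mirrors the failing line in the traceback: the membership test is
    # applied directly to whatever object was stored as the parameters.
    return bool(params) and "Predictor" in params


def has_predictor_resolved(params):
    # Resolve the reference first, then test membership on the real dict.
    if isinstance(params, FakeObjRef):
        params = params.resolve()
    return isinstance(params, dict) and "Predictor" in params


ref = FakeObjRef({"Predictor": 12, "Columns": 4})

try:
    has_predictor_naive(ref)
    naive_failed = False
except TypeError:
    naive_failed = True

resolved_ok = has_predictor_resolved(ref)
print(naive_failed, resolved_ok)
```

The naive check only works when the parameters are already a plain dictionary, which may explain why the public PDF parses cleanly while the modified one does not.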

jsvine commented 2 months ago

Hi @chenxi-briink and thanks for the clarification. That's strange; I'm running the exact same combination and seeing no error. First, I set up this fresh environment:

python -m venv venv
source venv/bin/activate
pip install pdfplumber==0.11.3
pip freeze | grep pdf

... which outputs:

pdfminer.six==20231228
pdfplumber==0.11.3
pypdfium2==4.30.0

Then I ran this:

import pdfplumber

pdf = pdfplumber.open("./ge2021_sustainability_report.pdf")

# Touching .objects forces the pages to be parsed, which is where the error would surface
for page in pdf.pages:
    assert len(pdf.objects)

... which completed without error.
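When comparing environments like this, the same version check can be done from inside Python, which helps if the interpreter running the code is not the one `pip` points at. This is a small sketch using only the standard library; the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    wanted = ["pdfplumber", "pdfminer.six", "pypdfium2"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name}=={ver}" if ver else f"{name} is not installed")
```

Running it in both environments and diffing the output is a quick way to confirm that two setups really use the same combination.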

chenxi-briink commented 2 months ago

Hi @jsvine,

Gee, while trying to replicate what you posted, I realised that the file I have is actually a modified version of the publicly available one I shared with you. With this modified file, the exception does occur when I follow the same steps you shared. (Sorry that I didn't double-check; I didn't expect there to be a modified version.)

I uploaded this file to a publicly accessible GDrive folder; it's basically a shortened version of the original GE 2021 Sustainability Report. A PDF viewer can render it without problems.

jsvine commented 2 months ago

Thanks for providing the updated PDF, @chenxi-briink. Using that one, I can indeed replicate the error.

In this case, however, I don't plan on upgrading the dependency until at least the next pdfminer.six release. Although doing so might fix your situation, it would likely break others (as confirmed by pdfplumber's test suite). @dhdaines's fix in https://github.com/pdfminer/pdfminer.six/pull/1027 handles your PDF well; perhaps you can use his fork in the meantime?

As context: pdfminer.six is a pinned dependency in pdfplumber because changes to that library can break this one. I realize that causes issues when someone wants to use a different specific version of pdfminer.six, but that tradeoff is preferable to all new installations of pdfplumber breaking.
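To make the effect of an exact pin concrete, here is a deliberately simplified sketch of how an `==` requirement rules out every other version. The helpers `parse_exact_pin` and `satisfies_pin` are hypothetical; real resolvers such as pip's or Poetry's follow the full version-specifier rules and are far more involved.

```python
def parse_exact_pin(requirement):
    """Split 'name==version' into (name, version); only exact pins are handled."""
    name, sep, pinned = requirement.partition("==")
    if not sep or not pinned.strip():
        raise ValueError(f"not an exact pin: {requirement!r}")
    return name.strip(), pinned.strip()


def satisfies_pin(requirement, candidate_version):
    """True only when the candidate matches the pinned version exactly."""
    _, pinned = parse_exact_pin(requirement)
    return candidate_version == pinned


# The pin reported in this thread for pdfplumber 0.11.3.
PIN = "pdfminer.six==20231228"

for candidate in ("20231228", "20240706"):
    status = "accepted" if satisfies_pin(PIN, candidate) else "rejected"
    print(f"pdfminer.six {candidate}: {status}")
```

Under this simplified model, a project that depends on pdfplumber cannot also require pdfminer.six 20240706, which is the conflict a strict resolver reports.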

chenxi-briink commented 2 months ago

Hi @jsvine, I totally understand the rationale for not upgrading. Thanks for explaining and for pointing me to @dhdaines's fork; I might find some time to give it a try.