jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

'LTChar' object has no attribute 'graphicstate' Error in a docker container #655

Closed has-abi closed 2 years ago

has-abi commented 2 years ago

I'm using pdfplumber to extract text from PDF pages using the crop function. It works fine on my local system, but when I deploy the project in a Docker container it gives the error: 'LTChar' object has no attribute 'graphicstate'

Example of the Error: image

jsvine commented 2 years ago

Hm, interesting! Haven’t seen this one before. It’s a bit hard to diagnose without more information. Can you supply some/all of the following?:

has-abi commented 2 years ago

It seems to be an error only in the latest version, 0.6.2: I installed version 0.6.1 and it works in both environments. To help get the bug fixed in 0.6.2, here is my Dockerfile:

FROM python:3.8

WORKDIR /usr/src/app

RUN pip install pipenv

RUN apt-get update && \
    apt-get install -y poppler-utils && \
    apt-get install -y libgl1

COPY ./Pipfile /usr/src/app/Pipfile
COPY ./Pipfile.lock /usr/src/app/Pipfile.lock

RUN pipenv install --system --deploy --ignore-pipfile

COPY . /usr/src/app

CMD flask run -h 0.0.0.0 -p 5000 

Here are the local dependencies with pip freeze:

astroid==2.11.5
attrs==21.4.0
beautifulsoup4==4.11.1
blis==0.7.7
catalogue==2.0.7
certifi==2021.10.8
cffi==1.15.0
chardet==4.0.0
charset-normalizer==2.0.12
click==8.1.3
cryptography==37.0.2
cycler==0.11.0
cymem==2.0.6
dill==0.3.4
docx2txt==0.8
filelock==3.7.0
flasgger==0.9.5
Flask==2.1.2
Flask-Cors==3.0.10
fonttools==4.33.3
gdown==4.4.0
http-constants==0.5.0
huggingface-hub==0.6.0
idna==3.3
importlib-metadata==4.11.3
importlib-resources==5.7.1
isort==5.10.1
itsdangerous==2.1.2
jarowinkler==1.0.2
Jinja2==3.1.2
joblib==1.1.0
jsonschema==4.5.1
kiwisolver==1.4.2
langcodes==3.3.0
lazy-object-proxy==1.7.1
MarkupSafe==2.1.1
matplotlib==3.5.2
mccabe==0.7.0
mistune==2.0.2
murmurhash==1.0.7
mypy-extensions==0.4.3
numpy==1.22.3
opencv-python==4.5.5.64
packaging==21.3
pandas==1.4.2
pathy==0.6.1
pdf2image==1.16.0
pdfminer==20191125
pdfminer.six==20220319
pdfplumber==0.6.2
pep8==1.7.1
Pillow==9.1.0
platformdirs==2.5.2
preshed==3.0.6
pycparser==2.21
pycryptodome==3.14.1
pydantic==1.8.2
pylint==2.13.9
pyparsing==3.0.9
pyrsistent==0.18.1
PySocks==1.7.1
python-dateutil==2.8.2
python-dotenv==0.20.0
pytz==2022.1
PyYAML==6.0
rapidfuzz==2.0.11
regex==2022.4.24
requests==2.27.1
sacremoses==0.0.53
scipy==1.8.0
seaborn==0.11.2
six==1.16.0
smart-open==5.2.1
soupsieve==2.3.2.post1
spacy==3.3.0
spacy-legacy==3.0.9
spacy-loggers==1.0.2
srsly==2.4.3
thinc==8.0.15
tokenizers==0.12.1
toml==0.10.2
tomli==2.0.1
torch==1.11.0
torchvision==0.12.0
tqdm==4.64.0
transformers==4.19.1
typer==0.4.1
typing_extensions==4.2.0
urllib3==1.26.9
Wand==0.6.7
wasabi==0.9.1
Werkzeug==2.1.2
wrapt==1.14.1
zipp==3.8.0

Here are the dependencies in the Docker container:

astroid==2.11.5
attrs==21.4.0
beautifulsoup4==4.11.1
blis==0.7.7
catalogue==2.0.7
certifi==2021.10.8
cffi==1.15.0
chardet==4.0.0
charset-normalizer==2.0.12
click==8.1.3
cryptography==37.0.2
cycler==0.11.0
cymem==2.0.6
dill==0.3.4
distlib==0.3.4
docx2txt==0.8
filelock==3.7.0
flasgger==0.9.5
Flask==2.1.2
Flask-Cors==3.0.10
fonttools==4.33.3
gdown==4.4.0
http-constants==0.5.0
huggingface-hub==0.6.0
idna==3.3
importlib-metadata==4.11.3
importlib-resources==5.7.1
isort==5.10.1
itsdangerous==2.1.2
jarowinkler==1.0.2
Jinja2==3.1.2
jsonschema==4.5.1
kiwisolver==1.4.2
langcodes==3.3.0
lazy-object-proxy==1.7.1
MarkupSafe==2.1.1
matplotlib==3.5.2
mccabe==0.7.0
mistune==2.0.2
murmurhash==1.0.7
numpy==1.22.3
opencv-python==4.5.5.64
packaging==21.3
pandas==1.4.2
pathy==0.6.1
pdf2image==1.16.0
pdfminer==20191125
pdfminer.six==20220319
pdfplumber==0.6.2
pep8==1.7.1
Pillow==9.1.0
pipenv==2022.5.2
platformdirs==2.5.2
preshed==3.0.6
pycparser==2.21
pycryptodome==3.14.1
pydantic==1.8.2
pylint==2.13.9
pyparsing==3.0.9
pyrsistent==0.18.1
PySocks==1.7.1
python-dateutil==2.8.2
python-dotenv==0.20.0
pytz==2022.1
PyYAML==6.0
rapidfuzz==2.0.11
regex==2022.4.24
requests==2.27.1
scipy==1.8.0
seaborn==0.11.2
six==1.16.0
smart-open==5.2.1
soupsieve==2.3.2.post1
spacy==3.3.0
spacy-legacy==3.0.9
spacy-loggers==1.0.2
srsly==2.4.3
thinc==8.0.15
tokenizers==0.12.1
tomli==2.0.1
torch==1.11.0
torchvision==0.12.0
tqdm==4.64.0
transformers==4.19.1
typer==0.4.1
typing_extensions==4.2.0
urllib3==1.26.9
virtualenv==20.14.1
virtualenv-clone==0.5.7
Wand==0.6.7
wasabi==0.9.1
Werkzeug==2.1.2
wrapt==1.14.1
zipp==3.8.0

jsvine commented 2 years ago

Thanks for these details. And strange! One thing I noticed: You seem to have both pdfminer.six (pdfplumber's main dependency) and pdfminer (that project's earlier incarnation) installed. Can you remove pdfminer from your Pipfile? (Or do some of your direct dependencies require that older project?) And, if so, do you still get the error?
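
A minimal sketch of the check described above, assuming the input is `pip freeze` output like the lists posted earlier in this thread. The helper name `find_pdfminer_conflict` is hypothetical and not part of pdfplumber or pdfminer.six. Both distributions install a top-level package named `pdfminer`, so one can shadow the other; that is a plausible explanation for the error, not a confirmed diagnosis.

```python
from typing import Iterable, List

# Distribution names that both ship a top-level `pdfminer` package.
PDFMINER_VARIANTS = ("pdfminer", "pdfminer.six")


def find_pdfminer_conflict(freeze_lines: Iterable[str]) -> List[str]:
    """Return the pinned pdfminer variants if more than one is present.

    Hypothetical helper: expects lines in `name==version` form and returns
    an empty list when zero or one variant is listed.
    """
    found = []
    seen_names = set()
    for line in freeze_lines:
        line = line.strip()
        name = line.split("==", 1)[0].lower()
        if name in PDFMINER_VARIANTS and name not in seen_names:
            seen_names.add(name)
            found.append(line)
    return found if len(found) > 1 else []


# Excerpt of the freeze output posted above.
freeze = ["pdf2image==1.16.0", "pdfminer==20191125",
          "pdfminer.six==20220319", "pdfplumber==0.6.2"]
print(find_pdfminer_conflict(freeze))
```

An empty result means at most one variant is pinned; a non-empty result lists the lines to reconcile in the Pipfile.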

has-abi commented 2 years ago

Yes, I did remove pdfminer and used pdfminer.six instead in the same project with pdfplumber 0.6.2 and it works for both environments!

Thanks @jsvine

jsvine commented 2 years ago

Ah, great — thanks for following up!

Hoyuri commented 2 years ago

Hi, I ran into the same issue here. I uninstalled pdfminer.six and reinstalled it, and the bug was fixed. The error occurs because the LTChar object does not have the attribute graphicstate; reinstalling fixes it.
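
Before uninstalling and reinstalling, it may help to confirm which pdfminer distributions the active interpreter actually sees. The following is a sketch only, assuming Python 3.8 or later (for `importlib.metadata`); the function names `installed_pdfminer_variants` and `needs_cleanup` are hypothetical.

```python
from importlib import metadata
from typing import Dict


def installed_pdfminer_variants() -> Dict[str, str]:
    """Map each installed pdfminer distribution name to its version."""
    versions = {}
    for name in ("pdfminer", "pdfminer.six"):
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this distribution is not installed in the environment
    return versions


def needs_cleanup(versions: Dict[str, str]) -> bool:
    """True when the legacy `pdfminer` distribution is present at all."""
    return "pdfminer" in versions


found = installed_pdfminer_variants()
print(found, "-> remove legacy pdfminer" if needs_cleanup(found) else "-> ok")
```

Run it with the same interpreter that runs the failing script (for example, inside the container), since the local and deployed environments can differ.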

timsanders256 commented 2 years ago

Thanks to Hoyuri! It works like magic

Abe-Z-2022 commented 1 year ago

Thanks for the help. I'm facing the same issue. Everything was working well for my script until I decided to pip install pdfminer through my local PyCharm terminal (working in a virtual environment). Before that, I had installed both pdfminer and pdfminer.six through the PyCharm packages interface. But ever since I pip installed it again through the terminal, issues have been abundant. For a whole day PyCharm couldn't identify my pdfminer (although I had uninstalled it from PyCharm and/or deleted the lib folder directly from my computer). Now, after 24 hours and a few computer resets, pdfminer is working again, apart from the error 'LTChar' object has no attribute 'graphicstate'. I just don't want to uninstall/delete again and get back to ground zero. Would love to hear your thoughts.

jsvine commented 1 year ago

Hi @Abe-Z-2022, without having more information/access to your setup, it's hard to definitively diagnose your issue. In general, however, I'd strongly recommend installing pdfplumber (as well as pretty much any Python library) in a Python virtual environment.
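
One way to follow that recommendation from Python itself is the standard-library `venv` module. This is a minimal sketch; the helper name `create_env` and the `.venv` directory name are assumptions for illustration.

```python
import venv
from pathlib import Path


def create_env(env_dir: str, with_pip: bool = True) -> Path:
    """Create a virtual environment at `env_dir` and return its config file path."""
    # venv.create builds the directory layout and, if requested, bootstraps pip.
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir) / "pyvenv.cfg"
```

After calling `create_env(".venv")`, activate the environment and install pdfplumber inside it. pdfminer.six is pulled in as pdfplumber's dependency, so the older pdfminer package does not need to be installed separately.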