jennis0 / pdf2vtt

MIT License

PDFMiner missing 2 required positional arguments: 'ncs' and 'graphicstate' #1

Closed: jshank closed this issue 1 year ago

jshank commented 1 year ago

I'm unable to run this script. Installation required me to change a few things.

jshank commented 1 year ago

Strangely, it's now kind of working. I experimented with the package versions some more and landed on the following:

(env) jshank@shanknas:~/pdf2vtt$ pip list
Package                  Version
------------------------ ----------
ansiwrap                 0.8.4
boto3                    1.17.95
botocore                 1.20.112
certifi                  2022.12.7
cffi                     1.15.1
chardet                  5.1.0
charset-normalizer       2.1.1
click                    8.1.3
contextlib2              21.6.0
cryptography             39.0.0
cycler                   0.11.0
editdistance             0.6.0
filelock                 3.9.0
huggingface-hub          0.11.1
idna                     3.4
jmespath                 0.10.0
joblib                   1.2.0
kiwisolver               1.4.4
matplotlib               3.4.2
nltk                     3.8.1
numpy                    1.24.1
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
opencv-python            4.5.5.64
packaging                22.0
pdf2image                1.15.1
pdfminer                 20191125
pdfminer.six             20201018
Pillow                   9.1.0
pip                      22.3.1
pycparser                2.21
pycryptodome             3.16.0
pyparsing                3.0.9
pytesseract              0.3.8
python-dateutil          2.8.2
PyYAML                   6.0
regex                    2022.10.31
requests                 2.28.1
s3transfer               0.4.2
schema                   0.7.4
scikit-learn             1.0.2
scipy                    1.9.3
sentence-transformers    2.1.0
sentencepiece            0.1.97
setuptools               65.6.3
six                      1.16.0
sortedcontainers         2.4.0
textwrap3                0.9.2
threadpoolctl            3.1.0
tokenizers               0.13.2
torch                    1.13.1
torchvision              0.14.1
tqdm                     4.61.2
transformers             4.25.1
typing_extensions        4.4.0
urllib3                  1.26.13
wheel                    0.38.4

jennis0 commented 1 year ago

Apologies that it took me so long to see this; I'd taken some time off from development. Glad to see it seems to have resolved itself. I believe this issue might be specifically down to versioning between pdfminer and pdfminer.six, both of which appear in your package list.
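For anyone hitting the same error, here is a rough sketch of how to check whether both distributions are installed in the active environment. It assumes the clash comes from `pdfminer` and `pdfminer.six` both providing a top-level `pdfminer` package. The helper names `installed_versions` and `has_conflict` are made up for illustration and are not part of pdf2vtt.

```python
from importlib import metadata

# Assumption: these two distributions both install a top-level package
# named `pdfminer`, so having them side by side is the case to flag.
CONFLICTING = ("pdfminer", "pdfminer.six")


def installed_versions(names=CONFLICTING):
    """Return {distribution name: version} for each name that is installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; skip it.
            pass
    return found


def has_conflict(found):
    """True when every distribution in CONFLICTING is present in `found`."""
    return all(name in found for name in CONFLICTING)


if __name__ == "__main__":
    found = installed_versions()
    for name, version in found.items():
        print(f"{name}=={version}")
    if has_conflict(found):
        print("Both pdfminer and pdfminer.six are installed; "
              "consider keeping only one of them.")
```

If the check reports both, uninstalling them and reinstalling only the one the project needs is one thing to try, though I haven't confirmed that this is the root cause here.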

I'm currently writing a complete replacement for the PDF extraction component of this pipeline that's both faster and more performant, which will remove the pdfminer dependency altogether.