Closed by jshank 1 year ago
Strangely, now it's kind of working. I messed about with the package versions some more and landed on the following:
(env) jshank@shanknas:~/pdf2vtt$ pip list
Package Version
------------------------ ----------
ansiwrap 0.8.4
boto3 1.17.95
botocore 1.20.112
certifi 2022.12.7
cffi 1.15.1
chardet 5.1.0
charset-normalizer 2.1.1
click 8.1.3
contextlib2 21.6.0
cryptography 39.0.0
cycler 0.11.0
editdistance 0.6.0
filelock 3.9.0
huggingface-hub 0.11.1
idna 3.4
jmespath 0.10.0
joblib 1.2.0
kiwisolver 1.4.4
matplotlib 3.4.2
nltk 3.8.1
numpy 1.24.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
opencv-python 4.5.5.64
packaging 22.0
pdf2image 1.15.1
pdfminer 20191125
pdfminer.six 20201018
Pillow 9.1.0
pip 22.3.1
pycparser 2.21
pycryptodome 3.16.0
pyparsing 3.0.9
pytesseract 0.3.8
python-dateutil 2.8.2
PyYAML 6.0
regex 2022.10.31
requests 2.28.1
s3transfer 0.4.2
schema 0.7.4
scikit-learn 1.0.2
scipy 1.9.3
sentence-transformers 2.1.0
sentencepiece 0.1.97
setuptools 65.6.3
six 1.16.0
sortedcontainers 2.4.0
textwrap3 0.9.2
threadpoolctl 3.1.0
tokenizers 0.13.2
torch 1.13.1
torchvision 0.14.1
tqdm 4.61.2
transformers 4.25.1
typing_extensions 4.4.0
urllib3 1.26.13
wheel 0.38.4
Apologies that it took me so long to see this; I'd taken some time off from development. Glad to see it seems to have resolved. I believe the issue is likely a version conflict between pdfminer and pdfminer.six, both of which are installed in the environment you listed.
I'm currently writing a complete replacement for the PDF extraction component of this pipeline that's faster and removes the pdfminer dependency altogether.
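In the meantime, a quick way to check for the suspected conflict: the legacy pdfminer distribution and pdfminer.six both install a top-level package named `pdfminer`, so having both in one environment can leave a mix of files from each. The snippet below is only a rough sketch, not part of this project; the helper names `installed_version` and `find_pdfminer_conflict` are made up for illustration, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_pdfminer_conflict(lookup=installed_version):
    """Return {name: version} for both distributions if both are installed.

    Returns an empty dict when at most one of them is present.
    `lookup` is injectable so the check can be exercised without
    touching the real environment (hypothetical helper, for illustration).
    """
    names = ("pdfminer", "pdfminer.six")
    present = {}
    for name in names:
        version = lookup(name)
        if version is not None:
            present[name] = version
    return present if len(present) == len(names) else {}


if __name__ == "__main__":
    conflict = find_pdfminer_conflict()
    if conflict:
        print("Both distributions are installed:", conflict)
        print("Consider uninstalling both, then reinstalling only pdfminer.six.")
    else:
        print("No pdfminer / pdfminer.six overlap detected.")
```

Run inside the virtualenv, this should flag an environment like the one listed above, where pdfminer 20191125 and pdfminer.six 20201018 sit side by side. Whether removing the legacy package actually fixes the original error is still a guess on my part.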
I'm unable to run this script. Installation required me to change a few things: I used opencv_python==4.5.5.64, as pip was unable to install 4.5.2.52, and ran pip install numpy --upgrade.