Closed seanccho closed 3 years ago
Thanks for this. It appears that the maintainers of pdfminer3k have removed that version from PyPI. This version does, however, still exist in the Git repository. Try `python -m pip install https://github.com/jaepil/pdfminer3k/archive/1.0.4.zip`, or replace the line in the `requirements.txt` file containing `pdfminer3k==1.0.4` with `https://github.com/jaepil/pdfminer3k/archive/1.0.4.zip`.
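The `requirements.txt` edit described above could be scripted roughly as follows. This is only a sketch: the helper name `swap_pin` and the neighbouring entries in the usage note are hypothetical, and it assumes the pin sits on a line of its own.

```python
# Hypothetical helper: swap the removed PyPI pin for the GitHub archive URL.
PINNED = "pdfminer3k==1.0.4"
ARCHIVE_URL = "https://github.com/jaepil/pdfminer3k/archive/1.0.4.zip"


def swap_pin(requirements_text: str) -> str:
    """Return the requirements text with the pinned line replaced by the URL."""
    lines = requirements_text.splitlines()
    # Only an exact match of the pin (ignoring surrounding whitespace) is replaced.
    swapped = [ARCHIVE_URL if line.strip() == PINNED else line for line in lines]
    trailing_newline = "\n" if requirements_text.endswith("\n") else ""
    return "\n".join(swapped) + trailing_newline
```

For example, `swap_pin("numpy\npdfminer3k==1.0.4\n")` leaves the made-up `numpy` line alone and replaces only the pin. The result can be written back to `requirements.txt` and installed with `python -m pip install -r requirements.txt`.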
Thank you for the update - I'll try replacing `pdfminer3k==1.0.4` with `https://github.com/jaepil/pdfminer3k/archive/1.0.4.zip`.
On a slightly separate topic, do you guys have an estimate of how much it would cost to run the full-text extraction on any of the cloud computing providers (basically running "400 core-hours using two Intel Xeon E5-2600 CPUs")?
Well, on Azure a machine with 8 vCPUs, 16 GB of RAM, a 1 TB SSD (this process is I/O bound), and 20 GB of data egress (you pay for data to leave the cloud but not to enter) costs $120/month. Naively, the computation will take 50 hours (400 core-hours spread over 8 vCPUs), so conservatively assuming 4 days of actual usage, I estimate about $50. You'll have to play with the calculators and see for yourself. If you've got a desktop machine with 8 cores, you may want to just use that and wait a week or so.
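The arithmetic behind that estimate can be sketched as below. The function names are made up for illustration, and the 30-day month is an assumption. Simple prorating of the monthly price gives a lower figure than the roughly $50 quoted, which presumably leaves headroom for setup time and other charges, so treat the result as a lower bound rather than a quote.

```python
def wall_clock_hours(core_hours: float, cores: int) -> float:
    """Naive wall-clock time, assuming the work parallelizes perfectly."""
    return core_hours / cores


def prorated_cost(monthly_price: float, days_used: float,
                  days_per_month: int = 30) -> float:
    """Share of a monthly price for the days used (assumes a 30-day month)."""
    return monthly_price * days_used / days_per_month


hours = wall_clock_hours(400, 8)   # 400 core-hours on 8 vCPUs
lower_bound = prorated_cost(120, 4)  # 4 days of a $120/month machine
print(f"{hours:.0f} hours of wall-clock time, at least ${lower_bound:.0f}")
```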
When I tried installing `pdfminer3k==1.0.4`, it couldn't find the version. Instead, it listed versions 1.3.2, 1.3.3, 1.3.4. So I ran `convert_directory_parallel` with 1.3.4, and I'm getting an error saying "[Errno 2] No such file or directory: 'pdf2txt.py'" where "pdf2txt.py" is defined here: https://github.com/mattbierbaum/arxiv-public-datasets/blob/f0b8a4fd17e7aeed38465ec00a63eb219fe1672e/arxiv_public_data/fulltext.py#L18
Is that the correct package/version to use?
I came across this package, which seems to be derived from the same parent project: https://github.com/pdfminer/pdfminer.six. It comes with "pdf2txt.py", and I am curious whether this version of pdfminer would be compatible with `convert_directory_parallel` at https://github.com/mattbierbaum/arxiv-public-datasets/blob/f0b8a4fd17e7aeed38465ec00a63eb219fe1672e/arxiv_public_data/fulltext.py#L270 (and eventually `run_pdf2text` at https://github.com/mattbierbaum/arxiv-public-datasets/blob/f0b8a4fd17e7aeed38465ec00a63eb219fe1672e/arxiv_public_data/fulltext.py#L57).
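One quick way to narrow down the "[Errno 2]" error is to check whether a `pdf2txt.py` executable is visible on the `PATH` of the environment running the conversion. The sketch below assumes the error comes from launching `pdf2txt.py` as an external command, which the thread does not confirm; `find_executable` is a hypothetical helper name.

```python
import shutil
from typing import Optional


def find_executable(name: str = "pdf2txt.py") -> Optional[str]:
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_executable()
if path is None:
    print("pdf2txt.py is not on PATH in this environment")
else:
    print(f"pdf2txt.py found at {path}")
```

If the script is not found, the installed package either does not ship it or installed it somewhere outside the active environment's `PATH`.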