mattbierbaum / arxiv-public-datasets

A set of scripts to grab public datasets from resources related to arXiv
https://arxiv.org/abs/1905.00075
MIT License

No matching distribution found for pdfminer3k==1.0.4 #11

Closed seanccho closed 3 years ago

seanccho commented 3 years ago

When I tried installing pdfminer3k==1.0.4, pip couldn't find that version. Instead, it listed versions 1.3.2, 1.3.3, and 1.3.4.

So I ran convert_directory_parallel with 1.3.4, and I'm getting an error saying "[Errno 2] No such file or directory: 'pdf2txt.py'" where "pdf2txt.py" is defined here: https://github.com/mattbierbaum/arxiv-public-datasets/blob/f0b8a4fd17e7aeed38465ec00a63eb219fe1672e/arxiv_public_data/fulltext.py#L18

Is that the correct package/version to use?

I came across this package, which seems to be derived from the same parent project: https://github.com/pdfminer/pdfminer.six. It comes with "pdf2txt.py", and I'm curious whether this version of pdfminer would be compatible with convert_directory_parallel at https://github.com/mattbierbaum/arxiv-public-datasets/blob/f0b8a4fd17e7aeed38465ec00a63eb219fe1672e/arxiv_public_data/fulltext.py#L270 (and eventually run_pdf2text at https://github.com/mattbierbaum/arxiv-public-datasets/blob/f0b8a4fd17e7aeed38465ec00a63eb219fe1672e/arxiv_public_data/fulltext.py#L57).
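
An "[Errno 2] No such file or directory: 'pdf2txt.py'" error usually means the script could not be found when it was launched by name. A minimal sketch for checking whether an installed package put pdf2txt.py on the PATH, assuming fulltext.py invokes it as an external command (the helper name find_pdf2txt is hypothetical, not part of the repository):

```python
import shutil


def find_pdf2txt(name="pdf2txt.py"):
    """Return the full path of the named script on PATH, or None if absent.

    find_pdf2txt is a hypothetical helper for illustration only.
    """
    return shutil.which(name)


path = find_pdf2txt()
if path is None:
    # No installed package has placed the script on PATH.
    print("pdf2txt.py is not on PATH")
else:
    print(f"pdf2txt.py found at {path}")
```

If this reports that the script is missing, the installed pdfminer variant may not ship pdf2txt.py as a command-line script, or its scripts directory may not be on the PATH.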

colinclement commented 3 years ago

Thanks for this. It appears that the maintainers of pdfminer3k have removed that version from PyPI. This version does, however, still exist in the Git repository. Try python -m pip install https://github.com/jaepil/pdfminer3k/archive/1.0.4.zip, or replace the line in the requirements.txt file containing pdfminer3k==1.0.4 with https://github.com/jaepil/pdfminer3k/archive/1.0.4.zip.
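
The requirements.txt edit described above can be sketched as a small script. This is only an illustration under stated assumptions: the helper name swap_requirement is hypothetical, and it assumes the pinned version sits on a line of its own in requirements.txt.

```python
OLD_SPEC = "pdfminer3k==1.0.4"
NEW_SPEC = "https://github.com/jaepil/pdfminer3k/archive/1.0.4.zip"


def swap_requirement(requirements_text, old_spec=OLD_SPEC, new_spec=NEW_SPEC):
    """Replace any line that equals old_spec with new_spec.

    swap_requirement is a hypothetical helper, not part of the repository.
    All other lines are passed through unchanged.
    """
    lines = []
    for line in requirements_text.splitlines():
        lines.append(new_spec if line.strip() == old_spec else line)
    return "\n".join(lines) + "\n"


# Example on an in-memory string; to apply it for real, read requirements.txt,
# pass its contents through swap_requirement, and write the result back.
print(swap_requirement("pdfminer3k==1.0.4\n"), end="")
```

Afterwards, python -m pip install -r requirements.txt should fetch the 1.0.4 archive directly from GitHub rather than from PyPI.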

seanccho commented 3 years ago

Thank you for the update - I'll try replacing pdfminer3k==1.0.4 with https://github.com/jaepil/pdfminer3k/archive/1.0.4.zip

On a slightly separate topic, do you have an estimate of how much it would cost to run the full-text extraction on one of the cloud computing providers? (Basically, running "400 core-hours using two Intel Xeon E5-2600 CPUs".)

colinclement commented 3 years ago

On Azure, a machine with 8 vCPUs, 16GB of RAM, a 1TB SSD (this process is IO bound), and 20GB of data egress (you pay for data to leave the cloud but not to enter it) costs $120/month. Run naively, the computation will take about 50 hours (400 core-hours spread across 8 vCPUs), so conservatively assuming 4 days of actual usage, I estimate about $50. You'll have to play with the pricing calculators and see for yourself. If you've got a desktop machine with 8 cores, you may want to just use that and wait a week or so.
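
The arithmetic behind the estimate above can be redone for other machine sizes with a few lines of Python. This is a rough sketch, not a pricing tool: the function names are hypothetical, it assumes perfect scaling across cores, and it assumes a 30-day billing month.

```python
def wall_clock_hours(core_hours, cores):
    """Hours of wall-clock time, assuming the work scales perfectly across cores."""
    return core_hours / cores


def prorated_cost(monthly_price, days_used, days_per_month=30):
    """Share of a monthly price for the days used (30-day month is an assumption)."""
    return monthly_price * days_used / days_per_month


hours = wall_clock_hours(400, 8)   # 400 core-hours on 8 vCPUs
cost = prorated_cost(120, 4)       # $120/month machine used for 4 days
print(f"{hours:.0f} hours of compute, ${cost:.2f} prorated")
```

The prorated figure covers only the quoted monthly price for the days used, so it comes out lower than the $50 estimate, which is deliberately conservative.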