I would be willing to commit some time and write the S3 requester-pays code if that would help.
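For concreteness, here is a minimal sketch of what I have in mind, using boto3 against arXiv's requester-pays bucket (the bucket name and key layout follow arXiv's bulk-data documentation, but the specific tarball key below is just an illustrative example, not something from this repo):

```python
# Minimal sketch: download one bulk-data tarball from arXiv's requester-pays
# S3 bucket with boto3. The bucket/key names follow arXiv's bulk-data docs;
# the specific tarball key below is only an illustrative example.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.download_file(
    Bucket="arxiv",
    Key="pdf/arXiv_pdf_1001_001.tar",        # example key; real keys come from the manifest
    Filename="arXiv_pdf_1001_001.tar",
    ExtraArgs={"RequestPayer": "requester"},  # requester pays: we cover the transfer costs
)
```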
Take a look at the code here: https://github.com/ranihorev/scihive-backend/blob/master/tasks/fetch_papers.py
We're fetching 200 papers at a time and then sleeping for 5 seconds.
Let me know if you have any questions.
> We're fetching 200 papers at a time and then sleeping for 5 seconds.
I saw that here:
```python
parser.add_argument('--results-per-iteration', type=int, default=200, help='passed to arxiv API')
parser.add_argument('--wait-time', type=float, default=5.0, help='lets be gentle to arxiv API (in number of seconds)')
```
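In other words, the initial load amounts to a rate-limited loop like the sketch below (simplified, not the actual fetch_papers.py code; it assumes the arXiv export API and the feedparser package, and the search query is just an example):

```python
# Simplified sketch of a rate-limited fetch loop against the arXiv export API
# (not the actual fetch_papers.py code; requires the `feedparser` package).
import time
import urllib.request

import feedparser

BASE_URL = "http://export.arxiv.org/api/query?search_query=cat:cond-mat.str-el&sortBy=lastUpdatedDate"
results_per_iteration = 200   # --results-per-iteration
wait_time = 5.0               # --wait-time

for start in range(0, 1000, results_per_iteration):
    url = f"{BASE_URL}&start={start}&max_results={results_per_iteration}"
    with urllib.request.urlopen(url) as response:
        feed = feedparser.parse(response.read())
    for entry in feed.entries:
        print(entry.id, entry.title)
    time.sleep(wait_time)     # be gentle to the arXiv API
```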
If I'm not mistaken, this is similar to what the original arxiv-sanity-preserver fetch_papers code does. As I mentioned above, my instance got banned after approximately 1000 PDFs downloaded.
But even if it did not get banned at around 1K, it just doesn't scale, does it? I have downloaded the metadata for cond-mat; since the year 2000 alone that is about 200k articles!
I have now downloaded all the arXiv papers via S3 (~1 million articles, 1.3 TB) and decided to generate full text and thumbnails for all those PDFs. This is done via a shell script. I guess I will have to tweak the arxiv-sanity-preserver or scihive-backend code a bit so it can fetch those from disk instead.
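For anyone curious, the per-PDF processing can be sketched in Python roughly like this (assuming poppler-utils' pdftotext and pdftoppm are installed; the directory names are placeholders, not my actual layout):

```python
# Sketch of the per-PDF processing: extract full text and render a first-page
# thumbnail for every locally stored PDF. Assumes poppler-utils (pdftotext,
# pdftoppm) is installed; directory names are placeholders.
import subprocess
from pathlib import Path

PDF_DIR = Path("data/pdf")
TXT_DIR = Path("data/txt")
THUMB_DIR = Path("data/thumbs")
TXT_DIR.mkdir(parents=True, exist_ok=True)
THUMB_DIR.mkdir(parents=True, exist_ok=True)

for pdf in PDF_DIR.glob("*.pdf"):
    # Full text for indexing/search.
    subprocess.run(["pdftotext", str(pdf), str(TXT_DIR / (pdf.stem + ".txt"))], check=False)
    # First page only, scaled down, saved as a PNG thumbnail.
    subprocess.run(
        ["pdftoppm", "-png", "-f", "1", "-l", "1", "-scale-to", "200",
         str(pdf), str(THUMB_DIR / pdf.stem)],
        check=False,
    )
```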
Good luck! Let us know if we can help with anything
I have tried out the arxiv-sanity-preserver initial load, which simply scrapes arXiv, and (as expected) got banned after approximately 1000 articles downloaded. The issue https://github.com/karpathy/arxiv-sanity-preserver/issues/58 remains closed but unresolved there.
Now I have seen your front-end, https://www.scihive.org/home, which is much more advanced! So I just want to make sure the same ban will not happen here.
How do you handle the initial load in this project?
I see a function download_source_file, but it seems to fetch data from arXiv online as well, correct? arXiv urges users to use the tarballs from its S3 bucket instead: https://arxiv.org/help/bulk_data_s3. Do you have a solution for this in your code, or is it a non-issue for you for some other reason? It would be helpful to understand your approach. Thanks!
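(For context, the bulk-data route would look roughly like the sketch below: first fetch the manifest from the requester-pays bucket, then download the tarballs it lists. The manifest key and XML field names are my reading of arXiv's docs and should be double-checked.)

```python
# Sketch: fetch arXiv's bulk-data PDF manifest from the requester-pays bucket
# and list the tarball keys it references. The manifest key and XML tag names
# are taken from arXiv's bulk-data docs and should be double-checked.
import xml.etree.ElementTree as ET

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

obj = s3.get_object(
    Bucket="arxiv",
    Key="pdf/arXiv_pdf_manifest.xml",
    RequestPayer="requester",
)
manifest = ET.fromstring(obj["Body"].read())

for f in manifest.iter("file"):
    filename = f.findtext("filename")   # e.g. pdf/arXiv_pdf_1001_001.tar
    size = f.findtext("size")
    print(filename, size)
```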