ranihorev / scihive-backend


How to do an initial load of the articles and not get banned by arXiv? #4

Closed · Dabbrivia closed this issue 4 years ago

Dabbrivia commented 4 years ago

I tried the arxiv_sanity_preserver initial load, which simply scrapes arXiv, and (as expected) got banned after roughly 1,000 downloaded articles. The issue https://github.com/karpathy/arxiv-sanity-preserver/issues/58 remains unresolved/closed there.

Now I have seen your front-end, https://www.scihive.org/home, which is much more advanced! So I just want to make sure the same ban won't happen here.

How do you handle the initial load in this project?

I see a function download_source_file, but it seems to fetch data from arXiv online as well, correct? arXiv urges users to use its tarballs from the S3 bucket instead: https://arxiv.org/help/bulk_data_s3. Do you have a solution for this in your code, or is it a non-issue for you for some other reason? It would be helpful to understand your approach. Thanks!

Dabbrivia commented 4 years ago

I would be willing to commit some time and write the S3 requester-pays code if that would help.
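For reference, here's a minimal sketch of what that requester-pays download might look like with boto3, based on the bucket and prefixes described in the bulk-data docs linked above (bucket name, region, and key layout are assumptions to verify against that page; you pay the transfer costs):

import boto3

# Sketch only: list and download arXiv source tarballs from the requester-pays bucket.
s3 = boto3.client('s3', region_name='us-east-1')

# list_objects_v2 returns at most 1,000 keys per call; paginate with ContinuationToken for the full listing.
resp = s3.list_objects_v2(Bucket='arxiv', Prefix='src/', RequestPayer='requester')
for obj in resp.get('Contents', []):
    key = obj['Key']  # e.g. a monthly tarball under src/
    s3.download_file('arxiv', key, key.split('/')[-1],
                     ExtraArgs={'RequestPayer': 'requester'})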

ranihorev commented 4 years ago

Take a look at the code here: https://github.com/ranihorev/scihive-backend/blob/master/tasks/fetch_papers.py

We're fetching 200 papers at a time and then sleeping for 5 seconds.
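Roughly, the loop in that file follows this pattern (a simplified sketch, not the exact code; the example category is arbitrary):

import time
import urllib.request

base_url = 'http://export.arxiv.org/api/query?search_query=cat:cs.LG&start=%d&max_results=%d'
results_per_iteration = 200
wait_time = 5.0

for start in range(0, 2000, results_per_iteration):
    with urllib.request.urlopen(base_url % (start, results_per_iteration)) as resp:
        feed = resp.read()  # Atom XML; the real script parses this and stores the entries
    time.sleep(wait_time)  # be gentle to the arXiv API between batches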

Let me know if you have any questions.

Dabbrivia commented 4 years ago

We're fetching 200 papers at a time and then sleeping for 5 seconds.

I saw that here:

parser.add_argument('--results-per-iteration', type=int, default=200, help='passed to arxiv API')
parser.add_argument('--wait-time', type=float, default=5.0, help='lets be gentle to arxiv API (in number of seconds)')

If I'm not mistaken, this is similar to what the original arxiv-sanity-preserver fetch_papers code does. As I mentioned above, my instance got banned after approximately 1,000 PDFs downloaded.

But even if it didn't get banned at 1K, it just doesn't scale, does it? I have downloaded the metadata for cond-mat; since the year 2000 that's about 200k articles!

Currently I have downloaded all the arXiv papers via S3 (~1 million articles, ~1.3 TB) and decided to just generate full text and thumbnails for all those PDFs via a shell script. I guess I will have to tweak the arxiv-sanity-preserver or scihive-backend code a bit so it can fetch those from disk.
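In case it's useful to anyone, the per-PDF step of that script is roughly equivalent to the following Python sketch (assumes pdftotext and pdftoppm from poppler-utils are installed; the directory names are placeholders):

import subprocess
from pathlib import Path

pdf_dir = Path('pdfs')     # local dump fetched from S3
out_dir = Path('derived')
out_dir.mkdir(exist_ok=True)

for pdf in pdf_dir.glob('*.pdf'):
    txt = out_dir / (pdf.stem + '.txt')
    thumb = out_dir / (pdf.stem + '-thumb')  # pdftoppm appends the page number and extension
    # extract full text
    subprocess.run(['pdftotext', str(pdf), str(txt)], check=False)
    # render page 1 as a small PNG thumbnail
    subprocess.run(['pdftoppm', '-png', '-f', '1', '-l', '1', '-scale-to', '200',
                    str(pdf), str(thumb)], check=False)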

ranihorev commented 4 years ago

Good luck! Let us know if we can help with anything