Make ETL utilize mirror sites

paul-tqh-nguyen / arxiv_as_a_newspaper

arxiv.org portrayed as if it were a newspaper.


Closed paul-tqh-nguyen closed 5 years ago

paul-tqh-nguyen commented 5 years ago

Instead of scraping only the main arXiv site, which will get us throttled, let's spread our requests across their mirror sites as well.

Let's also wait ~30s before we hit a URL so we can avoid being throttled.

Let's also mention that in our README so that no one gets upset that it's really slow.

paul-tqh-nguyen commented 5 years ago

This patch makes it so that when we're scraping arXiv, we also hit their mirror sites, so that we don't get blocked after a while.

It also makes us wait some number of seconds before each request, which slows down how often we're hitting URLs.
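For anyone reading along, here's a rough sketch of the idea, not the actual patch. The `MirrorRotator` class, its parameter names, and the `*.example.org` hosts are all made up for illustration; the real mirror hostnames and the ETL's fetch code aren't shown here. It just rewrites the host of each URL to the next mirror in a cycle and sleeps until the minimum delay has passed since the previous request.

```python
import itertools
import time
from urllib.parse import urlsplit, urlunsplit


class MirrorRotator:
    """Hypothetical helper: cycle through mirror hosts with a minimum delay."""

    def __init__(self, hosts, delay_seconds=30.0,
                 clock=time.monotonic, sleep=time.sleep):
        if not hosts:
            raise ValueError("at least one host is required")
        self._hosts = itertools.cycle(hosts)  # round-robin over the mirrors
        self._delay = delay_seconds
        self._clock = clock    # injectable so the delay can be tested
        self._sleep = sleep
        self._last_request = None

    def next_url(self, url):
        """Wait out the remaining delay, then return url on the next mirror."""
        if self._last_request is not None:
            remaining = self._delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()
        parts = urlsplit(url)
        # Swap only the host; path, query, and fragment stay the same.
        return urlunsplit(parts._replace(netloc=next(self._hosts)))


if __name__ == "__main__":
    # Placeholder hosts, not real arXiv mirrors; short delay for the demo.
    rotator = MirrorRotator(["mirror-a.example.org", "mirror-b.example.org"],
                            delay_seconds=1.0)
    for _ in range(3):
        print(rotator.next_url("https://arxiv.org/list/cs.AI/recent"))
```

The scraper would then fetch whatever `next_url` returns instead of the original URL. A ~30s default matches the delay suggested above, and it's worth exposing as a setting since that's what makes the ETL slow.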

paul-tqh-nguyen commented 5 years ago

I believe that's all there is to be done for this ticket.