paul-tqh-nguyen / arxiv_as_a_newspaper

arxiv.org portrayed as if it were a newspaper.

Make ETL utilize mirror sites #6

Closed paul-tqh-nguyen closed 5 years ago

paul-tqh-nguyen commented 5 years ago

Scraping only the main arXiv site will get us throttled, so let's spread our requests across their mirror sites as well.

See https://arxiv.org/help/mirrors

Let's also wait ~30s before each URL request so we can avoid being throttled.

Let's also mention that in our README so that no one gets upset that it's really slow.

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/c60a2b286aec98ec34140e0aa6703da2f3b41a65

This patch makes it so that when we're scraping arXiv, we also hit their mirror sites so that we don't get blocked after a while.

This also adds a wait of some number of seconds before each URL request, which lowers our overall request rate.
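
For anyone reading along, here is a rough sketch of the idea (rotate across hosts, pause before each request). It is not the code from the commit above. The hosts in `MIRROR_HOSTS` other than `arxiv.org` are placeholders, and the names `rewrite_host`, `default_fetch`, and `polite_fetch_all` are made up for illustration. The 30 second default just follows the number suggested earlier in this ticket.

```python
import itertools
import time
import urllib.request
from typing import Callable, Iterable, List, Sequence, Tuple
from urllib.parse import urlsplit, urlunsplit

# Placeholder hosts (assumption): replace with the real mirrors listed at
# https://arxiv.org/help/mirrors before using this.
MIRROR_HOSTS = ["arxiv.org", "mirror-a.example.org", "mirror-b.example.org"]
DELAY_SECONDS = 30.0  # the ~30s wait suggested in this ticket


def rewrite_host(url: str, host: str) -> str:
    """Return url with its host swapped for the given host."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def default_fetch(url: str) -> str:
    """Download url and return the response body as text."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_fetch_all(
    urls: Iterable[str],
    hosts: Sequence[str] = MIRROR_HOSTS,
    delay: float = DELAY_SECONDS,
    fetch: Callable[[str], str] = default_fetch,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, str]]:
    """Fetch each URL from the next host in round-robin order,
    sleeping for `delay` seconds before every request."""
    host_cycle = itertools.cycle(hosts)
    results = []
    for url in urls:
        sleep(delay)  # slow down so no single host sees rapid requests
        target = rewrite_host(url, next(host_cycle))
        results.append((target, fetch(target)))
    return results
```

Passing `fetch` and `sleep` as parameters is only there so the rotation can be exercised without touching the network or actually waiting. With N hosts and a delay of D seconds, each individual host sees at most one request every N × D seconds.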

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/797c427bac6e398ab458dfafd028a8351e991af1

This patch updates the README to reflect the changes made in https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/c60a2b286aec98ec34140e0aa6703da2f3b41a65

paul-tqh-nguyen commented 5 years ago

I believe that's all there is to be done for this ticket.