Closed paul-tqh-nguyen closed 5 years ago
Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/c60a2b286aec98ec34140e0aa6703da2f3b41a65
This patch makes the scraper hit arXiv's mirror sites as well as the main site, so that we don't get blocked after a while.
It also waits some number of seconds before each request, which slows down how often we're hitting URLs.
Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/797c427bac6e398ab458dfafd028a8351e991af1
This patch updates the README to reflect the changes made in https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/c60a2b286aec98ec34140e0aa6703da2f3b41a65
I believe that's all there is to be done for this ticket.
Instead of scraping only the main arXiv site, which will get us throttled, let's hit their mirror sites as well.
See https://arxiv.org/help/mirrors
Let's also wait ~30s before we hit a URL so we can avoid being throttled.
Let's also mention that in our README so that no one gets upset that it's really slow.
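For illustration only, here is a minimal Python sketch of the idea above: rotate each request across a list of hosts and sleep before every request. This is not the code from the linked commits. The class name ThrottledMirrorFetcher, the helper with_host, and the example.org hostnames are all hypothetical placeholders, and the real mirror hosts would need to come from https://arxiv.org/help/mirrors. The 30 second default is taken from the "~30s" suggested above.

```python
import itertools
import time
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen

# Hypothetical placeholder hosts (assumption); replace them with real
# mirrors listed at https://arxiv.org/help/mirrors.
MIRROR_HOSTS = ["arxiv.org", "mirror-a.example.org", "mirror-b.example.org"]


def with_host(url, host):
    """Return url with its host swapped for host, keeping path and query."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


class ThrottledMirrorFetcher:
    """Round-robin over hosts, sleeping before every request."""

    def __init__(self, hosts=MIRROR_HOSTS, delay_seconds=30.0,
                 sleep=time.sleep, fetch=None):
        self._hosts = itertools.cycle(hosts)  # endless round-robin iterator
        self._delay_seconds = delay_seconds
        self._sleep = sleep                   # injectable so tests need not wait
        self._fetch = fetch or self._default_fetch

    @staticmethod
    def _default_fetch(url):
        # Plain standard-library GET; returns the response body as bytes.
        with urlopen(url) as response:
            return response.read()

    def get(self, url):
        # Wait first, then send the request to the next host in the rotation.
        self._sleep(self._delay_seconds)
        return self._fetch(with_host(url, next(self._hosts)))
```

With the defaults, every call to get() blocks for 30 seconds before fetching, so a long scrape will be slow by design. Passing a smaller delay_seconds trades that slowness for a higher risk of being throttled.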