Open michaeljohnclancy opened 5 years ago
The Collector class implemented in #17 partially solves this by staggering requests between different news sources. This means that requests are more spread out for each website, reducing the chance of being throttled.
Requests are still not threaded, however. Once we are able to request more than one article at a time, each source's Collector can run simultaneously, so each batch of article requests can span many different news sources. By contrast, if each source's articles were held in a list rather than produced by a generator, each batch of requests would go to a single website. The advantage of interleaving is that throttling is much less likely to kick in, since each website receives only one request at a time.
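
To make the batching idea concrete, here is a minimal sketch. It assumes each source can expose its article URLs as a generator; it does not use the actual Collector API from #17. The names `interleave`, `fetch_in_batches`, and the `fetch` callable are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import zip_longest

# Sentinel used to pad sources that run out of URLs early.
_MISSING = object()


def interleave(*url_generators):
    """Yield batches that contain at most one URL per source (round-robin)."""
    for group in zip_longest(*url_generators, fillvalue=_MISSING):
        yield [url for url in group if url is not _MISSING]


def fetch_in_batches(url_generators, fetch, max_workers=8):
    """Fetch one batch at a time, running the requests in a batch in parallel.

    `fetch` is a hypothetical callable taking a URL and returning the article.
    Because a batch holds at most one URL per source, no website receives
    more than one request at a time.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in interleave(*url_generators):
            # pool.map preserves input order; extend() consumes it fully,
            # so the next batch only starts once this one has finished.
            results.extend(pool.map(fetch, batch))
    return results
```

Because the sources are consumed lazily, a batch is only assembled when the previous one has finished, which is what keeps the load on each site to a single in-flight request.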
Sites like the BBC will throttle our requests if we request too many pages at once. We need to investigate how existing scrapers get around this.