mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

Implement Scrapy style concurrency controls? #39

Open philbudne opened 4 months ago

Currently, rss-fetcher limits the number of concurrent requests and enforces a minimum interval between connections to RSS servers.

My reading of Scrapy (and what I implemented in the queue-based story fetcher, under control of SCRAPY_LATENCY) is that it keeps a moving average of page fetch time for each destination server and uses AVG_FETCH_TIME / CONCURRENT_CONNECTION_GOAL as the connection interval.

The only time this might matter is when the server has been offline (down, or cut off from the Internet) and there is a large backlog of feeds overdue for fetching.
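
For illustration, here is a minimal Python sketch of the interval calculation described above. It is not Scrapy's code, nor the story fetcher's: the class name HostThrottle, the choice of an exponential moving average, the smoothing factor, and the min_interval floor are all assumptions made for the example.

```python
from urllib.parse import urlsplit


class HostThrottle:
    """Hypothetical per-host throttle (illustrative only).

    Keeps a moving average of fetch time per destination host and derives
    the connection interval as AVG_FETCH_TIME / CONCURRENT_CONNECTION_GOAL.
    """

    def __init__(self, alpha=0.25, goal=2.0, min_interval=0.5):
        self.alpha = alpha                # assumed EMA smoothing factor
        self.goal = goal                  # CONCURRENT_CONNECTION_GOAL
        self.min_interval = min_interval  # assumed floor, in seconds
        self.avg_fetch_time = {}          # host -> average fetch time (s)

    @staticmethod
    def _host(url):
        # Key the average by hostname; fall back to "" for malformed URLs.
        return urlsplit(url).hostname or ""

    def record(self, url, elapsed):
        """Fold one observed fetch time (seconds) into the host's average."""
        host = self._host(url)
        prev = self.avg_fetch_time.get(host)
        if prev is None:
            # First observation seeds the average.
            self.avg_fetch_time[host] = elapsed
        else:
            self.avg_fetch_time[host] = prev + self.alpha * (elapsed - prev)

    def interval(self, url):
        """Seconds to wait between connections to this URL's host."""
        avg = self.avg_fetch_time.get(self._host(url))
        if avg is None:
            # No history yet: use the floor.
            return self.min_interval
        return max(self.min_interval, avg / self.goal)
```

With a goal of 2, a host averaging 1 second per fetch would get a new connection roughly every 0.5 seconds, while a slow host averaging 4 seconds would be spaced about 2 seconds apart. Keeping a floor preserves the existing minimum-interval behavior for fast servers.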