mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

Prune old stories by count instead of age #25

Open philbudne opened 8 months ago

philbudne commented 8 months ago

Currently, old stories are pruned by date, so entries from slow/static feeds age out of the stories table and are then rediscovered as "new" articles on later fetches.

The fetch_events table is pruned to a fixed number of entries; doing the same for the stories table might avoid the rediscovery problem.
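A minimal sketch of what count-based pruning could look like, using an in-memory SQLite database so it runs standalone. The schema (`stories` with `id`, `feed_id`, `url`, `fetched_at`), the per-feed scope, and the function name `prune_stories_by_count` are assumptions for illustration, not the project's actual tables or code.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical stories table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE stories (
            id INTEGER PRIMARY KEY,
            feed_id INTEGER NOT NULL,
            url TEXT NOT NULL,
            fetched_at INTEGER NOT NULL  -- epoch seconds (assumed)
        )
        """
    )
    return conn


def prune_stories_by_count(conn, feed_id, keep):
    """Delete all but the `keep` newest stories for one feed.

    Returns the number of rows deleted. Age is never consulted, so a
    slow or static feed keeps its last `keep` entries indefinitely.
    """
    cur = conn.execute(
        """
        DELETE FROM stories
        WHERE feed_id = ?
          AND id NOT IN (
              SELECT id FROM stories
              WHERE feed_id = ?
              ORDER BY fetched_at DESC, id DESC
              LIMIT ?
          )
        """,
        (feed_id, feed_id, keep),
    )
    conn.commit()
    return cur.rowcount


conn = make_db()
# Feed 1 has five stories, feed 2 has two.
rows = [(1, f"https://example.com/a{i}", i) for i in range(5)]
rows += [(2, f"https://example.com/b{i}", i) for i in range(2)]
conn.executemany(
    "INSERT INTO stories (feed_id, url, fetched_at) VALUES (?, ?, ?)", rows
)
deleted = prune_stories_by_count(conn, feed_id=1, keep=3)
```

With a per-feed limit, a feed that never changes keeps all of its entries as long as it lists no more than `keep` URLs, so they are not rediscovered.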

philbudne commented 1 month ago

A recent thought: if the stories table had a "last_seen" column (updated each time the URL is found in a feed), we could use it to prevent aging out entries from unchanging feeds. We would need to compare story.last_seen to the last time new/different content was returned (http_last_modified?).

This would increase database write load, but would prevent duplicates generated every time a URL from a static feed is expired from the stories table.
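A rough sketch of the "last_seen" idea, again against an in-memory SQLite database. The column names (`first_seen`, `last_seen`), the helper names, and the exact expiry rule (expire only if the story is older than a cutoff and was last seen before the feed's content last changed) are assumptions about one way the comparison could work, not a description of existing code.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical stories table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE stories (
            feed_id INTEGER NOT NULL,
            url TEXT NOT NULL,
            first_seen INTEGER NOT NULL,  -- epoch seconds (assumed)
            last_seen INTEGER NOT NULL,
            PRIMARY KEY (feed_id, url)
        )
        """
    )
    return conn


def record_sighting(conn, feed_id, url, now):
    """Note that `url` appeared in the feed at time `now`.

    Returns True if the URL was not in the table (a "new" story),
    False if only last_seen was refreshed. The refresh is the extra
    write load mentioned above.
    """
    cur = conn.execute(
        "UPDATE stories SET last_seen = ? WHERE feed_id = ? AND url = ?",
        (now, feed_id, url),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO stories (feed_id, url, first_seen, last_seen) "
            "VALUES (?, ?, ?, ?)",
            (feed_id, url, now, now),
        )
        return True
    return False


def expire_stories(conn, feed_id, feed_last_changed, cutoff):
    """Delete stories older than `cutoff` that were last seen before
    the feed's content last changed (i.e. they have dropped out of it).

    Returns the number of rows deleted.
    """
    cur = conn.execute(
        "DELETE FROM stories "
        "WHERE feed_id = ? AND first_seen < ? AND last_seen < ?",
        (feed_id, cutoff, feed_last_changed),
    )
    return cur.rowcount
```

Under this rule, a story in a static feed always has `last_seen` at or after the feed's last change, so it is never expired and never rediscovered, while a story that has rotated out of a changing feed still ages out normally.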