bkil opened this issue 1 year ago
Hookshot typically checks the cache headers and ETags, and uses conditional requests so that the server can quickly no-op when there are no new changes. This is the preferred model because it means we can responsively show new content while allowing servers to avoid resending lots of data.
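As a rough illustration of that conditional-request model (not hookshot's actual code, just a sketch using the standard `fetch` API available in Node 18+; the validator names are assumptions), the poller would replay the validators it saw on the previous response:

```typescript
// Sketch: re-fetch a feed only if it changed since the last poll.
// `lastEtag` and `lastModified` are whatever the previous response advertised.
async function fetchFeedIfChanged(url: string, lastEtag?: string, lastModified?: string) {
    const headers: Record<string, string> = {};
    if (lastEtag) headers["If-None-Match"] = lastEtag;
    if (lastModified) headers["If-Modified-Since"] = lastModified;

    const res = await fetch(url, { headers });
    if (res.status === 304) {
        // Server confirmed nothing changed and skipped sending the body.
        return null;
    }
    return {
        body: await res.text(),
        etag: res.headers.get("ETag") ?? undefined,
        lastModified: res.headers.get("Last-Modified") ?? undefined,
    };
}
```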
There is probably room to make hookshot "learn" which feeds change faster than others and sort them into priority buckets; we've just not got that far into the implementation yet.
The present behaviour of hookshot is that it's configured to try to hit every feed within a certain time period (currently 10 minutes for integrations.ems.host), although given the increasing number of feeds to process and our current linear queue, the effective interval ends up being much longer.
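To make the bucket idea from the previous two paragraphs concrete, here is a minimal sketch of an interval-bucketed scheduler. The `FeedEntry` shape and `pollFeed()` callback are illustrative assumptions, not existing hookshot APIs:

```typescript
// Hypothetical sketch: group feeds by desired polling interval instead of
// pushing everything through one linear queue.
interface FeedEntry {
    url: string;
    intervalMs: number; // learned or configured polling interval
}

type Bucket = { intervalMs: number; feeds: FeedEntry[] };

function buildBuckets(feeds: FeedEntry[]): Bucket[] {
    const byInterval = new Map<number, FeedEntry[]>();
    for (const feed of feeds) {
        const list = byInterval.get(feed.intervalMs) ?? [];
        list.push(feed);
        byInterval.set(feed.intervalMs, list);
    }
    return [...byInterval.entries()].map(([intervalMs, bucketFeeds]) => ({ intervalMs, feeds: bucketFeeds }));
}

// Each bucket runs on its own timer, so slow or rarely-changing feeds no
// longer delay fast ones the way a single linear queue does.
function startScheduler(buckets: Bucket[], pollFeed: (url: string) => Promise<void>) {
    for (const bucket of buckets) {
        setInterval(async () => {
            for (const feed of bucket.feeds) {
                await pollFeed(feed.url).catch((err) => console.warn(`Failed to poll ${feed.url}`, err));
            }
        }, bucket.intervalMs);
    }
}
```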
Some hosts have caching headers set incorrectly, which might bloat bandwidth requirements on static backends. The subscriber has more information about whether this is the case and should be the one making an informed decision about overriding such headers (although the bot could also detect this heuristically).
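Hookshot does not currently expose such an override; purely as an illustration, a hypothetical per-subscription option might look like this:

```typescript
// Hypothetical per-subscription override, not an existing hookshot setting.
interface FeedSubscriptionOptions {
    url: string;
    // Ignore the server's Cache-Control/Expires hints and poll on our own schedule.
    ignoreServerCacheHeaders?: boolean;
    // Shortest interval the subscriber is willing to accept between polls.
    minPollIntervalMs?: number;
}

function effectivePollInterval(opts: FeedSubscriptionOptions, serverSuggestedMs?: number): number {
    const fallbackMs = 60 * 60 * 1000; // 1 hour default, per RSS best practice
    if (opts.ignoreServerCacheHeaders || serverSuggestedMs === undefined) {
        return opts.minPollIntervalMs ?? fallbackMs;
    }
    return Math.max(serverSuggestedMs, opts.minPollIntervalMs ?? 0);
}
```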
A dynamic RSS backend computes the feed on demand every time, so its caching headers will (or should) state that the response is not cacheable. It can also only tell whether the output has changed since a past invocation after it has finished processing, which is why If-None-Match and If-Modified-Since are not applicable either. This could of course be improved with a more elaborate architecture and good algorithms on the part of the feed host, but we all know that many volunteer-run FOSS projects lack such polish in their implementation.
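One low-effort way a feed host could still support If-None-Match is to hash the generated output and use the hash as an ETag. A minimal sketch using Node's built-in `crypto` and `http` modules follows; `generateFeed()` is an assumed placeholder for the expensive feed-building work. Note this only saves the body transfer, not the computation, which matches the point above: avoiding the computation itself would need the more elaborate architecture.

```typescript
import { createHash } from "node:crypto";
import { createServer } from "node:http";

// Placeholder for whatever expensive work actually builds the feed XML.
async function generateFeed(): Promise<string> {
    return '<rss version="2.0"><channel><title>example</title></channel></rss>';
}

createServer(async (req, res) => {
    const body = await generateFeed();
    // Hashing the finished output means identical content always yields the
    // same ETag, even though the generation cost was still paid.
    const etag = `"${createHash("sha256").update(body).digest("hex")}"`;
    if (req.headers["if-none-match"] === etag) {
        res.writeHead(304, { ETag: etag });
        res.end();
        return;
    }
    res.writeHead(200, { "Content-Type": "application/rss+xml", ETag: etag });
    res.end(body);
}).listen(8080);
```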
In the case of OpenStreetMap-based backends, caching for up to, say, 1 minute is reasonable, as that is the granularity of database replication. However, the subscribers themselves know better what latency they are willing to tolerate in order to play nicer with the given host. For example, real-time alerting definitely needs to poll every minute, but if we just want to chat about or learn about changes and act on a best-effort basis, hourly polling is sufficient.
If an RSS feed is generated on demand by a dynamic backend, each fetch can consume significant server resources. For such a feed, fetching only hourly can sometimes be adequate.
RSS best practice dictates that the fetching rate should adapt to how often the feed content changes, and failing that, default to 1 hour. Cache headers might be missing or set incorrectly.
In our experience, the old Matrix RSS bot ignored such advice and brute-forced every feed roughly once every 5 minutes.
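To make the adapt-to-change-frequency advice concrete, here is a rough sketch (illustrative names and arbitrary example factors, not an existing implementation) that widens the interval when polls see nothing new and tightens it when they do, clamped between the 1-minute floor and the 1-hour default discussed above:

```typescript
// Illustrative adaptive polling: back off while the feed is quiet,
// tighten again when it changes.
const MIN_INTERVAL_MS = 60 * 1000;          // 1 minute floor
const DEFAULT_INTERVAL_MS = 60 * 60 * 1000; // 1 hour default per best practice
const MAX_INTERVAL_MS = 60 * 60 * 1000;     // never wait longer than 1 hour

function nextPollInterval(currentMs: number, sawNewItems: boolean): number {
    const next = sawNewItems ? currentMs / 2 : currentMs * 2;
    return Math.min(MAX_INTERVAL_MS, Math.max(MIN_INTERVAL_MS, next));
}

// Example: a quiet feed stays at the 1 hour cap, while an active feed
// converges back toward the 1 minute floor.
let interval = DEFAULT_INTERVAL_MS;
interval = nextPollInterval(interval, false); // stays at 1 hour (already capped)
interval = nextPollInterval(interval, true);  // drops to 30 minutes
```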