gpodder / gpodder

The gPodder podcast client.
http://gpodder.org/
GNU General Public License v3.0
1.26k stars 204 forks source link

Support Cache-Control/Expires and skipHours #1627

Open DamonHD opened 2 months ago

DamonHD commented 2 months ago

I see one client in my logs downloading the podcast feed apparently unconditionally every 20 minutes. New content is added to the feed less than once a month, so several MB of traffic and hundreds of polls to achieve that is a little excessive IMHO.

[29/Apr/2024:13:24:54 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12856 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"
[29/Apr/2024:13:44:55 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12856 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"
[29/Apr/2024:14:05:12 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12856 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"
[29/Apr/2024:14:25:01 +0000] "GET /rss/podcast.rss HTTP/1.1" 200 12856 "-" "gPodder/3.11.1 (+http://gpodder.org/) Linux"

I have not seen a 304 (Not Modified) response to it for over a week, which may be to do with my turning off ETags to work around a 2008-vintage Apache server bug that I discovered interefering with other clients' conditional requests. (They are now working better.)

Do you support If-Modified-Since, and if not could you please add support for it?

This traffic would also be mitigated if gPodder would use the Cache-Control or Expires headers that I supply (of 4h+) to avoid a premature re-poll in the first place, or indeed the skipHours in the RSS feed itself!

Rgds

Damon

elelay commented 2 months ago

Yes, we support If-Modified-Since. Please provide a test url if possible, to debug... We don't support Cache-Control or Expires at the moment, only Etag/If-Modified-Since.

DamonHD commented 2 months ago

Test URL: https://www.earth.org.uk/rss/podcast.rss

Thanks for the response!

Damon

elelay commented 2 months ago

Tested with (unreleased) gPodder 3.11.4 as well as gPodder 3.11.1: I get the 304 on refresh after initial import.

DamonHD commented 2 months ago

Hi,

I see your 304 with 3.11.4, and with your 3.11.1, but still none from the other 3.11.1 client.

Do you have a configuration option to force unconditional fetches that that other user might have enabled?

Again, thanks for taking the time to try to shake this out.

Rgds

Damon

elelay commented 2 months ago

Hum... no config option: the Last-Modified and Etag headers are stored in DB alongside the podcast and relevant headers are added to the query if present. They might start from a fresh database each time?

DamonHD commented 2 months ago

Well I'm impressed by your super powers ... the refreshes stopped entirely yesterday evening! Thank you! B^>

Can we convert this into a request to honour Cache-Control: max-age and/or the RSS skipHours for a future release?

Rgds

Damon

elelay commented 2 months ago

You understand that I did absolutely nothing, right?

Also I'm not familiar with cache-control: what algorithm would you use on a client like gPodder? Compute an expiration date from the server's or the client's date and max-age? What about proxies in between?

SkipHours and SkipDays seem overkill, but maybe for solar-powered server skipHours would make sense. I don't know how widely it's used.

DamonHD commented 2 months ago

Hi,

I understand that you did nothing! But still the problem went away. B^>

Cache-control / Expires will already be implemented by 99%+ of the servers that your client pulls feeds from, so you would get a big improvement immediately in bandwidth use from implementing it. You should probably build in a few limits to deal with a small number of broken/misconfigured servers.

Basically a simple use of Cache-control max-age / Expires tells you the time before which you should not bother to poll again as there is unlikely to be a change in the feed. No poll is cheaper than the one you don't make!

(Proxies should already do the right things with these: it's a very common pathway used for all affected Web pages too.)

Rgds

Damon

DamonHD commented 2 months ago

One thing that I have just seen with another client (Spotify) is that it generated technically incorrect If-Modified-Since headers, eg:

If-Modified-Since: Tue, 7 May 2024 09:46:22 GMT

rather than:

If-Modified-Since: Tue, 07 May 2024 09:46:22 GMT

... later ...

Now actually I have found a gPodder header and the problem seems more subtle (extract from Apache mod_log_forensic output):

|GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-agent:gPodder/3.11.1 (+http%3a//gpodder.org/) Linux|Accept-Encoding:gzip, deflate, br|Accept:*/*|Connection:keep-alive|If-Modified-Since:Tue, 07 May 2024 09%3a46%3a07 GMT|If-None-Match:"239e-6168531446d9c"

This client is sending a fine If-Modified-Since, but is also sending an If-None-Match even though I haven't been generating any ETag for many days!

Is it possible that the latter is accidentally retained in the gPodder DB?

Rgds

Damon

DamonHD commented 2 months ago

And I just saw a good request come in, possibly from you (!):

|GET /rss/podcast.rss HTTP/1.1|Host:www.earth.org.uk|User-agent:gPodder/3.11.4 (+http%3a//gpodder.org/) Linux|Accept-Encoding:gzip, deflate, br|Accept:*/*|Connection:keep-alive|If-Modified-Since:Mon, 29 Apr 2024 09%3a52%3a04 GMT

Rgds

Damon

elelay commented 2 months ago

yes, completely: we keep the stored value if the server returns None or empty.

I don't know if some feeds need us to resend if-none-match or if-modified-since if it's not there anymore. Let's make a quick fix where if either last-modified or etag is present and the other is missing we clear the stored value.

Also I see now you send 429 Retry-After 50620s. This we could handle nicely if we stored a "not before" date in database. It is kind of working right now, with the feed refresh stuck, I guess for 14 hours... We'll need to handle it better...

DamonHD commented 2 months ago

Yes, your fix sounds the right thing to do: you should not retain either of those values not sent on the last 200 or 304 reponse AFAIK...

(I am sending the 429s to fight off some rabid unconditional requesters such as Amazon and save a little bandwidth.)

That Retry-After value is what you should also see in Cache-Control max-age for a 200 or 304 response, and my intention is that they should be treated the same way by the client, just as you suggest.

Rgds

Damon

DamonHD commented 2 months ago

BTW, just in case your client encounters any feed server with very broken config, and in this world there are bound to be some, I suggest that you cap the not-before delay at (say) 24h, at least initially, which still represents a lot of saving over being polled hourly for example.

Rgds

Damon

auouymous commented 2 months ago

Do you have a configuration option to force unconditional fetches that that other user might have enabled?

My debug extension allows the user to manually reset the etag for a feed, for testing purposes. It would be easy for a user to write their own extension that cleared all etags on startup and then run the cli program from cron. Or they use a simple sqlite command to clear all etags before running from cron. Or it isn't gpodder at all and someone is just using the user-agent.

I suggest that you cap the not-before delay at (say) 24h

A 24h delay is too much, you might update today at 5PM and tomorrow at 4PM, oops, gpodder says to wait an hour. I don't think anyone should be updating more than once a day, but imposing a delay could cause problems for some users and some feeds. Say we use a 12h delay, and someone's gpodder is polling a weather feed every 30 minutes, and the feed allows this (maybe it is even the user's feed on another device) but cache control isn't set. It isn't our job to bend the standards just because the server didn't enable cache control.

But we can and should support cache control to avoid polling feeds sooner than they requested.

DamonHD commented 2 months ago

As it happens a 12h cap would interact poorly sometimes wth what I have currently set up that has skipHours from 22:00Z to 07:59Z, and so max-age / Retry-After set as follows (Apache):

# Set RSS feed cache time, ie minimum poll interval.
# Give podcast RSS and similar feeds longer expiry out of work hours.
# Have any Retry-After match what the expiry would have been.
<If "%{TIME_HOUR} -lt 8 || %{TIME_HOUR} -gt 21">
    # This should be long enough to jump out of skipHours in one go.
    ExpiresByType application/rss+xml "access plus 10 hours 7 minutes"
    Header always set Retry-After "36420" env=REDIRECT_RSS_RATE_LIMIT
</If>
<ElseIf "%{TIME_HOUR} -gt 17">
    # Jump expiry right over coming skipHours block.
    ExpiresByType application/rss+xml "access plus 14 hours 7 minutes"
    Header always set Retry-After "50620" env=REDIRECT_RSS_RATE_LIMIT
</ElseIf>
<Else>
    # Give podcast RSS and similar feeds a default expiry time of ~4h.
    ExpiresByType application/rss+xml "access plus 4 hours 7 minutes"
    Header always set Retry-After "14720" env=REDIRECT_RSS_RATE_LIMIT
</Else>

I'm not suggesting that gPodder bend to suit my feed's particular whims, but a daily cap, sensibly applied, feels reasonable to me. More than an order of magnitude less wasteful for slow feeds than the common hourly polling, and two orders less than iTunes/Amazon/Spotify!

(Note that I am am thinking of pushing the default expiry up to 8h7 or 12h7.)

Rgds

Damon