nextcloud / news

:newspaper: RSS/Atom feed reader
https://apps.nextcloud.com/apps/news
GNU Affero General Public License v3.0

Broken caching behavior #2079

Closed flisk closed 1 year ago

flisk commented 1 year ago

I'm seeing a lot of requests for my blog feed from a NextCloud-News/1.0 user agent that isn't caching responses properly, because it's sending an If-Modified-Since header of Wed, 01 Jan 1800 00:00:00 GMT with no If-None-Match header for ETag matching.

Needless to say, I'm not particularly happy that these clients are fetching full copies of my feed each time for seemingly no reason, and I'm also not happy that this user agent reports itself as "version 1.0" when that string has been hardcoded as the client's user agent since forever and doesn't reflect the actual software version whatsoever.
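For comparison, this is roughly what a well-behaved conditional fetch cycle looks like: store the validators the server sent, and echo them back on the next request. A minimal Python sketch with hypothetical helper names (this is not Nextcloud News code):

```python
# Sketch of a well-behaved conditional fetch cycle (hypothetical helper
# names, not Nextcloud News code). The client persists the validators
# the server sent and echoes them back on the next request.

def conditional_headers(cache: dict) -> dict:
    """Build request headers from the previously stored validators."""
    headers = {}
    if cache.get("last_modified"):
        headers["If-Modified-Since"] = cache["last_modified"]
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]
    return headers

def handle_response(cache: dict, status: int, resp_headers: dict, body: bytes):
    """Return the new body on 200, or None on 304 (cached copy is still valid)."""
    if status == 304:
        return None  # nothing changed; the server sent no body
    # 200 OK: store fresh validators for the next request
    cache["last_modified"] = resp_headers.get("Last-Modified")
    cache["etag"] = resp_headers.get("ETag")
    return body
```

A client doing this never sends a fabricated 1800 date: it sends nothing on the first fetch and the server's own `Last-Modified`/`ETag` thereafter.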

Pure speculation, but this broken caching behavior might be why people are blocking your user agent (#2049). I myself have redirected requests with that 1800s If-Modified-Since header to an alternate feed telling users to please fix their clients.

For reference, I don't seem to be the first or only person upset about this: https://utcc.utoronto.ca/~cks/space/blog/web/VeryOldIfModifiedSince

So, would you folks mind fixing your client, and maybe making your user agent header a little more informative while you're at it? :)

SMillerDev commented 1 year ago

What is your feed? Because we send the modified date that Feed-IO tells us the feed reports.

flisk commented 1 year ago

It's https://flisk.xyz/feeds/blog.atom.xml, but the client is also requesting it through https://flisk.xyz/feeds/all.atom.xml, which is a legacy URL that 301-redirects to the first URL. Maybe the redirect breaks caching?

SMillerDev commented 1 year ago

@alexdebril do you have any idea what could be causing this?

alexdebril commented 1 year ago

I don't think the redirect breaks caching, because feed-io follows the redirect transparently, and it uses the feed's content to get its last update date (most of the time the headers are wrong – yeah, feed publishers make mistakes too).

Actually, the recommended usage of feed-io is this:

Because before each GET request, feed-io performs a HEAD request to check whether the feed was updated since the last hit.

Plus, feed-io analyses the feed to guess the best moment to hit it again. For that, it computes an average delay between items. So if a feed is updated only 3 times a day, there's a high chance feed-io will suggest coming back 8 hours later. Of course, there's a max value to avoid waiting for weeks before the next hit.
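The interval heuristic described above can be sketched like this (a simplified illustration, not feed-io's actual implementation; the cap value is an assumption):

```python
from datetime import datetime, timedelta

# Simplified illustration of the "average delay between items" heuristic
# described above; not feed-io's actual implementation. MAX_DELAY caps the
# suggested wait so a rarely-updated feed is still polled eventually.
MAX_DELAY = timedelta(days=7)  # assumed cap, not feed-io's real value

def suggest_next_fetch(item_dates: list, now: datetime) -> datetime:
    dates = sorted(item_dates)
    if len(dates) < 2:
        return now + timedelta(hours=1)  # no history: fall back to a short delay
    gaps = [b - a for a, b in zip(dates, dates[1:])]
    average = sum(gaps, timedelta()) / len(gaps)
    return now + min(average, MAX_DELAY)
```

With three items published 8 hours apart, the average gap is 8 hours, so the next fetch is suggested 8 hours out; a feed with months between items hits the cap instead.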

alexdebril commented 1 year ago

I tested with both URLs and I always get this from feed-io's command-line:

feed last modified: 2023-01-23T23:00:00+00:00
computed next update: 2023-03-29T20:36:00+00:00
minimum interval between items: 0 days, 0 hours, 0 minutes, 0 seconds
median interval: 59 days, 0 hours, 0 minutes, 0 seconds
average interval: 138 days, 17 hours, 40 minutes, 0 seconds
maximum interval: 928 days, 23 hours, 0 minutes, 0 seconds

I don't know how often NextCloud fetches this feed but even twice a day is far too much for sure.

alexdebril commented 1 year ago

The strategy I've explained above won't work for this feed:

curl -I https://flisk.xyz/feeds/blog.atom.xml
HTTP/1.1 200 OK
Server: nginx
Date: Fri, 27 Jan 2023 17:26:44 GMT
Content-Type: text/xml
Content-Length: 162220
Last-Modified: Fri, 27 Jan 2023 15:02:39 GMT
Connection: keep-alive
Vary: Accept-Encoding
ETag: "63d3e78f-279ac"
Strict-Transport-Security: max-age=63072000
Accept-Ranges: bytes

Its last item was published on 2023-01-23T23:00:00+00:00. As a consequence, feed-io will always perform the GET request.

To optimize this we must have two modified-since dates: one for the HTTP server, the other for feed filtering.
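The two-date split could look roughly like this (a sketch of the idea only, with hypothetical field names): one timestamp is echoed to the HTTP server, the other is used locally to drop already-seen items.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Sketch of the two-date idea above (hypothetical field names): one
# timestamp is echoed to the HTTP server as If-Modified-Since, the other
# is used locally to filter already-seen items out of the parsed feed.
@dataclass
class FeedState:
    http_last_modified: Optional[str]        # taken from the Last-Modified header
    newest_item_date: Optional[datetime]     # taken from the feed's own entries

def filter_new_items(state: FeedState, items: list) -> list:
    """Keep only (title, date) items newer than the newest one already stored."""
    if state.newest_item_date is None:
        return items
    return [(title, d) for title, d in items if d > state.newest_item_date]
```

This way a server whose Last-Modified is newer than the newest item (like the one above) still gets a proper conditional request, while item filtering stays based on feed content.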

Grotax commented 1 year ago

> What is your feed? Because we send the modified date that Feed-IO tells us the feed reports.

I don't think we do that. See #975: we used readSince with a timestamp and some code to handle it, but that didn't work in all cases. So currently feed-io always gets null as the time parameter.

I guess the version string could be adjusted to use the actual News version, if someone wants to implement that.

> I don't know how often NextCloud fetches this feed but even twice a day is far too much for sure.

By default, News has a schedule of once per hour, and it will fetch all feeds. We don't use that function to guess the next interval (I'm not sure if that was already in 4.x, which we are bound to due to PHP 7.4...).

alexdebril commented 1 year ago

> But that didn't work in all cases. So currently feed-io always gets null as time parameter

Yes, and I remember we did that because of the many ways HTTP servers provide news feeds; the only valid option was to fetch the feed and analyse its content to determine whether it had been updated or not. We had so many examples of users complaining about out-of-date feeds, leading to the conclusion that the feed itself was causing the issue for various reasons – fuzzy headers, for instance. Or some valid but really surprising implementations, like one security blog exposing the publication date only in the HTTP headers but not in the XML stream (sorry, but... why???).

So trying to find a solution to satisfy a few upset feed publishers will lead us to another round of troubles with angry users.

I suggest to change nothing and close the issue.

flisk commented 1 year ago

I strongly object to @alexdebril's suggestion. Not even attempting to cache responses because some servers return weird headers punishes the vast majority of well-behaved servers that return perfectly useful headers for caching.

I would instead suggest the following:

If you're concerned about servers sending bad cache headers – which I'm sure happens in the real world, though if it does, most other feed readers will also miss new items like @alexdebril described – you could use a strategy like fetching feeds in full without caching once a day to make sure nothing gets lost.

Intentionally refusing to cache responses because your client assumes all servers are incapable of providing useful cache headers is irrational and antisocial, and I really would prefer not rate-limiting access to my feeds based on clients' user agents. Just to reiterate: this behavior is probably why people are blocking your user agent (#2049). For some reason, this doesn't seem to be a problem with the majority of other feed readers.

alexdebril commented 1 year ago

If you need it so badly, then I suggest you submit some pull requests. You'll have to submit 4 of them:

In feed-io the package you'll need to update is the Adapter. And of course the unit tests are to be updated accordingly.

Looking forward to reviewing your PRs

flisk commented 1 year ago

I think you've misunderstood my motivation. It would be far less work to block the Nextcloud News user agent. This is your proverbial dog pooping on my lawn, not the other way around.

SMillerDev commented 1 year ago

In that metaphor you're currently asking us to cut the dog's legs off, and we're telling you that the dog's well-being is more important to us than your lawn.

flisk commented 1 year ago

In keeping with the metaphor, I'm asking you to teach your dog not to poop on my lawn. I think that's reasonable, because the vast majority of other dogs also don't poop on my lawn, which seems to imply this is possible.

Grotax commented 1 year ago

I think this discussion is pointless; we looked at the issue, and already back when the code was changed we knew that it would cause more traffic.

But since many people complained about missing items in their feeds due to strange setups on many servers, we changed it to the way it is now. And it has been this way for 2 years already.

In the current setup of News there is not much we can do: if we re-add fetching with the last-modified date, people will complain again about missing items. Or we leave it as is, and it causes more traffic.

I don't see an alternative solution to that.

flisk commented 1 year ago

I outlined a solution previously that I'll reiterate for clarity:

If your client continues to waste people's bandwidth, they either need to rate-limit it on their end, which is silly, or they'll choose to block your user agent outright, which is unfortunate but clearly already being done.

Grotax commented 1 year ago

I saw that, but it's not possible at the moment; the lib doesn't support that, to my understanding.

And unless someone implements that I don't see a way for us how to use that.

A full fetch once a day would also require quite a bit of code, I fear; not sure if someone wants to contribute that.

flisk commented 1 year ago

I'm leaving this nginx snippet here for any other admins who want to mitigate this:

location = /url-to-your-feed-here {
        if ($http_if_modified_since = "Wed, 01 Jan 1800 00:00:00 GMT") {
                return 429; # or some other response you think is appropriate
        }
}

I personally redirect those requests to an alternate feed file with a notice that my server won't serve clients that don't cache responses properly:

root /var/www/flisk.xyz;
rewrite ^.*$ /fix-your-client.xml break;

alexdebril commented 1 year ago

@Flisk a constructive contribution would be to submit a PR for improving this, that's how open source software works.

I've shared with you the basic information you need to do it. If you want more info to make your contribution, feel free to contact me on feed-io 's repository.

alexdebril commented 1 year ago

And one last thing: your nginx rule will stand 3 months, 6 maximum, because changing this hardcoded value will be really easy as soon as some end user complains about your feed not being consumed. I could also play with your nerves and pick a random value in a two-century range; it wouldn't have any impact on NextCloud.

But I prefer to act as a gentleman and review your PR as soon as you submit it.

flisk commented 1 year ago

@alexdebril The gentlemanly thing here would be to write your software in a way that doesn't waste other people's resources, and not to ask those other people to fix it themselves when they tell you what your software is doing. I've already spent my free time trying to explain to you that this is a problem and suggesting solutions, and I honestly find your suggestion that I teach your dog not to poop on my lawn a little insulting. It's not my dog.

alexdebril commented 1 year ago

Ok, let me get this straight: if you don't want any "dog on your lawn", then shut down your server.

But if you want to improve things, make a PR as many people did before you and like a lot of people will do in those repositories. If you can't write the code for some reason, just write a clean ticket describing the desired behaviour ONLY, accept our views regarding past challenges and the possible solutions we could find, be patient and maybe one day a contributor will write the code, merge it and release it.

I'm spending free time too on this and guess what, I spent a lot of it in the past to make users happy for free. Considering the huge amount of hours we spent coding this and supporting developers, users, finding solutions to tricky issues and stuff like this, OF COURSE you can help and make a contribution. And the bare minimum would be to show some respect and keep your "dirty lawns" metaphor for yourself, nobody needs this here.

flisk commented 1 year ago

I won't be participating in this discussion any further. Problem's solved on my end.

Mynacol commented 1 year ago

I noticed more than 50 MB of traffic every Nextcloud cron job / every five minutes as well. While I'm not really impacted by that, I think it is a giant waste of resources. As I understand both sides of the story (caching is needed, but some websites implement If-Modified-Since wrongly, leading to missing feed entries), what are potential next steps? My ideas:

Re-Implement If-Modified-Since with fallback

The usual requests use conditional GET to be a civilized client. Once every day (how do we get this timestamp/signal?) we query all feeds unconditionally.

This way we use caching properly but still catch feed items from broken servers, albeit with a delay. Seems to be the most minimal solution to me.
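The decision itself is small; a sketch in Python (hypothetical names, assuming the time of the last unconditional fetch is persisted per feed):

```python
from datetime import datetime, timedelta

# Sketch of the daily-fallback idea above (hypothetical names, assuming
# the last unconditional fetch time is persisted per feed).
FULL_FETCH_INTERVAL = timedelta(days=1)  # assumed: one unconditional fetch per day

def use_conditional_get(last_full_fetch, now: datetime) -> bool:
    """True -> send If-Modified-Since/If-None-Match; False -> fetch unconditionally."""
    if last_full_fetch is None:
        return False  # never fully fetched this feed: do it now
    return now - last_full_fetch < FULL_FETCH_INTERVAL
```

Every hourly cron run then does a conditional GET, except the one run per day (per feed) where the threshold has elapsed.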

Enhancement: Check already saved last-modified feed entries

If we obtained new/modified items since the previous day's unconditional request, we can skip this feed or do a normal conditional request. This slightly rewards well-behaved servers and minimizes traffic. Should be fairly easy to implement.

Enhancement: Smartly detect non-behaving servers

If we detect new items on unconditional requests that we should have gotten with the conditional requests, we mark the server as broken and do an unconditional request on every query. This would alleviate the longer waiting times for misbehaving servers in the above solutions, but is more difficult to implement, requiring new database fields.

Furthermore, we might reduce the frequency of unconditional requests on seemingly well-behaving feeds. Even more complicated, not important for now.

Use ETags and If-None-Match

This should be harder to break server-side and avoids the whole date-parsing problem of If-Modified-Since. Unfortunately, I believe this would require changes to Feed-IO and Nextcloud News picking them up, which only happens when supported Nextcloud versions require at least PHP 8.1. Therefore: distant future.

alexdebril commented 1 year ago

@Mynacol thanks for this, it's very clear. I think it can influence the way feed-io hits feeds and filters the output, so the first step would be to improve the way feed-io interacts with feeds (plus it would benefit other clients too).

I need to "extract" parts from your comment into a fresh new issue in feed-io's backlog. As Nextcloud is one of the biggest contributors, I still maintain version 4.9.x, so migrating to PHP 8.1 won't be a mandatory step. Even if I highly encourage everyone to upgrade!

Mynacol commented 1 year ago

Should we close this issue, as https://github.com/nextcloud/news/pull/2119 is now implemented and released? Thx a lot @Grotax for the implementation btw!

The ETag improvement is not urgent and should be moved to another issue if someone is interested in tracking that.

Grotax commented 1 year ago

Hey, yes, I think it can be closed. We do send the httpLastModified now; additionally, headers are now sent to use compression, which should also reduce the traffic News causes. We also upgraded to the next feed-io branch, 5.x, which removes the filtering that caused issues with some feeds.

I think there are more nice ideas in this issue, but since the discussion was mostly about who is responsible for what, I will close this.