samuelclay / NewsBlur

NewsBlur is a personal news reader that brings people together to talk about the world. A new sound of an old instrument.
http://www.newsblur.com
MIT License

Backend: RFC 5005 feed history support (RSS backfill) #1109

Closed jameysharp closed 2 years ago

jameysharp commented 6 years ago

One long-standing flaw in using RSS/Atom to read long-form works like webcomics and fanfiction is that feeds usually contain only a limited number of the most recent entries, so you have to catch up by reading the site directly and then switch tools to get notified about new posts. In my experience it's much nicer to use a single tool to keep track of how much of the story I've already read, no matter how far back in the history I leave off. A decade ago I built Comic Rocket as a proprietary tool to do that, but now I'm hoping to advocate for a more standards-based approach.

It looks like NewsBlur saves old feed entries even after they disappear from the origin feed, but this doesn't reliably solve the problem, especially since creators often insert/edit/delete old pages and there's no way to detect that the cached feed entries are no longer valid.

RFC 5005, "Feed Paging and Archiving", addresses this problem, and I'd like to encourage people to adopt it. It was standardized in 2007, but seems to have languished in obscurity. I'm not aware of any publishers using it today aside from people I've personally encouraged, although it'd be fascinating to find out whether any feeds that newsblur.com has seen either use the http://purl.org/syndication/history/1.0 XML namespace, or contain <link rel="prev-archive">. There's something of a catch-22 here since publishers don't have much incentive to implement the spec if feed readers don't understand it, and vice versa.

That said, I'm working on various tools to generate full-history feeds by crawling arbitrary sites, as a transitional measure. So I'm hoping to find a project like NewsBlur that's willing to be an early adopter for the reader side of the spec.

The spec is nice in that conforming feeds are still usable by feed readers that don't understand the RFC 5005 metadata, but readers that do can save the complete history of all entries and efficiently discover changes to archived entries.

If you want to implement RFC 5005, I think the easiest first step is to check each feed for the <fh:complete/> tag specified in section 2, "Complete Feeds". If present, then you can delete all entries which you previously saved from that feed if they're no longer present in the current version of the feed. This is a very simple solution for feeds that don't have much history.
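For illustration, here's a minimal sketch of detecting that marker by parsing the raw XML. This is my own sketch, not NewsBlur code; the namespace URI is the one quoted above, and `is_complete_feed` is just a name I made up:

```python
# Minimal sketch: detect the RFC 5005 section 2 <fh:complete/> marker.
# The element lives in the http://purl.org/syndication/history/1.0 namespace
# and appears under the top-level atom:feed (or rss channel) element.
import xml.etree.ElementTree as ET

FH_NS = "http://purl.org/syndication/history/1.0"

def is_complete_feed(feed_xml: str) -> bool:
    """Return True if the document declares itself a complete feed."""
    root = ET.fromstring(feed_xml)
    return root.find(f".//{{{FH_NS}}}complete") is not None
```

If that returns True, any previously stored entry whose ID is absent from the current document can safely be dropped.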

For feeds that would be excessively large if the publisher put the full history in one feed document, there's section 4, "Archived Feeds". To implement this, you'd check for <link rel="prev-archive"> in each feed, and concatenate the linked feed's entries, following further prev-archive links until there aren't any more. You can cache an archive feed from a given URL forever:

The requirement that archive documents be stable allows clients to safely assume that if they have retrieved one in the past, it will not meaningfully change in the future. As a result, if an archive document's contents are changed, some clients may not become aware of the changes.
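To make the walk concrete, here is a rough sketch of mine (with made-up names) using feedparser, which surfaces prev-archive links in `feed.links` as shown in the transcript later in this thread. A real implementation would also cache archive documents, since the spec lets you treat them as stable:

```python
# Rough sketch: collect the full history by following rel="prev-archive"
# links until there are no more (with a loop guard and a page cap).
import feedparser

def fetch_full_history(feed_url, max_pages=1000):
    entries, seen_urls = [], set()
    url = feed_url
    for _ in range(max_pages):
        if url in seen_urls:          # guard against prev-archive loops
            break
        seen_urls.add(url)
        parsed = feedparser.parse(url)
        entries.extend(parsed.entries)
        prev = [l for l in parsed.feed.get("links", [])
                if l.get("rel") == "prev-archive"]
        if not prev:                  # reached the oldest archive document
            break
        url = prev[0]["href"]
    return entries
```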

I expect some conforming publishers will change an archive feed's URL if they need to update archived entries (although this is arguably discouraged in the spec). So you'd want to ensure that you can detect that a previously-seen archive feed is no longer in the chain of prev-archive links, and delete any entries that don't appear in the rest of the feed. I imagine the easiest way to do that is to reconstruct the feed history from scratch on every update, but I can imagine other alternatives.
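One possible shape for that check (again just a sketch, with hypothetical names): remember which archive URL each stored entry came from, re-walk the prev-archive chain on each poll, and drop entries whose archive URL is no longer linked:

```python
# Sketch: prune entries that came from archive documents which have since
# been removed from the prev-archive chain.
def prune_removed_archives(stored_entries, current_archive_urls):
    """stored_entries maps entry id -> URL of the archive it was seen in."""
    current = set(current_archive_urls)
    return {eid: url for eid, url in stored_entries.items() if url in current}
```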

That, plus the duplicate-detection and UI recommendations in section 4.2, should be everything you need to know about this part of the standard, I think.

(I skipped section 3, "Paged Feeds", because I don't think it's relevant to a NewsBlur-style feed reader, but maybe there's a good use case for it that I haven't thought of.)

I might be able to put together a pull request for this, if it doesn't sound like wasted effort and if someone can advise me on how this might fit into the current code base. What do you think?

samuelclay commented 6 years ago

Hey, I really love this idea, and I read the thread of your work on Jekyll two days ago (smart thinking to start that PR first), but I'm not sure NewsBlur is going to be a good fit, for one reason: the cost of archiving. NewsBlur today supports at most the 500 most recent stories for a feed. It actively deletes stories over that threshold, not counting shared and saved stories.

It's enormously expensive to host all that content, so unfortunately I have to trim it regularly. In fact, if I were to do the math, I think cleaning the archive consumes 10-20% of my feed fetcher's time.

Now if people run their own NewsBlur instance, there's an easy one-line change to boost that number to virtually unlimited, but the main hosted instance will have to keep that limit for performance and cost reasons. I wish I were as big as even a small chunk of Google and had the resources to archive the web, but NewsBlur pulls from tens of millions of websites, and all those stories have to go somewhere.

Plus, I immediately found a few "rss-bombs" that constantly publish huge globs of randomized data. Filling out an archive isn't something enough users are requesting to make it a high priority, so I don't have the financial incentive I'd need to support it. I wish it weren't so.

dosiecki commented 6 years ago

A bit of a wild idea, but could there be a way to crowdfund the long-term storage of feed content? That is to say, any user could pay a few dollars to extend the storage for a given feed for an extra few thousand stories for a year or two. Popular feeds could attract multiple supporters and gain virtually unlimited storage. Users with a particular interest in an obscure feed could personally "archive" it by way of supporting it alone. I'd do this for certain feeds for sure!

jameysharp commented 6 years ago

Thank you for this thoughtful response! I hadn't considered storage costs on your end; I will now keep that in mind as I talk with other folks and work on my own implementations.

I would still like to encourage implementing at least RFC5005 section 2, which would let you delete items as soon as they disappear from a conforming feed. That can only save you storage, right?

I do wonder if you could support section 4 by lazily loading feed pages from the origin server as the user browses through the history, treating your storage as purely a cache for such feeds. That might even allow you to be more aggressive about discarding items from your cache? I feel a little overwhelmed just imagining implementing that, but I thought I'd throw the idea out there anyway.

In case you revisit this in the future, I'll mention that in addition to jekyll/jekyll-feed#236 which you already saw, I've also built https://fh.minilop.net/ and https://github.com/jameysharp/wp-fullhistory as two other implementations of sections 2 and 4 of RFC5005.

mockdeep commented 6 years ago

This would be amazing! @samuelclay, I'm not sure how much something like that would cost per user, but I'd be willing to pay for a higher-cost tier if it were available. Maybe some of the cost could be limited by not making it the default behavior, but instead letting users "download feed backlog" somewhere for individual feeds.

samuelclay commented 2 years ago

Pinging back on this thread by way of https://github.com/jekyll/jekyll-feed/pull/236, as I'm about to launch this feature. So, good news: this is now a high priority and is well on its way to the public.

jameysharp commented 2 years ago

Cool! Anything I can do to help?

You might like to take a look at this prototype feed-reader I built in 2020 (https://github.com/jameysharp/crawl-rss) which demonstrates an algorithm and database schema that made sense to me, and includes a bunch of unit tests covering different kinds of edits publishers might make to their feed history. I licensed that AGPLv3 but if anyone wants to reuse any of the tests, just go for it.

In the jekyll-feed issue you mentioned "joining WordPress in automatically enabling feed paging"; do you know something I don't? To the best of my knowledge, the only discussion that's ever happened about this on the WordPress side is the discussion thread and issue that I opened; the latter never got any response at all.

That said, my wp-fullhistory plugin still seems to work; it's running at https://news.comic-rocket.com for example. In addition, my crawl-rss prototype feed reader demonstrates using a WordPress-specific stateless proxy I wrote (https://github.com/jameysharp/wp-5005-proxy) that synthesizes a full-history feed for any existing WordPress install, without needing cooperation from the publisher's side.

While RFC5005 support remains sparse among publishers right now, I think delegating to specialized HTTP proxies like wp-5005-proxy could be a good way to adapt many sites which don't already support RFC5005. It makes for a pretty simple "plugin API", in my opinion. Perhaps you'd like to do something similar as well?

samuelclay commented 2 years ago

In the jekyll-feed issue you mentioned "joining WordPress in automatically enabling feed paging;" do you know something I don't?

I'm still doing development work so I don't have real numbers yet (although this query will inform which numbers I surface), but in testing a subset of my own feeds, I went from 26,225 stories under NewsBlur's limit to 49,263. And that's with a page limit of 100, which I will probably boost to 500.

This works by adding ?page=N and ?paged=N to the feed URL and checking whether pages 1 through 3 return different stories; if they do, I keep going until no new stories are seen. That's worked for a ton of feeds, so I assumed they were all WordPress, but it's possible they're not. It might just be the result of a few high-volume feeds, in which case the numbers I need to pull are how many feeds go beyond page 4 (in other words, support paging) and how many stories are found per archive feed.
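For reference, the probing approach looks roughly like this. This is only a sketch; `probe_paged_feed` and the 100-page cap are illustrative, not the production code:

```python
# Sketch of the probing approach: bump a pagination query parameter and keep
# fetching while previously unseen stories keep appearing.
import feedparser

def probe_paged_feed(feed_url, param="page", max_pages=100):
    sep = "&" if "?" in feed_url else "?"
    seen_ids, entries = set(), []
    for page in range(1, max_pages + 1):
        parsed = feedparser.parse(f"{feed_url}{sep}{param}={page}")
        new = [e for e in parsed.entries
               if e.get("id", e.get("link")) not in seen_ids]
        if not new:                       # no unseen stories: stop paging
            break
        seen_ids.update(e.get("id", e.get("link")) for e in new)
        entries.extend(new)
    return entries
```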

samuelclay commented 2 years ago

Funny enough, after looking through RFC5005 I realize I'm not following it at all, so I'll try and implement that behavior as well today.

jameysharp commented 2 years ago

Ah, that explains it!

As far as I know, you're right that all WordPress installations support pagination query parameters. But I gather your use case requires noticing when someone changes, deletes, or backdates their posts, so you can report on the modification history. Then the question is, once you've fetched all the history, how do you detect further changes in the history?

If you rescan all the archives every time you poll the feed, publishers will scream. People already don't like having their servers hammered by feed readers that don't implement HTTP caching correctly, and this would be much worse.

The RFC provides an efficient way to discover changes. In the common case where none of the history has been edited, a compliant feed reader will only fetch the current feed document, just like a reader that doesn't fetch history at all. With default WordPress pagination, if you tell it to sort by last modified date instead of publication date, then you can get the common case down to two requests, but not one.

Although it may not be relevant for your use case, the RFC also enables feed readers to treat their local copy of the feed purely as a cache. They can discard entries at any time because the URL of the archive feed that the entry came from is stable. Default WordPress pagination doesn't provide that stability, either.

My wp-5005-proxy uses a couple tricks to statelessly add the necessary information. Because it's stateless, every time a request comes in for the current feed document, the proxy may do O(log n) HEAD requests to identify how many archived pages there are, and an extra GET request to check whether the history has changed. You can usually eliminate the HEAD requests if you keep track of how many pages there were the last time you checked.
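To illustrate the O(log n) part, here's my own sketch of the general technique, not wp-5005-proxy's actual code; it assumes out-of-range pages return a non-200 status. Exponential search finds an upper bound, then binary search finds the last page that exists:

```python
# Sketch: find the highest existing page number with O(log n) HEAD requests.
import requests

def page_exists(feed_url, page):
    r = requests.head(f"{feed_url}?paged={page}", allow_redirects=True)
    return r.status_code == 200

def last_page(feed_url):
    lo, hi = 1, 2                       # assume page 1 exists
    while page_exists(feed_url, hi):    # exponential search for a missing page
        lo, hi = hi, hi * 2
    while hi - lo > 1:                  # binary search: lo exists, hi doesn't
        mid = (lo + hi) // 2
        if page_exists(feed_url, mid):
            lo = mid
        else:
            hi = mid
    return lo
```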

My wp-fullhistory plugin works similarly, except as a WordPress plugin it can just do quick database queries for the necessary information. Doing the work on the publisher's side is clearly better.

My patched jekyll-feed plugin, on the other hand, implements the RFC using git-style hash chains over the archive pages. It demonstrates that publishers can implement the RFC by doing work only when the feed changes, rather than every time a reader requests the feed.
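The general idea, very roughly sketched (my own illustration of hash chaining, not the jekyll-feed patch itself): derive each archive page's URL from a hash that covers that page and everything before it, so a URL only changes when the history up to that point changes:

```python
# Sketch: git-style hash chain over archive pages; URLs stay stable unless
# the page or any earlier page changes.
import hashlib

def archive_urls(pages, base="https://example.com/archive"):   # base is hypothetical
    digest = b""
    for page in pages:        # pages: serialized archive documents, oldest first
        digest = hashlib.sha256(digest + page.encode()).digest()
        yield f"{base}/{digest.hex()[:12]}.xml", page
```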

I hope this has helped clarify why I think this standard is important if you want feed history.

samuelclay commented 2 years ago

Then the question is, once you've fetched all the history, how do you detect further changes in the history?

I periodically force a re-fetch of the entire history. So changes eventually make their way in but not immediately.

Although it may not be relevant for your use case, the RFC also enables feed readers to treat their local copy of the feed purely as a cache.

I noticed this but you're right, I'm not deleting stories that are removed from a publisher's archive. Publishers will sometimes email me to ask, and I am always happy to remove it that way. But if a user saved or shared the story, that saved or shared story continues to exist.

I hope this has helped clarify why I think this standard is important if you want feed history.

Agreed that this would be an ideal world, but the reality, I think, is that the page and paged parameters won out, and it's up to feed readers to do the dirty work of distinguishing changes and staying up to date, as was the case before PubSubHubbub.

jameysharp commented 2 years ago

Have you found any feeds which do support page or paged query parameters, but which are not generated by WordPress?

As far as I can tell, the fact that this works for WordPress is a happy accident. They use the same query implementation for feeds that they use for HTML archives. Since the latter needed pagination, the former gets it too.

I'd be quite surprised if any static site generators support these query parameters, since every user would have to add unusual options to their web server configuration to preserve query parameters when serving static files.

I'm also skeptical that any other CMS implemented this style of query parameter deliberately, since there haven't been any feed readers that would use it.

But if you find any counter-examples I would be very interested in looking into them!

You might also look for atom:link tags with rel=next or rel=prev. I believe Mastodon implemented the latter. It's a natural thing to think of for people familiar with HTML's use of those link relations, so I wouldn't be surprised if that's more common.

It's also standardized in section 3 of RFC5005, except that the spec spells the latter "previous", a form that was dropped in HTML5. That document also has big warnings about this approach:

"Paged feeds are lossy; that is, it is not possible to guarantee that clients will be able to reconstruct the contents of the logical feed at a particular time. Entries may be added or changed as the pages of the feed are accessed, without the client becoming aware of them."

"Therefore, clients SHOULD NOT present paged feeds as coherent or complete, or make assumptions to that effect."

samuelclay commented 2 years ago

@jameysharp Ok, I've just about implemented it but there's an issue coming from the test server you put up. I start with the first URL, and each successive URL is the prev-archive link. Notice the protocols and ports. Redirects aren't working as I would expect.

>>> import feedparser
>>> from pprint import pprint
>>> pprint(feedparser.parse("https://fh.minilop.net/7/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d").feed.links)
[{'href': 'http://fh.minilop.net/e/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'alternate',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net/f/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'current',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net/8/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'next-archive',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net/6/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'prev-archive',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net/7/America%2BNew_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'self',
  'type': 'application/rss+xml'}]
>>> pprint(feedparser.parse("http://fh.minilop.net/6/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d").feed.links)
[{'href': 'http://fh.minilop.net:443/e/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'alternate',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net:443/f/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'current',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net:443/7/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'next-archive',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net:443/5/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'prev-archive',
  'type': 'text/html'},
 {'href': 'http://fh.minilop.net:443/6/America%2BNew_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
  'rel': 'self',
  'type': 'application/rss+xml'}]
>>> pprint(feedparser.parse("http://fh.minilop.net:443/5/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d"))
{'bozo': 1,
 'bozo_exception': SAXParseException('mismatched tag'),
 'encoding': 'us-ascii',
 'entries': [],
 'feed': {'summary': '<center><h1>400 Bad Request</h1></center>\n'
                     '<center>The plain HTTP request was sent to HTTPS '
                     'port</center>\n'
                     '<hr /><center>nginx</center>'},
 'headers': {'connection': 'close',
             'content-length': '264',
             'content-type': 'text/html',
             'date': 'Fri, 29 Apr 2022 15:28:14 GMT',
             'server': 'nginx',
             'strict-transport-security': 'max-age=15724800; '
                                          'includeSubdomains'},
 'href': 'http://fh.minilop.net:443/5/America+New_York/27438e518e/https:/fh.minilop.net/%25Y/%25m/%25d',
 'namespaces': {},
 'status': 400,
 'version': ''}
jameysharp commented 2 years ago

Ugh, my personal server is misconfigured, I guess. Try these:

Also, here are a couple of examples of feeds which aren't paginated but which include the fh:complete tag from RFC5005 section 2, in case you want to do anything with that:

samuelclay commented 2 years ago

Looking good!

[Screenshot: Screen Shot 2022-04-29 at 3 22 50 PM]