lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Handle redirects and gone feeds gracefully #246

Open lemon24 opened 3 years ago

lemon24 commented 3 years ago

https://feedparser.readthedocs.io/en/latest/http-redirect.html

If you are polling a feed on a regular basis, it is very important to check the status code (d.status) every time you download. If the feed has been permanently redirected, you should update your database or configuration file with the new address (d.href). Repeatedly requesting the original address of a feed that has been permanently redirected is very rude, and may get you banned from the server.

Repeatedly requesting a feed that has been marked as “gone” is very rude, and may get you banned from the server.

lemon24 commented 1 year ago

Related comment:

https://github.com/lemon24/reader/blob/836ff81cf68343b415fb4956d8c69266120f3269/src/reader/_update.py#L460-L461

Misc thoughts:

zifot commented 1 year ago

Just a thought.

Consider API semantics that allow a plugin to only mark a feed URL for change. Then, after all of the plugins have run, you check whether any plugin requested a change (and maybe make sure only one did?), and make the change itself as part of the processing mechanism that runs outside of the plugins.

This is a subtle change, but that way you (probably) can drop the requirement that such a plugin must run last. Also, this seems to simplify the issues you mention in the last point, and it lets you check whether such a request makes sense in the context of other plugins or other external factors that may occur.

EDIT: typo

lemon24 commented 1 year ago

@zifot, that's actually a great idea, thank you!

I think it's doable right now with tags:

def after_feed_update(reader, feed, ...):
    # runs for each feed
    new_url = is_permanent_redirect(feed, ...)
    if new_url:
        reader.set_tag(feed, '.url-change-needed', new_url)

def after_feeds_update(reader):
    # runs after all the feeds
    for feed in reader.get_feeds(tags=['.url-change-needed']):
        new_url = reader.get_tag(feed, '.url-change-needed')
        # for later: how do we deal with InvalidFeedURLError?
        reader.change_feed_url(feed, new_url)
        # the feed (and its tags) now lives under new_url
        reader.delete_tag(new_url, '.url-change-needed')

Note to self: This seems like a very useful pattern, mention it in the docs for plugin authors (when we have them). The way we're handling .reader.dedupe.once for entry_dedupe is vaguely similar (mark, then change).
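
For the "make sure only one did it?" part of the suggestion, here is a rough, self-contained sketch of the same mark-then-change idea with a conflict check. It does not use the real reader API: `FakeReader`, `request_url_change()`, and `apply_url_changes()` are hypothetical names made up for illustration, and storing one request per plugin under the tag is an assumption, not something reader does today.

```python
TAG = '.url-change-needed'


class FakeReader:
    """In-memory stand-in for illustration only; NOT the real reader API."""

    def __init__(self, urls):
        # feed URL -> {tag key: tag value}
        self.tags = {url: {} for url in urls}

    def change_feed_url(self, old, new):
        if new in self.tags:
            raise ValueError(f"feed already exists: {new}")
        # tags travel with the feed to its new URL
        self.tags[new] = self.tags.pop(old)


def request_url_change(reader, url, new_url, requested_by):
    """Mark phase: a plugin records its request, nothing changes yet."""
    requests = reader.tags[url].setdefault(TAG, {})
    requests[requested_by] = new_url


def apply_url_changes(reader):
    """Change phase: runs once, outside the plugins, after all feeds."""
    changed, skipped = {}, {}
    # iterate over a snapshot, since changing URLs mutates the mapping
    for url in list(reader.tags):
        requests = reader.tags[url].pop(TAG, None)
        if not requests:
            continue
        targets = set(requests.values())
        if len(targets) != 1:
            # plugins disagree on the new URL; leave the feed alone
            skipped[url] = sorted(targets)
            continue
        new_url = targets.pop()
        try:
            reader.change_feed_url(url, new_url)
        except ValueError:
            # the target URL is already taken by another feed
            skipped[url] = [new_url]
            continue
        changed[url] = new_url
    return changed, skipped


reader = FakeReader(['http://a/old', 'http://b/old', 'http://c'])
request_url_change(reader, 'http://a/old', 'http://a/new', 'plugin-one')
request_url_change(reader, 'http://b/old', 'http://b/x', 'plugin-one')
request_url_change(reader, 'http://b/old', 'http://b/y', 'plugin-two')
changed, skipped = apply_url_changes(reader)
```

Because the plugins only mark and the change happens in one place afterwards, plugin order stops mattering, and conflicting or invalid requests can be rejected there instead of inside each plugin.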