Non-ParseError exceptions during update_feeds() prevent updates for the remaining feeds

lemon24 commented 3 years ago

From https://github.com/lemon24/reader/issues/204#issuecomment-780553373:

Currently, storage or plugin-raised exceptions prevent updates for the following feeds, but that's not necessarily by design.

We should look at this when we expose the plugin API (#80).

lemon24 commented 1 year ago

...like here: https://github.com/lemon24/reader/issues/310#issuecomment-1593699333 (the Twitter plugin).

lemon24 commented 1 year ago

The main goal here is to allow updates to continue after a feed has failed.

Things to take into consideration:

We don't really need to wrap hook (including Session hooks) and retriever/(sub)parser calls one by one, we may get away with doing it somewhere at "top level", e.g. here in Pipeline.update().
- But, we should be consistent / uniform.
OTOH, when should unexpected exceptions (bugs) not allow updates to other feeds to continue?
1. reader "infrastructure" code (Parser, updater) should raise immediately; this is a reader bug, and Should Never Happen™.
  - How about StorageErrors (e.g. when the updater stores the feed)?
2. reader built-in retrievers/parsers should also probably raise immediately, but it may be acceptable to continue.
3. User-/plugin-supplied retrievers/parsers should continue (we've seen at least one bug here, in the Twitter plugin).
4. Update hooks should continue (we've seen a bug here too, IIRC).
  - What about {before,after}_feeds_update_hooks, which go "around" the update loop? We don't need to wrap them, but we may want to, for consistency.
  - What if an after_feeds_update_hooks depends on an after_feed_update_hooks having run successfully? Arguably, it shouldn't (we might want to mention this somewhere in the docs).
  - What about a Session hook, when called by a built-in retriever? It should continue.
  - Update: These already raise ParseError (when called by the HTTP retriever).
Update: How granular should we be with allowing things to continue?
- What do we return after_{entry,feed}_update_hooks fails?
- In theory, the feed was updated at this point, so we have an UpdatedFeed we can return; we could return the exception "on the side" somehow (should have UpdateResult and UpdatedFeed been a single struct from the beginning?).
- OTOH, hooks "should not fail" – if the user said a plugin should run, the plugin failing should be an error.
- Of course, we can go as granular as it is useful – but how granular is that? (See the next question too.)
- What happens if after_entry_update_hooks fails? Do we try to run it for the rest of the entries?

lemon24 commented 1 year ago

Some thoughts on update hook exception handling (also see the "How granular ..." question above):

Hooks "should not fail". Treating them otherwise opens up a lot of other decisions that need to be made, that potentially makes the end user API too complicated.
Currently, a hook failure for a feed raises immediately – that is, it prevents all other hooks from running for that feed.
Hooks that want to be resilient can follow the "tag before, clear tag after" pattern mentioned here: https://github.com/lemon24/reader/issues/246#issuecomment-1596097300. This allows completing the work the plugin was supposed to do on a subsequent run.
- This might prove to not be enough / make writing plugins too complicated. We can add more granularity later, if needed, but at the expense of API churn for plugins.
Nevertheless, failures happen; ideally, it should be hard for an updated feed to remain unprocessed because of a failure; moreso, a hook that works should not have to pay the price for one at the same stage that doesn't.
We can run all the hooks and raise an ExceptionGroup (Python 3.11) for the hooks that failed.
- But, if we do this for all stages, we or the plugins might need to coordinate between hooks belonging to the same plugin, e.g. before_feed_... and a corresponding after_feed_....
- For stages before the feed is updated (before_feed_...), does it make sense to continue with the feed update? If we fail early here, it simplifies the mental model quite a bit.
- For after_entry_..., raising an ExceptionGroup isn't all that useful (and we might end up with thousands of exceptions).

Somewhat related: We continue with at least the granularity of one feed mostly because update_feeds_iter() is a thin wrapper over for feed in get_feeds(): update_feed(feed) (but with exception handling; we should document this).

lemon24 commented 1 year ago

OK, here's how update hooks get called (pseudocode; might want to document it when we're sure the order stays):

for hook in before_feeds_update_hooks:
    hook(reader)

for feed in get_feeds():
    try:
        parsed_feed = parse(feed)
    except ParseError as e:
        parsed_feed = e

    intents = make_intents(feed, parsed_feed)

    for hook in before_feed_update_hooks:
        hook(reader, feed.url)

    store(intents)

    for entry in intents.entries:
        for hook in after_entry_update_hooks:
            hook(reader, entry.entry, entry.status)

    for hook in after_feed_update_hooks:
        hook(reader, feed.url)

    if not parsed_feed or isinstance(parsed_feed, Exception):
        yield feed.url, parsed_feed
    else:
        yield feed.url, make_updated_feed(intents)

for hook in after_feeds_update_hooks:
    hook(reader)

Following is a proposal that takes into account the notes from previous comments.

Goals:

Wrap hook failures so they don't prevent updates for other feeds during update_feeds_iter().
Wrap hook failures so it is clear where the error came from (which phase, optionally which hook).
After data has been written to storage, do our best to allow hooks to run (if one fails, the others should still run).

Hooks wrap unexpected exceptions as follows (new exceptions detailed below):

before_feeds_update_hooks -> raise SingleUpdateHookError
- fail early at the first hook failure
- not backwards compatible; however, exceptions are currently raised directly, which is arguably a bug
before_feed_update_hooks -> yield url, SingleUpdateHookError
- fail early at the first hook failure
- backwards compatible, we can already yield ReaderError
after_entry_update_hooks -> yield url, UpdateHookErrorGroup
- after also running after_feed_update_hooks
- if there are concerns about keeping to many exceptions around (max is entries * entry hooks), we can keep the first 5 or something, and log the rest; there is precedent for this
- backwards compatible, we can already yield ReaderError
after_feed_update_hooks -> yield url, UpdateHookErrorGroup
- including errors from after_entry_update_hooks
- shadows UpdatedFeed, acceptable (for now)
- backwards compatible, we can already yield ReaderError
after_feeds_update_hooks -> raise UpdateHookErrorGroup
- not backwards compatible, same as before_feeds_update_hooks

We add the following new exceptions:

ReaderError (old)
 +--- UpdateError
       +--- UpdateHookError
       |     +--- SingleUpdateHookError
       |     +--- UpdateHookErrorGroup [ExceptionGroup]
       +--- ParseError (old)

UpdateHookError will not be raised directly; it exists and has the shorter name because most people won't care about the individual errors (more than to log them), and will do except UpdateHookError; interested people can do except* SingleUpdateHookError to look at individual errors. SingleUpdateHookError cannot just be UpdateHookError because it will have attributes about the hook and stage that UpdateHookErrorGroup cannot not have.

For Python 3.10, which doesn't have ExceptionGroup, we use a shim like this one:

```python import traceback class _ExceptionGroup(Exception): """ExceptionGroup shim for Python 3.10. Caveat: The tracebacks always show in str(exc). Avoids dependency on https://pypi.org/project/exceptiongroup/ """ def __init__(self, msg, excs): super().__init__(msg, tuple(excs)) @property def message(self): return self.args[0] @property def exceptions(self): return self.args[1] def _format_lines(self): count = len(self.exceptions) s = 's' if count != 1 else '' yield f"{self.message} ({count} sub-exception{s})\n" for i, exc in enumerate(self.exceptions, 1): yield f"+{f' {i} '.center(36, '-')}\n" for line in traceback.format_exception(exc): for line in line.rstrip().splitlines(): yield f"| {line}\n" yield f"+{'-' * 36}\n" def __str__(self): return ''.join(self._format_lines()).rstrip() try: ExceptionGroup except NameError: ExceptionGroup = _ExceptionGroup # example try: 1/0 except Exception as e: one = e raise _ExceptionGroup('stuff happened', [ NameError('name'), _ExceptionGroup('more stuff happened', [ one, AttributeError('attr'), ]), ]) """ Traceback (most recent call last): File ".../exc.py", line 47, in raise _ExceptionGroup('stuff happened', [ __main__._ExceptionGroup: stuff happened (2 sub-exceptions) +---------------- 1 ----------------- | NameError: name +---------------- 2 ----------------- | _ExceptionGroup: more stuff happened (2 sub-exceptions) | +---------------- 1 ----------------- | | Traceback (most recent call last): | | File ".../exc.py", line 43, in | | 1/0 | | ZeroDivisionError: division by zero | +---------------- 2 ----------------- | | AttributeError: attr | +------------------------------------ +------------------------------------ """ ```

~~Open~~ Closed questions:

Is it OK for after_feed_update_hooks errors to shadow UpdatedFeed?
- Yes; we got the side effects that we expected, but we still signal errors.
Is it OK for after_feed_update_hooks errors to shadow a ParseError?
- No; arguably, after_*_update_hooks should not even run if there was an error getting the feed. ACTION ITEM
- If this isn't enough, we can add later an UpdateErrorGroup to include both the parse and hook errors.
- before_feed_update_hooks may run if we move them to before getting the feed, or if they become able to transform the feed update intent before it is stored (a feed update intent exists even if there was a ParseError, to set last_exception).
Should the feed or entry hooks set Feed.last_exception?
- Not really, per https://github.com/lemon24/reader/issues/68 last_exception is about the feed being broken, not a plugin (different audiences).

lemon24 commented 1 year ago

To do:

[x] update hooks
- [x] new exception types
- [x] wrap exceptions
- [x] better logging
- [x] update_feeds() UpdateHookError handling
- [x] change UpdateResult.value to be UpdateError
- [x] tests
- [x] refactor update hook passing and application
- [x] CLI updates
- [x] docs, versionchanged, changelog
[x] wrap sub-parser/retriever exceptions
[x] wrap session hook exceptions (nothing to do)

lemon24 commented 1 year ago

OK, moving on to unexpected retriever/parser unexpected errors: per https://github.com/lemon24/reader/issues/218#issuecomment-1595691410, they should not prevent updates for remaining feeds; wrapping seems the way to go.

(1) A possible maximal exception hierarchy:

ReaderError (old)
 +--- UpdateError (old)
       +--- ParseError [FeedError, ReaderWarning] (old)
       |     +--- UnexpectedParseError
       +--- RetrieveError [ParseError (temporary)]
             +--- UnexpectedRetrieverError

This has the benefit of distinguishing clearly between the errors caused by a retriever and by those caused by a (sub-)parser, while not making the hierarchy to deep.

Some disadvantages:

RetrieveError must remain a ParseError subclass until 4.0 to maintain backwards compatibility
exceptions raised in reader._parser don't have a specific common parent
there is no exception for non-retriever, non-sub-parser errors (e.g. raised in the meta-Parser ... but we don't have case like this now)

(2) Alternative that doesn't have them, at the expense of deeper hierarchy and slightly confusing naming (ParserError):

ReaderError (old)
 +--- UpdateError (old)
       +--- ParseError [FeedError, ReaderWarning] (old)
             +--- RetrieverError
             |     +--- UnexpectedRetrieverError
             +--- ParserError
                   +--- UnexpectedParserError

The minimal solution is to either (3) add no new classes, or (4) add a single UnexpectedParseError subclass.

An additional complication is that currently, Feed.last_exception contains details about the cause of the ParseError, if any, not the ParseError itself. If ParseError gets subclasses, it becomes imperative to also have details about the exception itself (type, message); arguably, we should have done this from the start (it would also allow capturing UpdateHookError details, if we ever decide to).

Update: I went with solution 3. https://github.com/lemon24/reader/commit/ccf8d26b74dd03fcc98774747cd6efb78cfb10bf

lemon24 commented 1 year ago

Session hook error already get wrapped in ParseError, and we have a test for it; good enough, for now.

lemon24 / reader

Non-ParseError exceptions during update_feeds() prevent updates for the remaining feeds #218