pubnub / dart

PubNub Dart SDK
Other
28 stars 15 forks source link

Can't recover from unhandled exceptions in the subscribe loop #63

Closed willbryant closed 2 years ago

willbryant commented 3 years ago

We're using the 4.0 beta.

We've been fielding complaints from our users that they are missing incoming messages. We have error reports for various exceptions, most of which look like they would probably come right if we reconnect.

What we found confusing was that we weren't able to fix the problem by taking actions in our code to do that. Specifically, we have tried both calling reconnect(), and cancelling the subscription and all stream listeners and then subscribing again.

When this happens, I see the State has been updated. log and Resuming subscription., but never see another Starting new loop iteration..

I've been debugging the library code and I believe that the problem is that if no resolution is found by the strategy, then SubscribeLoop's Stream<Envelope> _loop() async* will rethrow:

        if (resolutions == null) {
          _logger.silly('No resolutions found.');
          rethrow;
        }

And the same if the resolution is Fail, though that isn't really central to my concern since we don't want that resolution anyway and so have a retry policy (which I see in the code retries infinitely for a subscribe fiber - fine by me):

          if (resolution is FailResolution) {
            rethrow;

Anyway, throwing/rethrowing out of _loop causes an addError on the stream, which is how we know to try our reconnect and resubscribe.

But _loop() will never re-enter its while (true) loop. If I understand the language spec correctly, the method is called when the stream is listen()ed on, which only happens in the SubscribeLoop constructor:

    var loopStream = _loop();

    loopStream.listen((envelope) {
      _messagesController.add(envelope);
    }, onError: (exception) {
      _messagesController.addError(exception);
    });

And even if we wanted to, we couldn't reenter without recreating the stream, because the stream is closed when the method exits.

But it looks to me like the manager class only ever creates a SubscribeLoop in its constructor:

  Manager(Core core, this.keyset) {
    _loop = SubscribeLoop(core, SubscribeLoopState(keyset));
  }

Therefore, after any exception thrown/rethrown by _loop(), there will never again be any events emitted on any subscription stream?

So if I have the above stuff right, I think any time there is an unhandled exception, subscriptions become permanently unusable for the whole PubNub instance, but subscribe() silently stops working rather than raising an error?

are commented 3 years ago

Hi! Thanks for such an extensive insight! I believe you are correct, this seems to be an oversight in the design of the subscribe loop.

Do you have any preference on how you would want to restart the subscription? Given that after the subscribe loop stumbles upon an unrecoverable exception, I think it would be wise to recreate it - question is if it should happen manually (by calling a method) or automatically (which may interfere with the app logic)?

Another option would be to "close" existing subscriptions and force the user to recreate all of them, but I suppose that could be cumbersome.

Let me know if you have any preference on how you would like to handle situations like this in code, in the meantime I will explore existing ideas and implement them.

Thanks again for reporting this issue! I appreciate it a lot!

willbryant commented 3 years ago

I can see pros and cons, and I have some questions.

Resumption behavior

If you automatically re-create the SubscribeLoop, what happens to resumption?

If the library silently restarts subscriptions, that makes it easier to develop most apps, but then it becomes important to resume from the point that it left off, or else messages will get silently skipped which would be much worse than not automatically handling error conditions.

So I guess my expectation here is that if you recreate the SubscribeLoop, you will reuse the state or at least the timeToken from it. Is that feasible? If not, I think the error should be exposed to the app and the app dev forced to resubscribe so they can reinitialize from scratch.

Whereas if it is feasible, I think it's much nicer than having to re-subscribe.

Multiple subscriptions

If the application needs to manually re-subscribe, we need to consider the case where the app has multiple subscriptions in separate parts of the code. For example, in our app we have one for person-to-person chat and one for programmatic server events. The first one may or may not be instantiated depending on whether the user has gone to that screen yet.

So if the Manager as a whole has become invalid because the SubscribeLoop exited, do both places in the app need to call subscribe() again, or just one?

The SubscribeLoop covers the whole Manager so I believe if createSubscription resulted in the replacement of the SubscribeLoop, that would fix both.

That seems good in some ways, but it feels kind of awkward - people are going to get hard-to-debug conditions if it turns out that one subscription's error handling actually fixes the other too, in apps where they don't both exist all the time (we do a lot of lazy creation in Flutter thanks to the Provider pattern). It's sort of understandable but not really following the principle of least surprise.

Close-on-error behavior

If you need to manually re-subscribe, I would expect that you'd see the streams get closed so it's clear that they are not going to deliver any more messages and need to be recreated. As well as making it more obvious, I think it's more idiomatic and helps people cleanly exit their methods.

But on the other hand, for Dart streams in general, listen gives you a choice of closeOnError behavior, which for us is a little awkward, because it defaults to closeOnError: false and so it would perhaps be a little surprising if we more or less ignored that and closed the subscription anyway when an error occurred.

Dunno what to do about this one.

Fast retry problem

I guess the above points are making me lean towards it being automatic.

One thought though, if you automatically recreate, that's sort of like saying that you never want to use the fail resolution, you always want retry. It's just a higher-level retry, the SubscribeLoop is being recreated rather than just looping around (but as a library user, you won't know or care about this).

I guess my argument would be that in that case it would be better to be explicit about that in the strategy - it already has other logic that changes the retry behavior if it's a subscribe thread. And it's probably better to retry after a delay than immediately spin around?

github-actions[bot] commented 2 years ago

@willbryant this issue is addressed in v4.0.0