project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0
7.45k stars 1.99k forks source link

Figure out interaction timeout behaviors and recovery logic #16202

Closed mrjerryjohns closed 2 years ago

mrjerryjohns commented 2 years ago

In steady state, when a given protocol interaction times out (e.g subscription failure), it's not clear exactly who/what handles recovery from such an occurrence, and exactly what that that recovery should entail.

Some key questions surface:

  1. How does an application get notified of such a failure? Through individual protocol-specific API surfaces (e.g ReadClient::Callback::OnError for reads, or CommandSender::Callback::OnError for commands, etc)? Or through a singular, common pathway?
  2. Who (and this could be multiple) should get notified of such an error?
  3. Are there automatic recovery steps that should be taken by the SDK? Or is it the application's responsibility entirely?
  4. What would the recovery steps entail?

Most of the questions here are from the perspective of the controller/client.

In discussions with @bzbarsky-apple, here's what we settled on as one possible strategy:

  1. Applications shouldn't likely be saddled with needing to aggregate all the various possible error callback sites to handle this. Instead, a single API should likely be added (perhaps even using the SessionRecoveryDelegate as the logical home for this).
  2. That API should likely get routed to the CASESessionManager to appropriate manage subsequent recovery logic.
  3. We should commence operational node discovery upon receiving OnFirstMessageDeliveryFailed. The results of that should be used to update the target address for a given session.
  4. As for the actual recovery, we should implement CASE session resumption, and always do that as the recovery step upon completion of operational node discovery. Resumption is a fairly low-cost operational that doesn't involve heavy crypto, and avoids the penalty of CASE session establishment. The alternatives here are unfortunately quite un-ideal:
    • Attempt to go ahead and re-use the old session for further interactions, and if those fail again, THEN, re-establish CASE. This tries to assume that the reason for the interaction timeout earlier was networking related and that the state on the peer hasn't actually changed (i.e it hasn't rebooted). This approach will increase the latency for recovery however.
    • Always reestablish CASE. This is conservative, but it will result in wasteful attempts to reestablish CASE when they are not needed.
  5. Add an API called UpdateSession in SessionHolder that will atomically, swap out an old session with a new one. This will likely be called by CASESessionManager upon completion of session resumption and will point to the new CASE session instance. This will automatically update session holders to point to the new session transparently.

One of the key benefactors of 5) above will be the re-subscription logic in ReadClient that will continue to work correctly across sessions as they get re-established/resumed.

kghost commented 2 years ago

As for the actual recovery, we should implement CASE session resumption, and always do that as the recovery step upon completion of operational node discovery. Resumption is a fairly low-cost operational that doesn't involve heavy crypto, and avoids the penalty of CASE session establishment. The alternatives here are unfortunately quite un-ideal:

I don't thing session resumption can fix any network related errors.

In case of device rebooted, if we implemented session persistent, the session resumption is able to reset the message counter. So we should implement session persistent first. W/o session persistent, the session resumption is not helpful at all.

Add an API called UpdateSession in SessionHolder that will atomically, swap out an old session with a new one. This will likely be called by CASESessionManager upon completion of session resumption and will point to the new CASE session instance. This will automatically update session holders to point to the new session transparently.

When a session is resumed, a new session with a new session key id is created, this means the existing exchanges will magically jump to the new session. In contract the spec says an exchange SHALL be bound to exactly one underlying session that will transport all associated Exchange messages for the life of that exchange.

Another solution is using a OnSessionResumed event, handled by ReadClient, so it can be bound to the new resumed session. Because the session resumption only works when a node is rebooted, we should invalid all existing exchanges.

mrjerryjohns commented 2 years ago

I don't thing session resumption can fix any network related errors.

It doesn't, but we don't have any way to tell why we saw a timeout. It could be because of a reboot on the peer, a change in IP address, or because of a momentary network disturbance. There's no way to distinguish between these cases, so we have to take a conservative approach and re-do discovery + CASE session establishment/resumption.

So we should implement session persistent first. W/o session persistent, the session resumption is not helpful at all.

Sure. When I stated session resumption, I was implicitly denoting all dependent work items to make that feature possible. Sounds like there's more items needed here to make happen...

When a session is resumed, a new session with a new session key id is created, this means the existing exchanges will magically jump to the new session.

Is this because we have a SessionHolder in the ExchangeContext? In that case, shouldn't the exchange be having a SessionHandle instead?

Another solution is using a OnSessionResumed event,

We have one mechanism to update sessions in the SessionHolder construct. Why would we make yet another construct do this?

Because the session resumption only works when a node is rebooted

So, if a node has not rebooted and still has a valid prior session, what's the behavior of a CASE session resumption message is sent to it?

mrjerryjohns commented 2 years ago

Based on discussions summarized here, we'll pivot to doing the following for the IM:

Fix-up the IM re-subscription logic to permit both applications managing the session setup on each ensuing re-subscription attempt, as well as the IM possibly handling that on its own by talking directly to CASESessionManager. IM re-subscription logic should be tweaked such that the expectations of provided a policy callback should be updated to now be responsible for arming the timer as well as re-establishing the session. Default implementation can arm the timer + talk to CASESessionManager to establish new session. Applications can over-ride that and do this themselves, and call directly into the IM to re-subscribe with a given session handle.