[BUG] CASE timeout is to long on LIT ICD

ldoering commented 1 week ago

Reproduction steps

After sending/receiving a Sigma1 (but no follow up). a CASE Session is opened but no timeout is ever triggered. This leads to an error when trying to etablish a new CASE Sesssion; [SC] Received error (protocol code 4) during pairing process: src/protocols/secure_channel/CASESession.cpp:2144: CHIP Error 0x000000DB: The Resource is busy and cannot process the request It also prevents an ICD to leave active mode, draining the battery (https://github.com/project-chip/connectedhomeip/issues/35355).

I reproduced the issue with the LIT-ICD App on the Silabs EFR32 Platform, compiled with the Following Parameters: --icd --low-power enable_openthread_cli=false chip_enable_ota_requestor=true chip_persist_subscriptions=false I also enabled the openthread log, so I can see poll atempts to check the interval.

Flash the software, commission the device
Trigger a CASE Session from Chiptool, but exit Chiptool directly after sending the Sigma1
Watch the Logs of the Device. It will never timeout the case session
Wait some time (at least the expected case timeout)
Start Chiptool again, Trigger a CASE Session again
Observer the "Resource is busy" Error message in Chiptool and the device log.

The Issue can also be observed in the reverse way, if the DUT triggered the CASE Session, but I did not manage to do that with an unmodified example.

Bug prevalence

Everytime I ran the listed steps.

GitHub hash of the SDK that was being used

0e3434a0665ab908e27bef9a84d4429ab8f9d93a

Platform

efr32

Platform Version(s)

No response

Anything else?

No response

bzbarsky-apple commented 1 week ago

a CASE Session is opened but no timeout is ever triggered.

This should happen after 30+ seconds, when the device stops waiting for a response to the Sigma2 that it sent.

@ldoering when you "Wait some time (at least the expected case timeout)", what was the actual wait time and where did the "expected case timeout" come from? Can you please attach, not paste, the logs from the device that was the CASE responder in this case?

ldoering commented 1 week ago

I expected the session to fail after 30s (observing the retries after several seconds) and waited multiple minutes without observing a timeout. To get the actual expected timeout I added logging to SetResponseTimeout in ExchangeContext.cpp. Result: [00:00:06.805][info ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000da00 [00:00:06.805][info ][EM] Set Timeout: 45896846 ms

This timeout ist much longer than expected, it it based on the CHIP_DEVICE_CONFIG_ICD_SLOW_POLL_INTERVAL which is set to 3600000ms (1 hour) in the silabs LIT-ICD sample app (examples/lit-icd-app/silabs/openthread.gni), with additional backoff.

What kind of timeouts are expected in the LIT case? Is the prolonged interval a bug or by design? (I assume a bug, since it also prevents the ICD from sleeping)

bzbarsky-apple commented 1 week ago

Hmm. src/messaging/ReliableMessageProtocolConfig.cpp does:

config.mIdleRetransTimeout += ICDConfigurationData::GetInstance().GetSlowPollingInterval();

in GetLocalMRPConfig. And then in CASESession's ComputeRoundTripTimeout we have:

        GetRetransmissionTimeout(localMRPConfig.mActiveRetransTimeout, localMRPConfig.mIdleRetransTimeout,
                                 // Peer will assume we are active, since it's
                                 // responding to our message.
                                 System::SystemClock().GetMonotonicTimestamp(), localMRPConfig.mActiveThresholdTime);

What values are you seeing passed to GetRetransmissionTimeout there? What does it return?

bzbarsky-apple commented 1 week ago

@ldoering And definitely a timeout of 45896846s for a Sigma2 seems like a bug.

ldoering commented 1 week ago

For the LIT-Mode:

[00:00:08.661][info  ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000b238
[00:00:08.661][info  ][EM] S ComputeRoundTripTimeout: processing 30000
[00:00:08.662][info  ][EM] GetRetransmissionTimeout: active 300 idle 500 tslast 2 thresh 4000
[00:00:08.662][info  ][EM] UA GetMessageReceiptTimeout: ourLastActivity 8662
[00:00:08.662][info  ][EM] GetRetransmissionTimeout: active 3000 idle 3602000 tslast 0 thresh 5000
[00:00:08.663][info  ][EM] Set Timeout: 45896846 ms

I also disabled the LIT Mode and received a timeout of 281907 ms for the SIT configuration.

[00:03:37.693][info  ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000b238
[00:03:37.693][info  ][EM] S ComputeRoundTripTimeout: processing 30000
[00:03:37.693][info  ][EM] GetRetransmissionTimeout: active 300 idle 500 tslast 2 thresh 4000
[00:03:37.694][info  ][EM] UA GetMessageReceiptTimeout: ourLastActivity 217694
[00:03:37.694][info  ][EM] GetRetransmissionTimeout: active 3000 idle 17000 tslast 0 thresh 5000
[00:03:37.694][info  ][EM] Set Timeout: 281907 ms

I also attached the full device log for both modes:

ICD SIT Timeout.txt ICD LIT Timeout.txt

bzbarsky-apple commented 1 week ago

@ldoering Is tslast 0 logging the lastActivityTime value inside the GetRetransmissionTimeout call? If so, both 2 and 0 seem like really odd values for that; who is calling it with those values?

I assume UA GetMessageReceiptTimeout means UnathenticatedSession::GetMessageReceiptTimeout, but that should be passing ourLastActivity for lastActivityTime, which is being logged as very much not being 0 in both of the logs above.

bzbarsky-apple commented 1 week ago

Oh, this is interesting. So ourLastActivity is a TimeStamp, but lastActivityTime on GetRetransmissionTimeout is a Timeout.

using Timeout = Milliseconds32;
using Timestamp = Milliseconds64;

Does fixing the type of the lastActivityTime argument of GetRetransmissionTimeout to be Timestamp instead of Timeout change the behavior at all for you?

bzbarsky-apple commented 1 week ago

Ah, this even came up in the review in https://github.com/project-chip/connectedhomeip/pull/33093 but it was missed that this is in fact a serious problem because the actual timestamp might not fit in a 32-bit millisecond counter.... That said, I can't quite reconcile this possible issue with the values for ourLastActivity and tslast logged above, but I don't know what casting and logging format was used there....

ldoering commented 1 week ago

Sorry to confuse you there. I converted the timestamps to 32bit timeout for printing since the lib did not support direct 64bit printing. I assumed that this would be fine, since all tests run directly after reboot, so 32bit should be sufficient to count the runtime in ms. Tslast is timeSinceLastActivity, I thought the absolute time passed in the lastActivityTime would not be too helpful. Since this is run directly after receiving the packet it looked good to me.

To me it looks like a problem to use the LIT idle time to calculate timeouts.

bzbarsky-apple commented 1 week ago

To me it looks like a problem to use the LIT idle time to calculate timeouts.

Sure, but why is it being used?

I wonder whether one issue is that the threshold (5000) is so close to the active interval (4000). What that's saying is:

In active mode we check for activity ever 4 seconds.
After 5 seconds of inactivity we go into idle mode.
In Idle mode we check every 3602000 seconds or something.

So when computing how MRP will work here, assuming we start in active mode.

First packet will get sent.
Second packet will get sent something like 4 seconds later.
If the device has not gotten the packet by now it will transition to idle mode.
The idle interval will be used for computing the backoff on the remaining retransmits.

What the device actually does is that it never transitions to idle mode at all. But that's not what it claims to do, either to the other side or to itself....

@turon @Damian-Nordic @mkardous-silabs we need to figure out what should actually be happening here, because this is clearly broken.

Damian-Nordic commented 6 days ago

The question is if we should treat the active threshold shorter than all the MRP backoffs (using the active interval) as misconfiguration and we should warn a user if such is configured, or we should just use it to cancel a transmission earlier (before retransmitting a message 4 times).

In the latter case, we could make GetRetransmissionTimeout() take a flag indicating if the device operates as LIT and it should be modified as such:

for (uint8_t i = 0; i < CHIP_CONFIG_RMP_DEFAULT_MAX_RETRANS + 1; i++)
{
    System::Clock::Timeout baseInterval;
    if (timeSinceLastActivity + timeout < activityThreshold)
    {
        baseInterval = activeInterval;
    }
    else if (lit)
    {
        break;
    }
    else
    {
         baseInterval = idleInterval;
    }
    timeout += Messaging::ReliableMessageMgr::GetBackoff(baseInterval, i, /* computeMaxPossible */ true);
}

However, then ReliableMessageMgr::CalculateNextRetransTime() would also detect whenever an idle interval is about to be used for LIT and remove the message from the retrans table in such a case.

Btw, System::Clock::Timestamp could be replaced with System::Clock::Timeout as our timers don't handle 64-bit timeouts anyway.

bzbarsky-apple commented 5 days ago

As @Damian-Nordic pointed out on Slack, GetRetransmissionTimeout also needs to match the logic in ReliableMessageMgr::CalculateNextRetransTime which always treats things as "active" for non-initial messages on the exchange, which is very much not what GetRetransmissionTimeout() does.

project-chip / connectedhomeip