Open ldoering opened 1 week ago
a CASE Session is opened but no timeout is ever triggered.
This should happen after 30+ seconds, when the device stops waiting for a response to the Sigma2 that it sent.
@ldoering when you "Wait some time (at least the expected case timeout)", what was the actual wait time and where did the "expected case timeout" come from? Can you please attach, not paste, the logs from the device that was the CASE responder in this case?
I expected the session to fail after 30s (observing the retries after several seconds) and waited multiple minutes without observing a timeout. To get the actual expected timeout I added logging to SetResponseTimeout in ExchangeContext.cpp. Result: [00:00:06.805][info ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000da00 [00:00:06.805][info ][EM] Set Timeout: 45896846 ms
This timeout ist much longer than expected, it it based on the CHIP_DEVICE_CONFIG_ICD_SLOW_POLL_INTERVAL which is set to 3600000ms (1 hour) in the silabs LIT-ICD sample app (examples/lit-icd-app/silabs/openthread.gni), with additional backoff.
What kind of timeouts are expected in the LIT case? Is the prolonged interval a bug or by design? (I assume a bug, since it also prevents the ICD from sleeping)
Hmm. src/messaging/ReliableMessageProtocolConfig.cpp
does:
config.mIdleRetransTimeout += ICDConfigurationData::GetInstance().GetSlowPollingInterval();
in GetLocalMRPConfig
. And then in CASESession's ComputeRoundTripTimeout
we have:
GetRetransmissionTimeout(localMRPConfig.mActiveRetransTimeout, localMRPConfig.mIdleRetransTimeout,
// Peer will assume we are active, since it's
// responding to our message.
System::SystemClock().GetMonotonicTimestamp(), localMRPConfig.mActiveThresholdTime);
What values are you seeing passed to GetRetransmissionTimeout
there? What does it return?
@ldoering And definitely a timeout of 45896846s for a Sigma2 seems like a bug.
For the LIT-Mode:
[00:00:08.661][info ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000b238
[00:00:08.661][info ][EM] S ComputeRoundTripTimeout: processing 30000
[00:00:08.662][info ][EM] GetRetransmissionTimeout: active 300 idle 500 tslast 2 thresh 4000
[00:00:08.662][info ][EM] UA GetMessageReceiptTimeout: ourLastActivity 8662
[00:00:08.662][info ][EM] GetRetransmissionTimeout: active 3000 idle 3602000 tslast 0 thresh 5000
[00:00:08.663][info ][EM] Set Timeout: 45896846 ms
I also disabled the LIT Mode and received a timeout of 281907 ms for the SIT configuration.
[00:03:37.693][info ][IN] CASE Server received Sigma1 message . Starting handshake. EC 0x2000b238
[00:03:37.693][info ][EM] S ComputeRoundTripTimeout: processing 30000
[00:03:37.693][info ][EM] GetRetransmissionTimeout: active 300 idle 500 tslast 2 thresh 4000
[00:03:37.694][info ][EM] UA GetMessageReceiptTimeout: ourLastActivity 217694
[00:03:37.694][info ][EM] GetRetransmissionTimeout: active 3000 idle 17000 tslast 0 thresh 5000
[00:03:37.694][info ][EM] Set Timeout: 281907 ms
I also attached the full device log for both modes:
@ldoering Is tslast 0
logging the lastActivityTime
value inside the GetRetransmissionTimeout
call? If so, both 2 and 0 seem like really odd values for that; who is calling it with those values?
I assume UA GetMessageReceiptTimeout
means UnathenticatedSession::GetMessageReceiptTimeout
, but that should be passing ourLastActivity
for lastActivityTime
, which is being logged as very much not being 0 in both of the logs above.
Oh, this is interesting. So ourLastActivity
is a TimeStamp, but lastActivityTime
on GetRetransmissionTimeout
is a Timeout
.
using Timeout = Milliseconds32;
using Timestamp = Milliseconds64;
Does fixing the type of the lastActivityTime
argument of GetRetransmissionTimeout
to be Timestamp
instead of Timeout
change the behavior at all for you?
Ah, this even came up in the review in https://github.com/project-chip/connectedhomeip/pull/33093 but it was missed that this is in fact a serious problem because the actual timestamp might not fit in a 32-bit millisecond counter.... That said, I can't quite reconcile this possible issue with the values for ourLastActivity
and tslast
logged above, but I don't know what casting and logging format was used there....
Sorry to confuse you there. I converted the timestamps to 32bit timeout for printing since the lib did not support direct 64bit printing. I assumed that this would be fine, since all tests run directly after reboot, so 32bit should be sufficient to count the runtime in ms. Tslast is timeSinceLastActivity, I thought the absolute time passed in the lastActivityTime would not be too helpful. Since this is run directly after receiving the packet it looked good to me.
To me it looks like a problem to use the LIT idle time to calculate timeouts.
To me it looks like a problem to use the LIT idle time to calculate timeouts.
Sure, but why is it being used?
I wonder whether one issue is that the threshold (5000) is so close to the active interval (4000). What that's saying is:
3602000
seconds or something.So when computing how MRP will work here, assuming we start in active mode.
What the device actually does is that it never transitions to idle mode at all. But that's not what it claims to do, either to the other side or to itself....
@turon @Damian-Nordic @mkardous-silabs we need to figure out what should actually be happening here, because this is clearly broken.
The question is if we should treat the active threshold shorter than all the MRP backoffs (using the active interval) as misconfiguration and we should warn a user if such is configured, or we should just use it to cancel a transmission earlier (before retransmitting a message 4 times).
In the latter case, we could make GetRetransmissionTimeout()
take a flag indicating if the device operates as LIT and it should be modified as such:
for (uint8_t i = 0; i < CHIP_CONFIG_RMP_DEFAULT_MAX_RETRANS + 1; i++)
{
System::Clock::Timeout baseInterval;
if (timeSinceLastActivity + timeout < activityThreshold)
{
baseInterval = activeInterval;
}
else if (lit)
{
break;
}
else
{
baseInterval = idleInterval;
}
timeout += Messaging::ReliableMessageMgr::GetBackoff(baseInterval, i, /* computeMaxPossible */ true);
}
However, then ReliableMessageMgr::CalculateNextRetransTime()
would also detect whenever an idle interval is about to be used for LIT and remove the message from the retrans table in such a case.
Btw, System::Clock::Timestamp
could be replaced with System::Clock::Timeout
as our timers don't handle 64-bit timeouts anyway.
As @Damian-Nordic pointed out on Slack, GetRetransmissionTimeout
also needs to match the logic in ReliableMessageMgr::CalculateNextRetransTime
which always treats things as "active" for non-initial messages on the exchange, which is very much not what GetRetransmissionTimeout()
does.
Reproduction steps
After sending/receiving a Sigma1 (but no follow up). a CASE Session is opened but no timeout is ever triggered. This leads to an error when trying to etablish a new CASE Sesssion; [SC] Received error (protocol code 4) during pairing process: src/protocols/secure_channel/CASESession.cpp:2144: CHIP Error 0x000000DB: The Resource is busy and cannot process the request It also prevents an ICD to leave active mode, draining the battery (https://github.com/project-chip/connectedhomeip/issues/35355).
I reproduced the issue with the LIT-ICD App on the Silabs EFR32 Platform, compiled with the Following Parameters: --icd --low-power enable_openthread_cli=false chip_enable_ota_requestor=true chip_persist_subscriptions=false I also enabled the openthread log, so I can see poll atempts to check the interval.
The Issue can also be observed in the reverse way, if the DUT triggered the CASE Session, but I did not manage to do that with an unmodified example.
Bug prevalence
Everytime I ran the listed steps.
GitHub hash of the SDK that was being used
0e3434a0665ab908e27bef9a84d4429ab8f9d93a
Platform
efr32
Platform Version(s)
No response
Anything else?
No response