Closed ArekBalysNordic closed 1 year ago
@ArekBalysNordic I am not following the failure scenario. Is it possible to attach a log that shows the problem?
The intent of the current timers is that we should:
So there are three ways a resolve can go:
Resolver::HandleAction
, call NextAction
, elapsed < mRequest.GetMinLookupTime()
and we will do the "keep searching" thing as linked above. Then when the 200ms timer fires we will call HandleAction
from the timer callback and report those results to our consumer.Resolver::HandleAction
, call NextAction
, and report those results to our consumer immediately.We should not be waiting until the 45s timer if we have results sitting around waiting to be reported.
@bzbarsky-apple, I've debugged it more today to collect as much info/logs as possible and I can see that behavior occurs only when the Lookup status time is near the kMinLookupTimeMsDefault
value. For example, when the kMinLookupTimeMsDefault
was set to 200, so as it is currently, and we got Checking node lookup status after 199 ms
we waited 45 seconds for the next query. I tried to reproduce it according to my suspicion (it occurs when the time is smaller than kMinLookupTimeMsDefault
) and I changed the kMinLookupTimeMsDefault
value to 400 ms. Then everything worked as you described - The first query occurred after 202 ms, the device did "keep searching", and then the next one occurred immediately after the declared 400 ms and the message to the OTA provider was sent.
There are some logs:
1) Case, when the kMinLookupTimeMsDefault
is set to 200 ms and the checking occurs after 198 ms:
//// Device rebooted, and the new firmware has been applied ///
I: 461 [DL]CHIP task running
I: 1308 [DL]SRP Client was started, detected server: fd82:300d:1f31:56e0:9d48:ffb0:424e:5865
I: 1975 [DIS]Setting operational delegate post init
I: 1981 [DIS]Updating services using commissioning mode 0
I: 1986 [DIS]Advertise operational node 5FABA539C3FD84DE-516BB32D2607E4F2
I: 1993 [DL]advertising srp service: 5FABA539C3FD84DE-516BB32D2607E4F2._matter._tcp
I: 2002 [DIS]Advertise commission parameter vendorID=65521 productID=32768 discriminator=3840/15 cm=0
I: 2011 [DL]advertising srp service: 60E5A449DE838D77._matterc._udp
D: 2018 [SWU]Establishing session to provider node ID 0x0C82FE6CFCFB73FC on fabric index 1
D: 2026 [CTL]OperationalSessionSetup[1:0C82FE6CFCFB73FC]: State change 1 --> 2
I: 2033 [DIS]Resolving 5FABA539C3FD84DE:0C82FE6CFCFB73FC ...
I: 2195 [DIS]Node ID resolved for 5FABA539C3FD84DE:0C82FE6CFCFB73FC
D: 2201 [DIS] Hostname: 0242C0A808030000
D: 2205 [DIS] IP Address #1: fd11:db8:1::3
D: 2209 [DIS] Port: 5540
D: 2211 [DIS] Mrp Interval idle: 300 ms
D: 2215 [DIS] Mrp Interval active: 2000 ms
D: 2218 [DIS] TCP Supported: 1
D: 2221 [DIS]Lookup clearing interface for non LL address
I: 2226 [DIS]UDP:[fd11:db8:1::3]:5540: new best score: 4
I: 2231 [DIS]Checking node lookup status after 198 ms
I: 2236 [DIS]Keeping DNSSD lookup active
I: 47033 [DIS]Checking node lookup status after 45000 ms
D: 47038 [DIS]OperationalSessionSetup[1:0C82FE6CFCFB73FC]: Updating device address to UDP:[fd11:db8:1::3]:5540 while in state 2
D: 47049 [CTL]OperationalSessionSetup[1:0C82FE6CFCFB73FC]: State change 2 --> 3
D: 47057 [IN]SecureSession[0x20004638]: Allocated Type:2 LSID:21705
I: 47063 [SC]Initiating session on local FabricIndex 1 from 0x516BB32D2607E4F2 -> 0x0C82FE6CFCFB73FC
I: 47206 [EM]<<< [E:52970i M:62295202] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
I: 47217 [IN](U) Sending msg 62295202 to IP address 'UDP:[fd11:db8:1::3]:5540'
I: 47226 [SC]Sent Sigma1 msg
/// from that point everything works///
2) Case, when the kMinLookupTimeMsDefault
is set to 400 ms and the checking occurs after 202 ms:
//// Device rebooted, and the new firmware has been applied ///
I: 479 [DL]CHIP task running
I: 504 [ZCL]Cluster OnOff: attribute OnOff set to 0
I: 509 [ZCL]Cluster LevelControl: attribute CurrentLevel set to 254
I: 1253 [DL]SRP Client was started, detected server: fd82:4ffc:1b42:e9b0:7a67:e8f2:44da:8558
I: 1435 [DIS]Setting operational delegate post init
I: 1442 [DIS]Updating services using commissioning mode 0
I: 1447 [DIS]Advertise operational node 23B9886545C01AE8-0000000000000001
I: 1454 [DL]advertising srp service: 23B9886545C01AE8-0000000000000001._matter._tcp
I: 1462 [DIS]Advertise commission parameter vendorID=65521 productID=32773 discriminator=3840/15 cm=0
I: 1472 [DL]advertising srp service: FBD3F8BAC26F2C91._matterc._udp
D: 1478 [SWU]Establishing session to provider node ID 0x0000000000000002 on fabric index 1
D: 1486 [CTL]OperationalSessionSetup[1:0000000000000002]: State change 1 --> 2
I: 1493 [DIS]Resolving 23B9886545C01AE8:0000000000000002 ...
I: 1654 [DIS]Node ID resolved for 23B9886545C01AE8:0000000000000002
D: 1661 [DIS] Hostname: B44506228D5C0000
D: 1665 [DIS] IP Address #1: fd11:db8:1::1
D: 1670 [DIS] Port: 5540
D: 1672 [DIS] Mrp Interval idle: 300 ms
D: 1676 [DIS] Mrp Interval active: 2000 ms
D: 1680 [DIS] TCP Supported: 1
D: 1684 [DIS]Lookup clearing interface for non LL address
I: 1690 [DIS]UDP:[fd11:db8:1::1]:5540: new best score: 4
I: 1695 [DIS]Checking node lookup status after 202 ms
I: 1701 [DIS]Keeping DNSSD lookup active
I: 1893 [DIS]Checking node lookup status after 400 ms
D: 1898 [DIS]OperationalSessionSetup[1:0000000000000002]: Updating device address to UDP:[fd11:db8:1::1]:5540 while in state 2
D: 1909 [CTL]OperationalSessionSetup[1:0000000000000002]: State change 2 --> 3
D: 1917 [IN]SecureSession[0x200044e0]: Allocated Type:2 LSID:64343
I: 1923 [SC]Initiating session on local FabricIndex 1 from 0x0000000000000001 -> 0x0000000000000002
I: 2065 [EM]<<< [E:32672i M:109555583] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
I: 2076 [IN](U) Sending msg 109555583 to IP address 'UDP:[fd11:db8:1::1]:5540'
I: 2086 [SC]Sent Sigma1 msg
/// from that point everything works///
Hmm, It looks like a very rare case and maybe there is not enough time to start the timer or something is scheduled at the wrong time.
OK, so in that case we are coming through OnOperationalNodeResolved
, right?
We call HandleAction
, which calls NextAction(), which returns NodeLookupResult::kKeepSearching
.
This is then supposed to call ReArmTimer
at the end of OnOperationalNodeResolved
and that is supposed to go through all the active lookups, for each one compare the current time to its min/max times, etc.
The key part is that we are using different concepts of "now" for the HandleAction
call and the ReArmTimer
call! And this hardware is pretty slow, so multiple milliseconds pass as we log things. In that first paste you can see these logs:
I: 2033 [DIS]Resolving 5FABA539C3FD84DE:0C82FE6CFCFB73FC ...
...
I: 2231 [DIS]Checking node lookup status after 198 ms
I: 2236 [DIS]Keeping DNSSD lookup active
2231 - 2033 = 198, but 2236 - 2033 = 203, and so by the time we have decided to keep going we have already lost: we are past the 200ms mark, and will not schedule that timer.
What probably needs to happen is that we use the same concept of "now" throughout one of these operations. Let me put together a possible fix.
What probably needs to happen is that we use the same concept of "now" throughout one of these operations
Or fix NextEventTimeout to handle this case right, which is what https://github.com/project-chip/connectedhomeip/pull/25103 does. Please try that and see whether it fixes the issues you were seeing @ArekBalysNordic.
@bzbarsky-apple It seems that your fix works properly. I've added a simple log: FIX occurred
to the "if statement" that you wrote to ensure we fall into it during that specific case, and got the following logs from the device:
I: 478 [DL]CHIP task running
I: 504 [ZCL]Cluster OnOff: attribute OnOff set to 0
I: 508 [ZCL]Cluster LevelControl: attribute CurrentLevel set to 254
I: 699 [DL]SRP Client was started, detected server: fdaf:052d:ba3d:9ed5:869c:8f3d:1f3d:d812
I: 1275 [DIS]Setting operational delegate post init
I: 1282 [DIS]Updating services using commissioning mode 0
I: 1287 [DIS]Advertise operational node 7C36FABB0FCF5689-0000000000000001
I: 1294 [DL]advertising srp service: 7C36FABB0FCF5689-0000000000000001._matter._tcp
I: 1302 [DIS]Advertise commission parameter vendorID=65521 productID=32773 discriminator=3840/15 cm=0
I: 1311 [DL]advertising srp service: 25F37E56572CB2EE._matterc._udp
D: 1318 [SWU]Establishing session to provider node ID 0x0000000000000002 on fabric index 1
D: 1326 [CTL]OperationalSessionSetup[1:0000000000000002]: State change 1 --> 2
I: 1333 [DIS]Resolving 7C36FABB0FCF5689:0000000000000002 ...
I: 1496 [DIS]Node ID resolved for 7C36FABB0FCF5689:0000000000000002
D: 1502 [DIS] Hostname: CC1531077F740000
D: 1506 [DIS] IP Address #1: fd11:db8:1::1
D: 1510 [DIS] Port: 5540
D: 1512 [DIS] Mrp Interval idle: 300 ms
D: 1516 [DIS] Mrp Interval active: 2000 ms
D: 1520 [DIS] TCP Supported: 1
D: 1522 [DIS]Lookup clearing interface for non LL address
I: 1528 [DIS]UDP:[fd11:db8:1::1]:5540: new best score: 4
I: 1533 [DIS]Checking node lookup status after 199 ms
I: 1537 [DIS]Keeping DNSSD lookup active
E: 1544 [DIS]FIX occurred.
I: 1547 [DIS]Checking node lookup status after 214 ms
D: 1552 [DIS]OperationalSessionSetup[1:0000000000000002]: Updating device address to UDP:[fd11:db8:1::1]:5540 while in state 2
D: 1563 [CTL]OperationalSessionSetup[1:0000000000000002]: State change 2 --> 3
D: 1571 [IN]SecureSession[0x200044e0]: Allocated Type:2 LSID:36494
I: 1577 [SC]Initiating session on local FabricIndex 1 from 0x0000000000000001 -> 0x0000000000000002
I: 1720 [EM]<<< [E:31355i M:192571744] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
I: 1766 [IN](U) Sending msg 192571744 to IP address 'UDP:[fd11:db8:1::1]:5540'
I: 1776 [SC]Sent Sigma1 msg
So, I can confirm that this fixes the issue. Thanks for your help! :)
Reproduction steps
Currently, the address resolve's minimum lookup time is set to the default value of 200 ms (https://github.com/project-chip/connectedhomeip/blob/master/src/lib/address_resolve/AddressResolve.h#L144). After performing OTA firmware update the nRF53 device working as MTD/FTD Thread device tries to send the confirmation to an OTA provider and does DNS resolve within ~200 ms. Sometimes, the time of DNS resolve is less than 200 ms, which causes continuing searching here: https://github.com/project-chip/connectedhomeip/blob/master/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp#L80. and the second attempt of sending confirmation to the OTA provider occurs after elapsing the maximum lookup time (45 seconds). That is a quite long time. That issue doesn't exist on SED/SSED Thread device types, because there the resolving time is much higher.
Steps to reproduce:
Log from FTD/MTD nRF53 lighting-app example:
Checking node lookup status after 197 ms
Log from SED nRF53 doorlock-app example:
Checking node lookup status after 1076 ms
Bug prevalence
1-2 times a day
GitHub hash of the SDK that was being used
6ce0f901b50948280baaf56cb1778b58708e0eb2
Platform
nrf
Platform Version(s)
NCS 2.2.0
Anything else?
No response