[DNSSD] Address resolve minimum lookup time seems to be too small for MTD/FTD Thread devices.

ArekBalysNordic commented 1 year ago

Reproduction steps

Currently, the address resolve's minimum lookup time is set to the default value of 200 ms (https://github.com/project-chip/connectedhomeip/blob/master/src/lib/address_resolve/AddressResolve.h#L144). After performing OTA firmware update the nRF53 device working as MTD/FTD Thread device tries to send the confirmation to an OTA provider and does DNS resolve within ~200 ms. Sometimes, the time of DNS resolve is less than 200 ms, which causes continuing searching here: https://github.com/project-chip/connectedhomeip/blob/master/src/lib/address_resolve/AddressResolve_DefaultImpl.cpp#L80. and the second attempt of sending confirmation to the OTA provider occurs after elapsing the maximum lookup time (45 seconds). That is a quite long time. That issue doesn't exist on SED/SSED Thread device types, because there the resolving time is much higher.

How the minimum and maximum values were calculated? Why the minimum value is set to 200 ms?
It is possible to parametrize the minimum and maximum lookup values and set them to various values depending on the platform?

Steps to reproduce:

Build the lighting-app example for the Nrfconnect platform nRF53 device.
Flash nRF53 device.
Perform OTA firmware update.
wait for reboot and images swap.
The device tries to send a confirmation to the OTA requestor - the issue occurs.

Log from FTD/MTD nRF53 lighting-app example: Checking node lookup status after 197 ms

Log from SED nRF53 doorlock-app example: Checking node lookup status after 1076 ms

Bug prevalence

1-2 times a day

GitHub hash of the SDK that was being used

6ce0f901b50948280baaf56cb1778b58708e0eb2

Platform

nrf

Platform Version(s)

NCS 2.2.0

Anything else?

No response

bzbarsky-apple commented 1 year ago

@ArekBalysNordic I am not following the failure scenario. Is it possible to attach a log that shows the problem?

The intent of the current timers is that we should:

Not even try to report results before the min time has elapsed, because we might as well wait for more complete results. This only makes sense if the min time is fairly short. 200ms was picked ... I am not sure how. @andy31415
Stop trying to report results once the max time has elapsed; treat it as lookup failure. This max time used to be 15s, but that was too short, and it was raised to 45s.

So there are three ways a resolve can go:

Any results we get before 200ms have elapsed we will end up in Resolver::HandleAction, call NextAction, elapsed < mRequest.GetMinLookupTime() and we will do the "keep searching" thing as linked above. Then when the 200ms timer fires we will call HandleAction from the timer callback and report those results to our consumer.
Any results we get after the 200ms timer has elapsed we will end up in Resolver::HandleAction, call NextAction, and report those results to our consumer immediately.
If we reach the 45s timer with no results, we fail out.

We should not be waiting until the 45s timer if we have results sitting around waiting to be reported.

ArekBalysNordic commented 1 year ago

@bzbarsky-apple, I've debugged it more today to collect as much info/logs as possible and I can see that behavior occurs only when the Lookup status time is near the kMinLookupTimeMsDefault value. For example, when the kMinLookupTimeMsDefault was set to 200, so as it is currently, and we got Checking node lookup status after 199 ms we waited 45 seconds for the next query. I tried to reproduce it according to my suspicion (it occurs when the time is smaller than kMinLookupTimeMsDefault) and I changed the kMinLookupTimeMsDefault value to 400 ms. Then everything worked as you described - The first query occurred after 202 ms, the device did "keep searching", and then the next one occurred immediately after the declared 400 ms and the message to the OTA provider was sent.

There are some logs: 1) Case, when the kMinLookupTimeMsDefault is set to 200 ms and the checking occurs after 198 ms:

//// Device rebooted, and the new firmware has been applied ///

I: 461 [DL]CHIP task running
I: 1308 [DL]SRP Client was started, detected server: fd82:300d:1f31:56e0:9d48:ffb0:424e:5865
I: 1975 [DIS]Setting operational delegate post init
I: 1981 [DIS]Updating services using commissioning mode 0
I: 1986 [DIS]Advertise operational node 5FABA539C3FD84DE-516BB32D2607E4F2
I: 1993 [DL]advertising srp service: 5FABA539C3FD84DE-516BB32D2607E4F2._matter._tcp
I: 2002 [DIS]Advertise commission parameter vendorID=65521 productID=32768 discriminator=3840/15 cm=0
I: 2011 [DL]advertising srp service: 60E5A449DE838D77._matterc._udp
D: 2018 [SWU]Establishing session to provider node ID 0x0C82FE6CFCFB73FC on fabric index 1
D: 2026 [CTL]OperationalSessionSetup[1:0C82FE6CFCFB73FC]: State change 1 --> 2
I: 2033 [DIS]Resolving 5FABA539C3FD84DE:0C82FE6CFCFB73FC ...
I: 2195 [DIS]Node ID resolved for 5FABA539C3FD84DE:0C82FE6CFCFB73FC
D: 2201 [DIS]    Hostname: 0242C0A808030000
D: 2205 [DIS]    IP Address #1: fd11:db8:1::3
D: 2209 [DIS]    Port: 5540
D: 2211 [DIS]    Mrp Interval idle: 300 ms
D: 2215 [DIS]    Mrp Interval active: 2000 ms
D: 2218 [DIS]    TCP Supported: 1
D: 2221 [DIS]Lookup clearing interface for non LL address
I: 2226 [DIS]UDP:[fd11:db8:1::3]:5540: new best score: 4
I: 2231 [DIS]Checking node lookup status after 198 ms
I: 2236 [DIS]Keeping DNSSD lookup active
I: 47033 [DIS]Checking node lookup status after 45000 ms
D: 47038 [DIS]OperationalSessionSetup[1:0C82FE6CFCFB73FC]: Updating device address to UDP:[fd11:db8:1::3]:5540 while in state 2
D: 47049 [CTL]OperationalSessionSetup[1:0C82FE6CFCFB73FC]: State change 2 --> 3
D: 47057 [IN]SecureSession[0x20004638]: Allocated Type:2 LSID:21705
I: 47063 [SC]Initiating session on local FabricIndex 1 from 0x516BB32D2607E4F2 -> 0x0C82FE6CFCFB73FC
I: 47206 [EM]<<< [E:52970i M:62295202] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
I: 47217 [IN](U) Sending msg 62295202 to IP address 'UDP:[fd11:db8:1::3]:5540'
I: 47226 [SC]Sent Sigma1 msg

/// from that point everything works///

2) Case, when the kMinLookupTimeMsDefault is set to 400 ms and the checking occurs after 202 ms:

//// Device rebooted, and the new firmware has been applied ///

I: 479 [DL]CHIP task running
I: 504 [ZCL]Cluster OnOff: attribute OnOff set to 0
I: 509 [ZCL]Cluster LevelControl: attribute CurrentLevel set to 254
I: 1253 [DL]SRP Client was started, detected server: fd82:4ffc:1b42:e9b0:7a67:e8f2:44da:8558
I: 1435 [DIS]Setting operational delegate post init
I: 1442 [DIS]Updating services using commissioning mode 0
I: 1447 [DIS]Advertise operational node 23B9886545C01AE8-0000000000000001
I: 1454 [DL]advertising srp service: 23B9886545C01AE8-0000000000000001._matter._tcp
I: 1462 [DIS]Advertise commission parameter vendorID=65521 productID=32773 discriminator=3840/15 cm=0
I: 1472 [DL]advertising srp service: FBD3F8BAC26F2C91._matterc._udp
D: 1478 [SWU]Establishing session to provider node ID 0x0000000000000002 on fabric index 1
D: 1486 [CTL]OperationalSessionSetup[1:0000000000000002]: State change 1 --> 2
I: 1493 [DIS]Resolving 23B9886545C01AE8:0000000000000002 ...
I: 1654 [DIS]Node ID resolved for 23B9886545C01AE8:0000000000000002
D: 1661 [DIS]   Hostname: B44506228D5C0000
D: 1665 [DIS]   IP Address #1: fd11:db8:1::1
D: 1670 [DIS]   Port: 5540
D: 1672 [DIS]   Mrp Interval idle: 300 ms
D: 1676 [DIS]   Mrp Interval active: 2000 ms
D: 1680 [DIS]   TCP Supported: 1
D: 1684 [DIS]Lookup clearing interface for non LL address
I: 1690 [DIS]UDP:[fd11:db8:1::1]:5540: new best score: 4
I: 1695 [DIS]Checking node lookup status after 202 ms
I: 1701 [DIS]Keeping DNSSD lookup active
I: 1893 [DIS]Checking node lookup status after 400 ms
D: 1898 [DIS]OperationalSessionSetup[1:0000000000000002]: Updating device address to UDP:[fd11:db8:1::1]:5540 while in state 2
D: 1909 [CTL]OperationalSessionSetup[1:0000000000000002]: State change 2 --> 3
D: 1917 [IN]SecureSession[0x200044e0]: Allocated Type:2 LSID:64343
I: 1923 [SC]Initiating session on local FabricIndex 1 from 0x0000000000000001 -> 0x0000000000000002
I: 2065 [EM]<<< [E:32672i M:109555583] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
I: 2076 [IN](U) Sending msg 109555583 to IP address 'UDP:[fd11:db8:1::1]:5540'
I: 2086 [SC]Sent Sigma1 msg

/// from that point everything works///

Hmm, It looks like a very rare case and maybe there is not enough time to start the timer or something is scheduled at the wrong time.

bzbarsky-apple commented 1 year ago

OK, so in that case we are coming through OnOperationalNodeResolved, right?

We call HandleAction, which calls NextAction(), which returns NodeLookupResult::kKeepSearching.

This is then supposed to call ReArmTimer at the end of OnOperationalNodeResolved and that is supposed to go through all the active lookups, for each one compare the current time to its min/max times, etc.

The key part is that we are using different concepts of "now" for the HandleAction call and the ReArmTimer call! And this hardware is pretty slow, so multiple milliseconds pass as we log things. In that first paste you can see these logs:

I: 2033 [DIS]Resolving 5FABA539C3FD84DE:0C82FE6CFCFB73FC ...
...
I: 2231 [DIS]Checking node lookup status after 198 ms
I: 2236 [DIS]Keeping DNSSD lookup active

2231 - 2033 = 198, but 2236 - 2033 = 203, and so by the time we have decided to keep going we have already lost: we are past the 200ms mark, and will not schedule that timer.

What probably needs to happen is that we use the same concept of "now" throughout one of these operations. Let me put together a possible fix.

bzbarsky-apple commented 1 year ago

What probably needs to happen is that we use the same concept of "now" throughout one of these operations

Or fix NextEventTimeout to handle this case right, which is what https://github.com/project-chip/connectedhomeip/pull/25103 does. Please try that and see whether it fixes the issues you were seeing @ArekBalysNordic.

ArekBalysNordic commented 1 year ago

@bzbarsky-apple It seems that your fix works properly. I've added a simple log: FIX occurred to the "if statement" that you wrote to ensure we fall into it during that specific case, and got the following logs from the device:

I: 478 [DL]CHIP task running
I: 504 [ZCL]Cluster OnOff: attribute OnOff set to 0
I: 508 [ZCL]Cluster LevelControl: attribute CurrentLevel set to 254
I: 699 [DL]SRP Client was started, detected server: fdaf:052d:ba3d:9ed5:869c:8f3d:1f3d:d812
I: 1275 [DIS]Setting operational delegate post init
I: 1282 [DIS]Updating services using commissioning mode 0
I: 1287 [DIS]Advertise operational node 7C36FABB0FCF5689-0000000000000001
I: 1294 [DL]advertising srp service: 7C36FABB0FCF5689-0000000000000001._matter._tcp
I: 1302 [DIS]Advertise commission parameter vendorID=65521 productID=32773 discriminator=3840/15 cm=0
I: 1311 [DL]advertising srp service: 25F37E56572CB2EE._matterc._udp
D: 1318 [SWU]Establishing session to provider node ID 0x0000000000000002 on fabric index 1
D: 1326 [CTL]OperationalSessionSetup[1:0000000000000002]: State change 1 --> 2
I: 1333 [DIS]Resolving 7C36FABB0FCF5689:0000000000000002 ...
I: 1496 [DIS]Node ID resolved for 7C36FABB0FCF5689:0000000000000002
D: 1502 [DIS]   Hostname: CC1531077F740000
D: 1506 [DIS]   IP Address #1: fd11:db8:1::1
D: 1510 [DIS]   Port: 5540
D: 1512 [DIS]   Mrp Interval idle: 300 ms
D: 1516 [DIS]   Mrp Interval active: 2000 ms
D: 1520 [DIS]   TCP Supported: 1
D: 1522 [DIS]Lookup clearing interface for non LL address
I: 1528 [DIS]UDP:[fd11:db8:1::1]:5540: new best score: 4
I: 1533 [DIS]Checking node lookup status after 199 ms
I: 1537 [DIS]Keeping DNSSD lookup active

E: 1544 [DIS]FIX occurred.

I: 1547 [DIS]Checking node lookup status after 214 ms
D: 1552 [DIS]OperationalSessionSetup[1:0000000000000002]: Updating device address to UDP:[fd11:db8:1::1]:5540 while in state 2
D: 1563 [CTL]OperationalSessionSetup[1:0000000000000002]: State change 2 --> 3
D: 1571 [IN]SecureSession[0x200044e0]: Allocated Type:2 LSID:36494
I: 1577 [SC]Initiating session on local FabricIndex 1 from 0x0000000000000001 -> 0x0000000000000002
I: 1720 [EM]<<< [E:31355i M:192571744] (U) Msg TX to 0:0000000000000000 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
I: 1766 [IN](U) Sending msg 192571744 to IP address 'UDP:[fd11:db8:1::1]:5540'
I: 1776 [SC]Sent Sigma1 msg

So, I can confirm that this fixes the issue. Thanks for your help! :)

project-chip / connectedhomeip