project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0
7.48k stars 2.01k forks source link

[BUG] CASE response message and ACK are not sent #25421

Closed kpark-apple closed 1 year ago

kpark-apple commented 1 year ago

Reproduction steps

Attempt to commission a Nordic device.

Sometimes, after enabling Thread, the commissioning failed and Nordic log showed no response to the Sigma2 and no ACK to its own messages as follows:

I: 20740 [EM]>>> [E:48157r M:131876630] (U) Msg RX from 0:6AFCAE0EBFDBD6D8 [000)
I: 20751 [IN]CASE Server received Sigma1 message. Starting handshake. EC 0x20008
I: 20759 [IN]CASE Server disabling CASE session setups                          
I: 20764 [EM]<<< [E:48157r M:164506657 (Ack:131876630)] (U) Msg TX to 0:0000000)
I: 20776 [IN](U) Sending msg 164506657 to IP address 'UDP:[fd00:c64f:26d5:471c:'
I: 20787 [SC]Received Sigma1 msg                                                
I: 20794 [SC]CASE matched destination ID: fabricIndex 1, NodeID 0x00000000759B99
I: 21027 [EM]<<< [E:48157r M:164506658 (Ack:131876630)] (U) Msg TX to 0:0000000)
I: 21039 [IN](U) Sending msg 164506658 to IP address 'UDP:[fd00:c64f:26d5:471c:'
I: 21051 [SC]Sent Sigma2 msg                                                    
D: 21059 [IN]Received a duplicate message with MessageCounter:131876630 on exchr
I: 21067 [EM]>>> [E:48157r M:131876630] (U) Msg RX from 0:6AFCAE0EBFDBD6D8 [000)
I: 21079 [EM]<<< [E:48157r M:164506659 (Ack:131876630)] (U) Msg TX to 0:0000000)
I: 21092 [IN](U) Sending msg 164506659 to IP address 'UDP:[fd00:c64f:26d5:471c:'
D: 21371 [IN]Received a duplicate message with MessageCounter:131876630 on exchr
I: 21380 [EM]>>> [E:48157r M:131876630] (U) Msg RX from 0:6AFCAE0EBFDBD6D8 [000)
I: 21392 [EM]<<< [E:48157r M:164506660 (Ack:131876630)] (U) Msg TX to 0:0000000)
I: 21405 [IN](U) Sending msg 164506660 to IP address 'UDP:[fd00:c64f:26d5:471c:'
I: 21442 [IN](U) Sending msg 164506658 to IP address 'UDP:[fd00:c64f:26d5:471c:'
I: 21852 [IN](U) Sending msg 164506658 to IP address 'UDP:[fd00:c64f:26d5:471c:'
D: 21964 [IN]Received a duplicate message with MessageCounter:131876630 on exchr
I: 21973 [EM]>>> [E:48157r M:131876630] (U) Msg RX from 0:6AFCAE0EBFDBD6D8 [000)
I: 21985 [EM]<<< [E:48157r M:164506661 (Ack:131876630)] (U) Msg TX to 0:0000000)
I: 21997 [IN](U) Sending msg 164506661 to IP address 'UDP:[fd00:c64f:26d5:471c:'
I: 22395 [IN](U) Sending msg 164506658 to IP address 'UDP:[fd00:c64f:26d5:471c:'
D: 22947 [IN]Received a duplicate message with MessageCounter:131876630 on exchr
I: 22955 [EM]>>> [E:48157r M:131876630] (U) Msg RX from 0:6AFCAE0EBFDBD6D8 [000)
I: 22966 [EM]<<< [E:48157r M:164506662 (Ack:131876630)] (U) Msg TX to 0:0000000)
I: 22979 [IN](U) Sending msg 164506662 to IP address 'UDP:[fd00:c64f:26d5:471c:'
I: 23291 [IN](U) Sending msg 164506658 to IP address 'UDP:[fd00:c64f:26d5:471c:'
E: 24801 [EM]Failed to Send CHIP MessageCounter:164506658 on exchange 48157r se4

However, on the commissioner side, it didn't receive any ACK or Sigma2 response from Nordic.

We setup a sniffer at 1 ft distance from the Matter Nordic device and saw no packets sent from Nordic at the time. The following were the message id 131876630 packets towards Nordic in the sniffer log based on timing:

10342   15:20:15.849375 fd00:c64f:26d5:471c:188a:c239:97a9:ae4  ::5900:1d04:5fbe:e0ee   WireGuard   30  Transport Data, receiver=0x07DC4716, counter=7709228040523077336, datalen=134
10353   15:20:16.168443 fd00:c64f:26d5:471c:188a:c239:97a9:ae4  ::5900:1d04:5fbe:e0ee   WireGuard   30  Transport Data, receiver=0x07DC4716, counter=7709228040523077336, datalen=134
10364   15:20:16.480896 fd00:c64f:26d5:471c:188a:c239:97a9:ae4  ::5900:1d04:5fbe:e0ee   WireGuard   30  Transport Data, receiver=0x07DC4716, counter=7709228040523077336, datalen=134
10393   15:20:17.074022 fd00:c64f:26d5:471c:188a:c239:97a9:ae4  ::5900:1d04:5fbe:e0ee   WireGuard   30  Transport Data, receiver=0x07DC4716, counter=7709228040523077336, datalen=134
10435   15:20:18.051799 fd00:c64f:26d5:471c:188a:c239:97a9:ae4  ::5900:1d04:5fbe:e0ee   WireGuard   30  Transport Data, receiver=0x07DC4716, counter=7709228040523077336, datalen=134

lighting-app was used.

Attached are Nordic log and the sniffer log. Thread network key index was 0 and the network key was D0CE4DD0ADDBFF67969A415901331C78.

Nordic-pairing-failed-new-home-j105-b520.txt

Thread-sniffer-Matter-pairing-fail.pcapng.zip

Bug prevalence

It happens very frequently, like half the time in a tester's setup.

GitHub hash of the SDK that was being used

df4ad9521b1dac98d7965f455fec74ffa270429c or 09ea93670798655d272ab98aaddb1b5ca05fe178

Platform

nrf

Platform Version(s)

PCA10056

Anything else?

No response

kpark-apple commented 1 year ago

Note that a clean build of lighting-app from SHA b89e83be43dde7a79d90823f4bc1279eb53f76de also showed the same issue.

bzbarsky-apple commented 1 year ago

@Damian-Nordic @kkasperczyk-no

ndyck14 commented 1 year ago

This may not be isolated to Nordic. I have seen this on silabs platforms as well.

ndyck14 commented 1 year ago

To clarify my own observations, I've observed this as recently as this morning and potentially as far back as in November.

For me, it is with a fresh reset device being commissioned to Thread over BLE characterized by:

  1. Device receives sigma1
  2. Device responds with sigma2
  3. sigma2 retries are exhausted
  4. Device receives subsequent sigma1 packets and forces solitary acks for all of them (my scenario has asymmetric MRP parameters such that all sigma2 retries exhaust before sigma1 retries arrive)
  5. Entire flow hits fail-safe

I have yet to capture this with an active sniffer (it seems to obey Heisenberg's uncertainty principle) so I cannot deduce if the device is not actually sending Thread packets (sigma2 OR subsequent standalone acks), a BR issue, or state problem with the commissioning side.

bzbarsky-apple commented 1 year ago

@ndyck14 The sniffer logs we have seem to indicate that:

1) The device is receiving packets. 2) The device is not sending packets.

But the device thinks its sending messages successfully....

If that's all true, then we need to do some debugging/logging as to how we can end up with what looks like a successful send from Matter's point of view without the message actually getting sent out.

s-mcclain commented 1 year ago

I have even seen this in my setup as well and this was using WiFi and not Thread (used an ESP32).. Not frequently, but in busy networks. In the Matter PoV it was all sent out successful, but nothing was actually sent out of the radio.

Damian-Nordic commented 1 year ago

@kpark-apple Hey, I couldn't find anything in the logs that would give us a hint why the device would not transmit frames for some periods of time. If that happens frequently in your setup, would it be possible that you build the sample with OpenThread logs to see if any error is reported to or by the OpenThread stack?

Please use the following build parameters to enable OpenThread logging:

$ west build -b nrf52840dk_nrf52840 -- -DCONF_FILE=prj_no_dfu.conf -DCONFIG_OPENTHREAD_DEBUG=y -DCONFIG_OPENTHREAD_LOG_LEVEL_WARN=y  -DCONFIG_LOG_MODE_DEFERRED=y -DCONFIG_LOG_BUFFER_SIZE=8192 -DCONFIG_OPENTHREAD_SOURCES=y

Note that -DCONF_FILE=prj_no_dfu.conf disables the DFU but this may be needed to reduce the flash usage to save more space for additional logging. If that doesn't report useful errors, you can also replace -DCONFIG_OPENTHREAD_LOG_LEVEL_WARN=y with -DCONFIG_OPENTHREAD_LOG_LEVEL_INFO=y to further increase the log verbosity.

LuDuda commented 1 year ago

@kpark-apple @bzbarsky-apple @ndyck14

I have analyzed the PCAP file, and it appears to be a Border Router configuration issue.

To make a long story short, the Thread accessory drops the packet (Sigma2) at its IP layer since it cannot find the route (and thus the next hop). In the attached PCAP (in the failed scenario), the Thread Border Router indicates that it is not the default route for external addresses. Please note that Sigma1 is sent from an external address (unknown prefix for Thread device), and Sigma2 has to be sent to the IPv6 Source address from Sigma1.

image

I took a look at the content of Network Data when pairing the Thread accessory to HomePod Mini today - everything worked - and as you can see below, the default route parameter is set to True by the Thread Border Router. The same thing happens when working with a standard OpenThread Border Router.

image

All in all, the Thread device needs to know where to forward the packet. To simplify a bit, if the IPv6 destination address is external (i.e., not an on-mesh prefix), the node first looks for the external route in the Network Data that matches the IPv6 destination address, or it looks for Border Router TLVs with the Default Route flag set. In the provided PCAP, neither of those conditions were met.

Additionally, I simulated the behavior using OpenThread CLI - and it seems to confirm the hypothesis.

Default Route flag set:

Thread Border Router:

> ipaddr add fd64::1/64   <- simulating external address which was used as Source IPv6 address of Sigma1
Done
> prefix add fc00:1:2:3::/64 paro high     <- register on-mesh prefix and set the "R" (default route) flag
Done
> netdata register
Done

Thread Device:

> ping fd64::1/64
16 bytes from f64:0:0:0:0:0:0:1: icmp_seq=5 hlim=64 time=4ms
1 packets transmitted, 1 packets received. Packet loss = 0.0%. Round-trip min/avg/max = 4/4.0/4 ms.
Done

Default Route flag clear (as in the PCAP):

Thread Border Router:

> ipaddr add fd64::1/64   <- simulating external address which was used as Source IPv6 address of Sigma1
Done
> prefix add fc00:1:2:3::/64 pao high     <- register on-mesh prefix and clear the "R" (default route) flag
Done
> netdata register
Done

Thread Device:

> ping fd64::1/64
1 packets transmitted, 0 packets received. Packet loss = 100.0%.      <- also no frames visible in the Wireshark
Done

To summarize, I would recommend checking why the Thread Border Router does not set the "R" (default route) flag when registering on-mesh prefix - which as I understand happens "sometimes". Alternatively, bind the controller to the IPv6 address known to the Thread Network (in the context of a prefix), or add external route to the prefix which is used in an IPv6 source address of the Sigma1.

Btw. The content of the Thread Network can be found in the MLE Child ID Response message.

AlanLCollins commented 1 year ago

@LuDuda ,I have used BR with default route flag = false, and the end device does not have a problem answering messages using OMR prefix (including Sigma messages). See picture below:

image

The OMR pre-fix in the Child Id response (pkt16) matches the OMR prefix which the BR searches in the CoAp a/aq (pkt44), which is the same pre-fix used for the Matter Sigma1 (pkt54) few packets later. Is there an extra code filter when the end device is resolving the OMR address next hop when trying to send back Sigma2 ? For your reference, these are the addresses assigned to BR and Matter light respectively for the example I'm sharing: image

LuDuda commented 1 year ago

Discussed with @AlanLCollins on Slack.

Your scenario is different, the Sigma1 is sent from the address which is known to the Thread node. In fact, it is exactly on-mesh prefix for the Thread network. Below is the screenshot from the .pcap you gave me and configured 6LoWPAN contexts according to Network Data and MeshCop information. The Thread node does not need to look for the default route, as the address is not "external" like in the original case in this issue.

image

AlanLCollins commented 1 year ago

An idea on how to reproduce this: enable IPv6 in your Wifi AP, so its RA will contain PIO. Then the BR will not include PIO on its own RAs. The Matter controller will assign its global IPv6 using prefix from Wifi AP.

kpark-apple commented 1 year ago

@LuDuda Thank you. Our border router team agreed with your assessment. I think this issue can be closed if there is no objection. If we hit on another no-transmission case without the same border router issue, I'll file a separate issue.

ndyck14 commented 1 year ago

so to add some clarity, the issue here is not entirely due the absense of a default route in the network data. OMR prefixes can be advertised without default route so long as there is a specific route contained in a separate prefix TLV. Although in the original report, this prefix (expect: fd00:c64f:26d5:471c::/64) is also not included: image

For my own observations, unfortunately the data I have points to a separte root cause, but I don't yet have data to definitively point to SDK specific issue.

I think whether this issue should remain open should be therefore up to @s-mcclain but perhaps reporting a new issue if its still valid is a better choice regardless.

s-mcclain commented 1 year ago

@ndyck14 I haven't been able to reproduce it again for some time now so if I do happen to see this problem again at a certain point I'll raise a new issue.

bzbarsky-apple commented 1 year ago

the Thread accessory drops the packet (Sigma2) at its IP layer

@LuDuda Is it possible to propagate this error up the callstack instead of this just being a silent failure?

ndyck14 commented 1 year ago

+1 ^ (github needs a proper emoji for that)

LuDuda commented 1 year ago

@bzbarsky-apple I think that at present, the UDP packet is queued in the IPv6 layer for subsequent processing, which means that the error won't be passed along within the same call stack. Furthermore, several issues may arise in the lower layers, such as 6Lo or MAC (e.g., CSMA/CA failure for 6/9 fragment) which cannot be passed in the same call stack anyway.

I will definitely explore methods to enhance the debugging experience. As it stands, we might not receive accurate logs from the OT stack.

LuDuda commented 1 year ago

FYI @bzbarsky-apple : https://github.com/openthread/openthread/pull/8907

@kpark-apple can we close this one?

kpark-apple commented 1 year ago

Closing as the root cause was BR and an open thread issue was filed to add more diagnostic log.