Closed kpark-apple closed 1 year ago
Note that a clean build of lighting-app from SHA b89e83be43dde7a79d90823f4bc1279eb53f76de also showed the same issue.
@Damian-Nordic @kkasperczyk-no
This may not be isolated to Nordic. I have seen this on silabs platforms as well.
To clarify my own observations, I've observed this as recently as this morning and potentially as far back as in November.
For me, it is with a fresh reset device being commissioned to Thread over BLE characterized by:
I have yet to capture this with an active sniffer (it seems to obey Heisenberg's uncertainty principle) so I cannot deduce if the device is not actually sending Thread packets (sigma2 OR subsequent standalone acks), a BR issue, or state problem with the commissioning side.
@ndyck14 The sniffer logs we have seem to indicate that:
1) The device is receiving packets. 2) The device is not sending packets.
But the device thinks its sending messages successfully....
If that's all true, then we need to do some debugging/logging as to how we can end up with what looks like a successful send from Matter's point of view without the message actually getting sent out.
I have even seen this in my setup as well and this was using WiFi and not Thread (used an ESP32).. Not frequently, but in busy networks. In the Matter PoV it was all sent out successful, but nothing was actually sent out of the radio.
@kpark-apple Hey, I couldn't find anything in the logs that would give us a hint why the device would not transmit frames for some periods of time. If that happens frequently in your setup, would it be possible that you build the sample with OpenThread logs to see if any error is reported to or by the OpenThread stack?
Please use the following build parameters to enable OpenThread logging:
$ west build -b nrf52840dk_nrf52840 -- -DCONF_FILE=prj_no_dfu.conf -DCONFIG_OPENTHREAD_DEBUG=y -DCONFIG_OPENTHREAD_LOG_LEVEL_WARN=y -DCONFIG_LOG_MODE_DEFERRED=y -DCONFIG_LOG_BUFFER_SIZE=8192 -DCONFIG_OPENTHREAD_SOURCES=y
Note that -DCONF_FILE=prj_no_dfu.conf
disables the DFU but this may be needed to reduce the flash usage to save more space for additional logging. If that doesn't report useful errors, you can also replace -DCONFIG_OPENTHREAD_LOG_LEVEL_WARN=y
with -DCONFIG_OPENTHREAD_LOG_LEVEL_INFO=y
to further increase the log verbosity.
@kpark-apple @bzbarsky-apple @ndyck14
I have analyzed the PCAP file, and it appears to be a Border Router configuration issue.
To make a long story short, the Thread accessory drops the packet (Sigma2) at its IP layer since it cannot find the route (and thus the next hop). In the attached PCAP (in the failed scenario), the Thread Border Router indicates that it is not the default route for external addresses. Please note that Sigma1 is sent from an external address (unknown prefix for Thread device), and Sigma2 has to be sent to the IPv6 Source address from Sigma1.
I took a look at the content of Network Data when pairing the Thread accessory to HomePod Mini today - everything worked - and as you can see below, the default route parameter is set to True by the Thread Border Router. The same thing happens when working with a standard OpenThread Border Router.
All in all, the Thread device needs to know where to forward the packet. To simplify a bit, if the IPv6 destination address is external (i.e., not an on-mesh prefix), the node first looks for the external route in the Network Data that matches the IPv6 destination address, or it looks for Border Router TLVs with the Default Route flag set. In the provided PCAP, neither of those conditions were met.
Additionally, I simulated the behavior using OpenThread CLI - and it seems to confirm the hypothesis.
Default Route flag set:
Thread Border Router:
> ipaddr add fd64::1/64 <- simulating external address which was used as Source IPv6 address of Sigma1
Done
> prefix add fc00:1:2:3::/64 paro high <- register on-mesh prefix and set the "R" (default route) flag
Done
> netdata register
Done
Thread Device:
> ping fd64::1/64
16 bytes from f64:0:0:0:0:0:0:1: icmp_seq=5 hlim=64 time=4ms
1 packets transmitted, 1 packets received. Packet loss = 0.0%. Round-trip min/avg/max = 4/4.0/4 ms.
Done
Default Route flag clear (as in the PCAP):
Thread Border Router:
> ipaddr add fd64::1/64 <- simulating external address which was used as Source IPv6 address of Sigma1
Done
> prefix add fc00:1:2:3::/64 pao high <- register on-mesh prefix and clear the "R" (default route) flag
Done
> netdata register
Done
Thread Device:
> ping fd64::1/64
1 packets transmitted, 0 packets received. Packet loss = 100.0%. <- also no frames visible in the Wireshark
Done
To summarize, I would recommend checking why the Thread Border Router does not set the "R" (default route) flag when registering on-mesh prefix - which as I understand happens "sometimes". Alternatively, bind the controller to the IPv6 address known to the Thread Network (in the context of a prefix), or add external route to the prefix which is used in an IPv6 source address of the Sigma1.
Btw. The content of the Thread Network can be found in the MLE Child ID Response message.
@LuDuda ,I have used BR with default route flag = false, and the end device does not have a problem answering messages using OMR prefix (including Sigma messages). See picture below:
The OMR pre-fix in the Child Id response (pkt16) matches the OMR prefix which the BR searches in the CoAp a/aq (pkt44), which is the same pre-fix used for the Matter Sigma1 (pkt54) few packets later. Is there an extra code filter when the end device is resolving the OMR address next hop when trying to send back Sigma2 ? For your reference, these are the addresses assigned to BR and Matter light respectively for the example I'm sharing:
Discussed with @AlanLCollins on Slack.
Your scenario is different, the Sigma1 is sent from the address which is known to the Thread node. In fact, it is exactly on-mesh prefix for the Thread network. Below is the screenshot from the .pcap you gave me and configured 6LoWPAN contexts according to Network Data and MeshCop information. The Thread node does not need to look for the default route, as the address is not "external" like in the original case in this issue.
An idea on how to reproduce this: enable IPv6 in your Wifi AP, so its RA will contain PIO. Then the BR will not include PIO on its own RAs. The Matter controller will assign its global IPv6 using prefix from Wifi AP.
@LuDuda Thank you. Our border router team agreed with your assessment. I think this issue can be closed if there is no objection. If we hit on another no-transmission case without the same border router issue, I'll file a separate issue.
so to add some clarity, the issue here is not entirely due the absense of a default route in the network data. OMR prefixes can be advertised without default route so long as there is a specific route contained in a separate prefix TLV. Although in the original report, this prefix (expect: fd00:c64f:26d5:471c::/64
) is also not included:
For my own observations, unfortunately the data I have points to a separte root cause, but I don't yet have data to definitively point to SDK specific issue.
I think whether this issue should remain open should be therefore up to @s-mcclain but perhaps reporting a new issue if its still valid is a better choice regardless.
@ndyck14 I haven't been able to reproduce it again for some time now so if I do happen to see this problem again at a certain point I'll raise a new issue.
the Thread accessory drops the packet (Sigma2) at its IP layer
@LuDuda Is it possible to propagate this error up the callstack instead of this just being a silent failure?
+1 ^ (github needs a proper emoji for that)
@bzbarsky-apple I think that at present, the UDP packet is queued in the IPv6 layer for subsequent processing, which means that the error won't be passed along within the same call stack. Furthermore, several issues may arise in the lower layers, such as 6Lo or MAC (e.g., CSMA/CA failure for 6/9 fragment) which cannot be passed in the same call stack anyway.
I will definitely explore methods to enhance the debugging experience. As it stands, we might not receive accurate logs from the OT stack.
FYI @bzbarsky-apple : https://github.com/openthread/openthread/pull/8907
@kpark-apple can we close this one?
Closing as the root cause was BR and an open thread issue was filed to add more diagnostic log.
Reproduction steps
Attempt to commission a Nordic device.
Sometimes, after enabling Thread, the commissioning failed and Nordic log showed no response to the Sigma2 and no ACK to its own messages as follows:
However, on the commissioner side, it didn't receive any ACK or Sigma2 response from Nordic.
We setup a sniffer at 1 ft distance from the Matter Nordic device and saw no packets sent from Nordic at the time. The following were the message id 131876630 packets towards Nordic in the sniffer log based on timing:
lighting-app was used.
Attached are Nordic log and the sniffer log. Thread network key index was 0 and the network key was D0CE4DD0ADDBFF67969A415901331C78.
Nordic-pairing-failed-new-home-j105-b520.txt
Thread-sniffer-Matter-pairing-fail.pcapng.zip
Bug prevalence
It happens very frequently, like half the time in a tester's setup.
GitHub hash of the SDK that was being used
df4ad9521b1dac98d7965f455fec74ffa270429c or 09ea93670798655d272ab98aaddb1b5ca05fe178
Platform
nrf
Platform Version(s)
PCA10056
Anything else?
No response