project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0
7.56k stars 2.04k forks source link

[BUG] Matter never receives ack from Thread devices after operational discovery #33317

Open fostergorman opened 7 months ago

fostergorman commented 7 months ago

Reproduction steps

Hello, I am working on a PoC for a Matter hub and we want to have the hub also be a Thread Border Router. I have been working with this for a couple weeks now and have managed to get to operational discovery. Matter is able to connect to the device, but never receives an ack from the Thread devices.

This issue does not appear at all with WiFi devices.



Setup:

Raspberry Pi 4b with Ubuntu Server 23.10. OTBR running in docker container Commissioning with Matter (chip-tool)

chip-tool command executed: chip-tool pairing code-thread 69 hex:0e08000000000001000035060004001fffe00708fd38de8fb4627cf00c0402a0f7f8051000112233445566778899aabbccddeeff030e4f70656e54687265616444656d6f0410445f2b5ca6f2a93a55ce570a70efeecb000300000f0208111111112222222201021234 05576556571 --paa-trust-store-path /opt/matter/certs



Debugging:

Currently able to get to the point where the Thread device is commissioned on the Thread network using Matter's chip-tool. Can confirm this by checking the child table within OTBR: ot-ctl child table.

I have verified that there are no errors on the OTBR side. The packets are received and sent just fine.

I see the operational discovery packets using avahi-browse -a.

+  otbr0 IPv6 063787A15C2B0395-0000000000000045             _matter._tcp         local
+  otbr0 IPv4 063787A15C2B0395-0000000000000045             _matter._tcp         local

chip-tool outputs an error, Timeout waiting for mDNS resolution, while on the FindOperational step:

[1714999513.967859][6179:6181] CHIP:CTL: Successfully finished commissioning step 'ThreadNetworkEnable'
[1714999513.967917][6179:6181] CHIP:CTL: Commissioning stage next step: 'ThreadNetworkEnable' -> 'FindOperational'
[1714999513.967978][6179:6181] CHIP:CTL: Performing next commissioning step 'FindOperational'
[1714999513.968027][6179:6181] CHIP:IN: Expiring all sessions for node <0000000000000045, 1>!!
[1714999513.968081][6179:6181] CHIP:CSM: FindOrEstablishSession: PeerId = [1:0000000000000045]
[1714999513.968130][6179:6181] CHIP:CSM: FindOrEstablishSession: No existing OperationalSessionSetup instance found
[1714999513.968199][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 1 --> 2
[1714999513.969669][6179:6181] CHIP:DMG: ICR moving to [AwaitingDe]
[1714999514.168146][6179:6181] CHIP:DIS: Checking node lookup status after 200 ms
[1714999515.439296][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999515.439771][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999515.441354][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999515.441658][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999515.441912][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999515.445875][6179:6181] CHIP:DIS: Lookup clearing interface for non LL address
[1714999515.446133][6179:6181] CHIP:DIS: UDP:[fd11:22::d012:4589:e76d:1f9e%otbr0]:5540: new best score: 3
[1714999515.446193][6179:6181] CHIP:DIS: Checking node lookup status after 1478 ms
[1714999515.446262][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: Updating device address to UDP:[fd11:22::d012:4589:e76d:1f9e]:5540 while in state 2
[1714999515.446317][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 2 --> 3
[1714999515.446930][6179:6181] CHIP:IN: SecureSession[0xffff8001a8f0]: Allocated Type:2 LSID:57955
[1714999515.447048][6179:6181] CHIP:SC: Initiating session on local FabricIndex 1 from 0x000000000001B669 -> 0x0000000000000045
[1714999515.449066][6179:6181] CHIP:EM: <<< [E:60042i S:0 M:121203353] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[fd11:22::d012:4589:e76d:1f9e]:5540] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[1714999515.449546][6179:6181] CHIP:SC: Sent Sigma1 msg
[1714999515.449635][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 3 --> 4
[1714999516.439920][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999516.440184][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999516.441701][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999516.441982][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999516.442226][6179:6181] CHIP:DIS: SRV record already actively processed.
[1714999516.545717][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999519.129858][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999519.130164][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999519.723194][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 1
[1714999521.713218][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999523.843283][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 2
[1714999524.200228][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999524.200499][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999524.326020][6179:6181] CHIP:DIS: Timeout waiting for mDNS resolution.
[1714999526.832197][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999529.318884][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999529.319227][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999529.976450][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 3
[1714999531.901893][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999534.437714][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999534.438058][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999537.020964][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999539.508301][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999539.508640][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999541.061654][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 4
[1714999542.090727][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999544.578227][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999544.578626][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999547.161008][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999549.696521][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999549.696854][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999552.279536][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999554.815223][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999554.815609][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999557.370905][6179:6181] CHIP:EM: Failed to Send CHIP MessageCounter:121203353 on exchange 60042i sendCount: 4 max retries: 4
[1714999557.447253][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999559.884884][6179:6180] CHIP:DL: Indication received, conn = 0xffff880378a0
[1714999559.885164][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16389
[1714999562.468218][6179:6181] CHIP:DL: HandlePlatformSpecificBLEEvent 16387
[1714999563.960047][6179:6181] CHIP:SC: CASESession timed out while waiting for a response from the peer. Current state was 1
[1714999563.960149][6179:6181] CHIP:IN: SecureSession[0xffff8001a8f0]: Released - Type:2 LSID:57955
[1714999563.960250][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 4 --> 2
[1714999563.960324][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 2 --> 1
[1714999563.960402][6179:6181] CHIP:DIS: OperationalSessionSetup:attempts done: 1, attempts left: 2, retry delay 1s, status src/system/SystemLayerImplSelect.cpp:214: Success
[1714999563.960456][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 1 --> 6
[1714999563.960536][6179:6181] CHIP:CTL: Session establishment failed for <0000000000000045, 1>, error: src/protocols/secure_channel/CASESession.cpp:553: CHIP Error 0x00000032: Timeout.  Next retry expected to get a response to Sigma1 or fail within 8 seconds

TLDR: Matter fails operational discovery of Thread devices even though operational discovery, _matter._tcp, is shown through avahi.

Bug prevalence

Commissioning any Thread device

GitHub hash of the SDK that was being used

f52860e4d44a2653d167aa04552414e4c9015718

Platform

raspi

Platform Version(s)

1.2

Anything else?

No response

fostergorman commented 7 months ago

As an update to this, I found something that is off. During operational discovery, the name used for lookup within Matter is not the name that is publishing the operational discovery seen in avahi.

Matter lookup: 66D5DE6615F10EFA-0000000000000045 Name from avahi: 063787A15C2B0395-0000000000000045

I would imagine that if Matter is doing a lookup on the wrong name, it will never find it, so this seems to be the crux of the issue. I did verify that the IPv6 address Matter is using is the correct one of the device on the Thread network.

bzbarsky-apple commented 7 months ago

@fostergorman The log above shows the operational discovery succeeding just fine:

[1714999515.446262][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: Updating device address to UDP:[fd11:22::d012:4589:e76d:1f9e]:5540 while in state 2
[1714999515.446317][6179:6181] CHIP:DIS: OperationalSessionSetup[1:0000000000000045]: State change 2 --> 3
[1714999515.446930][6179:6181] CHIP:IN: SecureSession[0xffff8001a8f0]: Allocated Type:2 LSID:57955
[1714999515.447048][6179:6181] CHIP:SC: Initiating session on local FabricIndex 1 from 0x000000000001B669 -> 0x0000000000000045
[1714999515.449066][6179:6181] CHIP:EM: <<< [E:60042i S:0 M:121203353] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[fd11:22::d012:4589:e76d:1f9e]:5540] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[1714999515.449546][6179:6181] CHIP:SC: Sent Sigma1 msg

But then we never get a response or an ack for that message:

[1714999519.723194][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 1
[1714999523.843283][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 2
[1714999529.976450][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 3
[1714999541.061654][6179:6181] CHIP:EM: Retransmitting MessageCounter:121203353 on exchange 60042i Send Cnt 4
[1714999557.370905][6179:6181] CHIP:EM: Failed to Send CHIP MessageCounter:121203353 on exchange 60042i sendCount: 4 max retries: 4
[1714999563.960047][6179:6181] CHIP:SC: CASESession timed out while waiting for a response from the peer. Current state was 1

So:

  1. Is that the right IP that's being discovered?
  2. If it is, where are the packets being lost?
fostergorman commented 7 months ago

Thank you for your response.

  1. I have verified that the IP that Matter is using is the correct IP through OTBR.
  2. Within the OTBR logs, I'm seeing communication with the thread end device, and no errors indicating that communication has failed.
bzbarsky-apple commented 7 months ago

Well, clearly the Matter device doing the commissioning never got the response to Sigma1. So you should figure out where that response packet is being dropped in your specific network setup. Is it being sent by the Thread end device? If not, start there. If it is, then what happened to it?

bzbarsky-apple commented 7 months ago

(And please fix the PR summary, since operational discovery worked just fine.)