project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0
7.49k stars 2.01k forks source link

Message is sent to the wrong interface #11120

Closed kghost closed 2 years ago

kghost commented 3 years ago

Problem

My testing box have 2 different active network interfaces:

4: wlp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether d4:d2:52:79:9a:c8 brd ff:ff:ff:ff:ff:ff inet 192.168.90.187/24 brd 192.168.90.255 scope global dynamic noprefixroute wlp2s0 valid_lft 35978sec preferred_lft 35978sec 6: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default link/ether 02:42:76:c8:60:61 brd ff:ff:ff:ff:ff:ff inet 192.168.9.1/24 brd 192.168.9.255 scope global docker0 valid_lft forever preferred_lft forever

The controller sends a message to 192.168.9.1, on the device side the message is received from 192.168.90.187. So the message is sent via the wrong socket.

Expected packet: 192.168.9.1:5541 -> 192.168.9.1:5540 Actual packet: 192.168.90.187:5541 -> 192.168.9.1:5540

we should response the message via the same interface where the message is received.

Scenario

This is triggered by scripts/tests/test_suites.sh

  1. Run ./out/debug/standalone/chip-all-clusters-app
  2. Run ./out/debug/chip-tool pairing qrcode ...
  3. Run ./out/debug/chip-tool test ...

During pairing phrase

Here is where the problem begins, in the following logs, the controller send a packet to 192.168.9.1, this address is discovered by mDNS, but the packet is sent via the wrong interface.

Controller side

[1635412318.928113][1128053:1128078] CHIP:DIS: Attempt to mDNS broadcast failed: ../../src/inet/IPEndPointBasis.cpp:687: OS Error 0x02000065: Network is unreachable [1635412318.928148][1128053:1128078] CHIP:DMG: ICR moving to [AwaitingDe] [1635412318.928159][1128053:1128078] CHIP:EM: Sending Standalone Ack for MessageCounter:6 on exchange 48303i [1635412318.928194][1128053:1128078] CHIP:IN: Prepared encrypted message 0x7fc110ada978 to 0x0000000012344321 of type 0x10 and protocolId (0, 0) on exchange 48303i with MessageCounter:11. [1635412318.928202][1128053:1128078] CHIP:IN: Sending encrypted msg 0x7fc110ada978 with MessageCounter:11 to 0x0000000012344321 at monotonic time: 1337236480 msec [1635412318.928208][1128053:1128078] CHIP:IN: Sending msg to address UDP:fdb9:e90f:246d:53e2:6a9c:214:9625:d001:5540 [1635412318.928249][1128053:1128078] CHIP:EM: Flushed pending ack for MessageCounter:6 on exchange 48303i [1635412318.928254][1128053:1128078] CHIP:EM: ec-- id: 48303i

[1635412318.928354][1128053:1128078] CHIP:IN: Marking old secure session for device 0x0000000012344321 as expired [1635412318.928360][1128053:1128078] CHIP:CTL: OnConnectionExpired was called for unknown device, ignoring it. [1635412318.928365][1128053:1128078] CHIP:IN: New secure session created for device 0x0000000012344321, key 1!! [1635412318.928601][1128053:1128078] CHIP:TOO: OnAddressUpdateComplete: 12344321: ../../src/controller/CHIPDevice.cpp:443: Success [1635412318.928607][1128053:1128078] CHIP:CTL: OperationalDiscoveryComplete for device ID 305414945 [1635412318.928611][1128053:1128078] CHIP:CTL: Enabling CASE session establishment for the device [1635412318.928713][1128053:1128078] CHIP:IN: Marking old secure session for device 0x0000000012344321 as expired [1635412318.928992][1128053:1128078] CHIP:IN: Creating UnauthenticatedSession with PeerAddress UDP:192.168.9.1:5540 [1635412318.929016][1128053:1128078] CHIP:EM: ec++ id: 48304i with delegate @0x7fc11b8e0160 [1635412318.929182][1128053:1128078] CHIP:IN: Prepared plaintext message 0x559821b33010 to 0x0000000000000000 of type 0x30 and protocolId (0, 0) on exchange 48304i with MessageCounter:1568095081. [1635412318.929197][1128053:1128078] CHIP:IN: Sending plaintext msg 0x559821b33010 with MessageCounter:1568095081 to 0x0000000000000000 at monotonic time: 1337236481 msec [1635412318.929205][1128053:1128078] CHIP:IN: Sending msg to address UDP:192.168.9.1:5540 [1635412318.929252][1128053:1128078] CHIP:SC: Sent Sigma1 msg [1635412318.929299][1128053:1128078] CHIP:DIS: Discovered node without a pending query

Device side:

[1635412318.929268][1128023:1128023] CHIP:IN: Received msg from address UDP:192.168.90.187:5541 [1635412318.929283][1128023:1128023] CHIP:IN: Creating UnauthenticatedSession with PeerAddress UDP:192.168.90.187:5541 [1635412318.929304][1128023:1128023] CHIP:EM: Received message of type 0x30 with protocolId (0, 0) and MessageCounter:1568095081 on exchange 48304r [1635412318.929339][1128023:1128023] CHIP:EM: ec++ id: 48304r with delegate @0x555c50c309a8 [1635412318.929346][1128023:1128023] CHIP:EM: Handling via exchange: 48304r, Delegate: 0x0x555c50c309a8 [1635412318.929357][1128023:1128023] CHIP:IN: CASE Server received Sigma1 message. Starting handshake. EC 0x555c50c30fb0 [1635412318.929364][1128023:1128023] CHIP:IN: Marking old secure session for device 0x0000000000000000 as expired [1635412318.929373][1128023:1128023] CHIP:DIS: CASE handshake initiated, closing all BLE Connections [1635412318.929380][1128023:1128023] CHIP:SC: Waiting for Sigma1 msg [1635412318.929385][1128023:1128023] CHIP:IN: CASE Server disabling CASE session setups [1635412318.929392][1128023:1128023] CHIP:SC: Received Sigma1 msg [1635412318.929405][1128023:1128023] CHIP:SC: Peer assigned session key ID 2 [1635412318.929414][1128023:1128023] CHIP:DIS: Fabric (1) loaded from storage. Calling OnFabricRetrievedFromStorage [1635412318.929420][1128023:1128023] CHIP:ZCL: OpCreds: Fabric 0x1 was retrieved from storage. FabricId 0x0000000000000000, NodeId 0x0000000012344321, VendorId 0x000A [1635412318.929426][1128023:1128023] CHIP:ZCL: OpCreds: Call to writeFabricsIntoFabricsListAttribute [1635412318.929460][1128023:1128023] CHIP:ZCL: OpCreds: Writing fabric into attribute store at index 0: fabricId 0x0000000000000000, nodeId 0x0000000012344321 vendorId 0x000A [1635412318.929473][1128023:1128023] CHIP:ZCL: OpCreds: Stored 1 fabrics in fabrics list attribute.

During test phrase

Controller sends message:

[1635412319.074908][1128079:1128093] CHIP:IN: Creating UnauthenticatedSession with PeerAddress UDP:192.168.90.187:5540 [1635412319.074938][1128079:1128093] CHIP:EM: ec++ id: 13075i with delegate @0x7f7da4560160 [1635412319.075071][1128079:1128093] CHIP:IN: Prepared plaintext message 0x55bdb50c3720 to 0x0000000000000000 of type 0x30 and protocolId (0, 0) on exchange 13075i with MessageCounter:2354484034. [1635412319.075085][1128079:1128093] CHIP:IN: Sending plaintext msg 0x55bdb50c3720 with MessageCounter:2354484034 to 0x0000000000000000 at monotonic time: 1337236627 msec [1635412319.075090][1128079:1128093] CHIP:IN: Sending msg to address UDP:192.168.90.187:5540 [1635412319.075136][1128079:1128093] CHIP:SC: Sent Sigma1 msg

Device receives message and sends reply

[1635412319.075169][1128023:1128023] CHIP:IN: Received msg from address UDP:192.168.90.187:5541 [1635412319.075191][1128023:1128023] CHIP:EM: Received message of type 0x30 with protocolId (0, 0) and MessageCounter:2354484034 on exchange 13075r [1635412319.075199][1128023:1128023] CHIP:EM: ec++ id: 13075r with delegate @0x555c50c309a8 [1635412319.075204][1128023:1128023] CHIP:EM: Handling via exchange: 13075r, Delegate: 0x0x555c50c309a8 [1635412319.075213][1128023:1128023] CHIP:IN: CASE Server received Sigma1 message. Starting handshake. EC 0x555c50c30fb0 [1635412319.075217][1128023:1128023] CHIP:DIS: CASE handshake initiated, closing all BLE Connections [1635412319.075224][1128023:1128023] CHIP:SC: Waiting for Sigma1 msg [1635412319.075228][1128023:1128023] CHIP:IN: CASE Server disabling CASE session setups [1635412319.075232][1128023:1128023] CHIP:SC: Received Sigma1 msg [1635412319.075244][1128023:1128023] CHIP:SC: Peer assigned session key ID 4 [1635412319.075249][1128023:1128023] CHIP:DIS: Fabric (1) loaded from storage. Calling OnFabricRetrievedFromStorage [1635412319.075253][1128023:1128023] CHIP:ZCL: OpCreds: Fabric 0x1 was retrieved from storage. FabricId 0x0000000000000000, NodeId 0x0000000012344321, VendorId 0x000A [1635412319.075258][1128023:1128023] CHIP:ZCL: OpCreds: Call to writeFabricsIntoFabricsListAttribute [1635412319.075292][1128023:1128023] CHIP:ZCL: OpCreds: Writing fabric into attribute store at index 0: fabricId 0x0000000000000000, nodeId 0x0000000012344321 vendorId 0x000A [1635412319.075304][1128023:1128023] CHIP:ZCL: OpCreds: Stored 1 fabrics in fabrics list attribute. [1635412319.075339][1128023:1128023] CHIP:DIS: Fabric (1) loaded from storage. Calling OnFabricRetrievedFromStorage [1635412319.075346][1128023:1128023] CHIP:ZCL: OpCreds: Fabric 0x1 was retrieved from storage. FabricId 0x0000000000000000, NodeId 0x0000000012344321, VendorId 0x000A [1635412319.075352][1128023:1128023] CHIP:ZCL: OpCreds: Call to writeFabricsIntoFabricsListAttribute [1635412319.075372][1128023:1128023] CHIP:ZCL: OpCreds: Writing fabric into attribute store at index 0: fabricId 0x0000000000000000, nodeId 0x0000000012344321 vendorId 0x000A [1635412319.075381][1128023:1128023] CHIP:ZCL: OpCreds: Stored 1 fabrics in fabrics list attribute. [1635412319.075681][1128023:1128023] CHIP:EM: Piggybacking Ack for MessageCounter:2354484034 on exchange: 13075r [1635412319.075692][1128023:1128023] CHIP:IN: Prepared plaintext message 0x555c50c30ea8 to 0x0000000000000000 of type 0x31 and protocolId (0, 0) on exchange 13075r with MessageCounter:1414302350. [1635412319.075699][1128023:1128023] CHIP:IN: Sending plaintext msg 0x555c50c30ea8 with MessageCounter:1414302350 to 0x0000000000000000 at monotonic time: 1337236627 msec [1635412319.075705][1128023:1128023] CHIP:IN: Sending msg to address UDP:192.168.90.187:5541 [1635412319.075749][1128023:1128023] CHIP:SC: Sent Sigma2 msg

Controller receives message:

[1635412319.075767][1128079:1128093] CHIP:IN: Received msg from address UDP:192.168.9.1:5540 [1635412319.075780][1128079:1128093] CHIP:IN: Creating UnauthenticatedSession with PeerAddress UDP:192.168.9.1:5540 [1635412319.075794][1128079:1128093] CHIP:EM: Received message of type 0x31 with protocolId (0, 0) and MessageCounter:1414302350 on exchange 13075i

Due to that the message is on a different UnauthenticatedSession, it is unable to match the session with the original exchange, so it is unable to handle the message.

Proposed Solution

kghost commented 3 years ago

@kpschoedel @yufengwangca @bzbarsky-apple @pan-apple

This bug is causing some serious problems, we should fix it before TE7

kghost commented 3 years ago

Confirm that the PeerAddress, fetched from mDNS doesn't contain proper interface id.

cecille commented 3 years ago

Related: #11009

cecille commented 3 years ago

The choice not to specify an interface seems quite deliberate - from the DeviceController OnNodeIdResolved call:

    // Only use the mDNS resolution's InterfaceID for addresses that are IPv6 LLA.
    // For all other addresses, we should rely on the device's routing table to route messages sent.
    // Forcing messages down an InterfaceId might fail. For example, in bridged networks like Thread,
    // mDNS advertisements are not usually received on the same interface the peer is reachable on.
    if (nodeData.mAddress.IsIPv6LinkLocal())
    {
        interfaceId = nodeData.mInterfaceId;
    }

What actually has to be matched in order for the session to match. Ie, how is this address used? Matching on exact address seems fragile - IMO nodes should be free to send messages on whatever interface is most convenient.

From the spec (section 4.14.2.4. Exchange Message Processing)

Upon receipt of a message by the Exchange Layer, the message is matched to an existing Exchange.
A given message is part of an Exchange if it satisfies all the following criteria:

1. The Exchange ID of the message matches the Exchange ID of the Exchange,
2. The message has the I Flag set and its Source Node ID is the Initiator of the Exchange,
OR it does not have the I Flag set and its Destination Node ID is the Initiator of the Exchange.
3. The message was received over a secure session that is associated to the Fabric ID and Node ID
of the Peer Node.

^ Does matching the secure session require matching the ip address?

cecille commented 3 years ago

The mdns resolver does track the interface, but keep in mind that there are addresses that are reachable over multiple interfaces. Right now, Device and SecureSession assume a single address for a peer, and that address is currently whatever address was last in the mdns packet on whatever interface that was received on. If we rely on strict addresses to match sessions, we'll need to track the interfaces, make sure we don't overwrite on receipt of new mdns packets and also track the local address that we're tagging packets with. That sounds overly restrictive. My preference would be to have the Device class (and maybe even the SecureSession) track all the addresses for a device so it can fall back in case of reachability issues. We also have a gap on the other side because there's nothing to restrict the sender from selecting amongst their own addresses for the source. The device side really has no way to know all the available addresses from the controller side.

TL;DR - if there's a way to match exchange contexts without using the IP addresses, we should do that.

kghost commented 3 years ago

Is not about how to match the address, the address we are using to reply is wrong.

Expected behavior:

First session:
Controller 192.168.9.1:5541 -> 192.168.9.1:5540 Device
           192.168.9.1:5541 <- 192.168.9.1:5540
The device cache this session (peer address 192.168.9.1:5541, local interface 192.168.9.1:5540)
Second session:
Controller 192.168.9.1:5541 -> 192.168.9.1:5540 Device
           192.168.9.1:5541 <- 192.168.9.1:5540

Actual behavior

First session:
Controller 192.168.90.187:5541 -> 192.168.9.1:5540 Device
           192.168.90.187:5541 <- 192.168.9.1:5540
The device cache this session (peer address 192.168.90.187:5541, local interface 192.168.9.1:5540)
Second session:
Controller 192.168.90.187:5541 -> 192.168.90.187:5540 Device (The session matches cached session, because the peer address is same)
           192.168.90.187:5541 <- 192.168.9.1:5540 Error: The controller sends a packet to 192.168.90.187:5540 but got a response from 192.168.9.1:5540, it is unable to process the packet.
cecille commented 3 years ago

yeah, but what does it mean to use a "wrong" address. I can't find anywhere in the spec that says we're matching on address, so in theory anything that gets the message there should be ok.

cecille commented 3 years ago

For TE7, I suppose what we can do is have the device set the address only if another address is not already set.

cecille commented 3 years ago

The general challenge with that is that if the device really DOES need to switch its address, we'll be doing the wrong thing.

cecille commented 3 years ago

I suppose my question is really - given that the device does receive the message on the "incorrect" address, why is the message being rejected? What part of the code is taking the address into account? And how does that match to spec?

bzbarsky-apple commented 3 years ago

The part of the spec that matches on address is the matching for unencrypted sessions. That has to match on transport address, because there's nothing else to match on.

The real question here is this: how is the interface (and hence source address) selected based on the destination address in this case?

cecille commented 3 years ago

Perhaps I misunderstood then - is this happening during the case session establishment? Or when we start to send encrypted messages?

kghost commented 3 years ago

Reopen, because it still need a permanent fix

msandstedt commented 2 years ago

Spec PR, https://github.com/CHIP-Specifications/connectedhomeip-spec/issues/4674, is now merged so work on this can resume.