project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0
7.5k stars 2.01k forks source link

Adding multiple admins fails with CASE error on TE6 branch #10337

Closed cjandhyala closed 2 years ago

cjandhyala commented 3 years ago

Adding 2nd admin(2nd python control) fails On TE6 branch (82bcd4e36f4856a4216cb76a16116faeec40e223).

Steps to reproduce :

  1. Launch lighting app on a raspy
  2. commission the lighting app with first python controller (running on raspi) over BLE
  3. Open commissioning window
  4. add the 2nd admin (2nd python controller running on different raspi than 1st controller) using connect -IP cmd
  5. 2nd commissioner resolves the node, but fails to establish CASE session. (This setup uses 3 different raspis , one for accessory, one for 1st controller , one 2nd controller)

Attached logs, 1st_controller.txt 2ndController.txt accessory_log.txt

bluebin14 commented 3 years ago

It works for me, however I used our own BLE/Thread device instead of Lighting app. It looks like an mdns/routing problem, can you try with both controllers on same raspi as a test?

cecille commented 3 years ago

It doesn't seem like an mdns problem to me - if you look at line 370/371 on the 2ndController.txt, you can see that avahi resolves the address, and I confirmed with Chaitanya that it's as expected. However, there is an error on line 386 that's concerning - have you seen OnMessageReceived failed, err = ../../src/messaging/ExchangeMgr.cpp:298: CHIP Error 0x0000002A: Invalid message type happen before? Do you know what that might be about?

jmartinez-silabs commented 3 years ago

I see a lot of flakyness on our platform too with multi-admin (EFR32) on both TE6 branch and ToT. The commissionning flow fails for the 2nd admin

bluebin14 commented 3 years ago

controller1 resolves ipv6 address fd7b:a81b:9770:0:dea6:32ff:fe8d:7072 but controller2 gets given ipv4 address 192.168.4.223. Then avahi resolves ipv4 address 192.168.6.222 which is again different. My setup uses ipv6 exclusively and it works.

bzbarsky-apple commented 3 years ago

So it looks like controller2 correctly resolves the device. It sends a Sigma1 (line 383 of 2ndController.txt). The device receives it (line 2358 of accessory_log.txt) and sends a Sigma2 (line 2394). The commissioner receives the Sigma2 but (starting line 384 in 2ndController.txt):

[1633646951.870298][3191:3199] CHIP:EM: Received message of type 0x31 with vendorId 0x0000 and protocolId 0x0000 on exchange 13378i
[1633646951.870406][3191:3199] CHIP:EM: Handling via exchange: 13378i, Delegate: 0x(nil)
[1633646951.870469][3191:3199] CHIP:EM: OnMessageReceived failed, err = ../../src/messaging/ExchangeMgr.cpp:298: CHIP Error 0x0000002A: Invalid message type

So most pressingly, the exchange delegate is null, so the Sigma2 message is not delivered to the CASE state machine and then the handshake fails.

I am also not sure what the deal is with the "invalid message type" message there. That means this check tested true:

        if (ec->IsEncryptionRequired() != packetHeader.GetFlags().Has(Header::FlagValues::kEncryptedMessage))

which is pretty odd in this case.

Anyway, after this the commissioner retransmits Sigma1 a few times, and the other side retransmits Sigma2 a few times, but they all get dropped as duplicates. And in any case, on the server side we had, 5 seconds after sending Sigma2 (line 2396 in the accessory log):

[1633646956.576393][3488:3488] CHIP:SVR: Commissioning failed with error ../../examples/lighting-app/linux/third_party/connectedhomeip/src/platform/DeviceControlServer.cpp:48: CHIP Error 0x00000032: Timeout
[1633646956.576453][3488:3488] CHIP:SVR: Commissioning failed (attempt 1): ../../examples/lighting-app/linux/third_party/connectedhomeip/src/platform/DeviceControlServer.cpp:48: CHIP Error 0x00000032: Timeout

which happens before the retransmits.

But the big question is what closed the CASE establishment exchange on 2ndController after sending Sigma1, so that the delegate ended up null. @pan-apple any ideas?

bluebin14 commented 3 years ago

Here is what closed the CASE exchange: In accessory_log.txt: At 1633646951: Arm Fail-safe command with ExpiryLengthSeconds 5, meaning 5 seconds max till receiving CommissioningComplete command, At 1633646956 i.e. 5 seconds later: CHIP Error 0x00000032: Timeout Looks like 5 seconds is not enough for commissioning.

tcarmelveilleux commented 3 years ago

5 seconds is definitely no enough! CASE can take longer than that on very constrained CPUs

cecille commented 3 years ago

K, so let's bump that up and see if it resolves things - that was set pre-case.

bzbarsky-apple commented 3 years ago

Here is what closed the CASE exchange: In accessory_log.txt:

That's on the server, though. But the exchange is already closed on the client (commissioner) when we get the Sigma2....

bzbarsky-apple commented 3 years ago

That said, @cjandhyala could you try the change in #10357 and see if that helps?

jmartinez-silabs commented 3 years ago

Hi, I cherry-picked #10357 on the test_event_6 branch (82bcd4e) for the two controllers.

Multi-admin still often fails on my end for the second admin. I use the same setup as describe in this ticket except that my end device node is a EFR32 lighting app

I mostly see two different errors when it fails

1) controller # 2 give me the following error at the OpCSRResponse

Sigma3_TO_Controller#1.txt Sigma3_TO_EFR32.txt Sigma3_TO_Controller#2.txt

Received certificate signing request from the device
[1634068509.841325][3044:3052] CHIP:CTL: Getting certificate chain for the device from the issuer
[1634068509.841372][3044:3052] CHIP:CTL: Verifying Certificate Signing Request
[1634068509.841956][3044:3052] CHIP:CR:  ssl err  elliptic curve routines EC_POINT_set_affine_coordinates point is not on curve

[1634068509.842054][3044:3052] CHIP:CR:  ssl err  elliptic curve routines o2i_ECPublicKey EC lib

[1634068509.842138][3044:3052] CHIP:CR:  ssl err  elliptic curve routines eckey_pub_decode decode error

[1634068509.842217][3044:3052] CHIP:CR:  ssl err  x509 certificate routines x509_pubkey_decode public key decode error

[1634068509.842292][3044:3052] CHIP:CTL: Failed to process the certificate signing request
[1634068509.842371][3044:3052] CHIP:CTL: SyncSetKeyValue on StartKeyID
[1634068509.842531][3044:3052] CHIP:CTL: SyncSetKeyValue on PairedDevicec9
Failed to establish secure session to device: 172

2) Controller # 2 timeout in on packet after sending SigmaR3 msg

Sent SigmaR3 msg
[1634070546.199595][3124:3132] CHIP:IN: New secure session created for device 0x00000000000000CA, key 3!!
[1634070546.199758][3124:3132] CHIP:CTL: Calling commissioning complete
[1634070546.199841][3124:3132] CHIP:DMG: ICR moving to [Initialize]
[1634070546.199961][3124:3132] CHIP:DMG: ICR moving to [AddCommand]
[1634070546.200395][3124:3132] CHIP:IN: Secure message was encrypted: Msg ID 0
[1634070546.200421][3124:3132] CHIP:IN: Build encrypted message 0x36f60d90 to 0x00000000000000CA of type 8 and protocolId 1 on exchange 6460r with MessageCounter 0.
[1634070546.200555][3124:3132] CHIP:IN: Sending encrypted msg 0x36f60d90 to 0x00000000000000CA at utc time: 12316362 msec
[1634070546.200580][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070546.200705][3124:3132] CHIP:DMG: ICR moving to [   Sending]
[1634070551.187118][3124:3132] CHIP:IN: Sending plaintext msg 0x36f60d78 to 0x0000000000000000 at utc time: 12321348 msec
[1634070551.187218][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.187434][3124:3132] CHIP:EM: Retransmit MessageCounter:A5298C7A Send Cnt 1
[1634070551.187521][3124:3132] CHIP:IN: Sending encrypted msg 0x36f60d90 to 0x00000000000000CA at utc time: 12321349 msec
[1634070551.187572][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.187852][3124:3132] CHIP:EM: Retransmit MessageCounter:00000000 Send Cnt 1
[1634070551.440478][3124:3132] CHIP:IN: Sending plaintext msg 0x36f60d78 to 0x0000000000000000 at utc time: 12321602 msec
[1634070551.440554][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.440724][3124:3132] CHIP:EM: Retransmit MessageCounter:A5298C7A Send Cnt 2
[1634070551.440808][3124:3132] CHIP:IN: Sending encrypted msg 0x36f60d90 to 0x00000000000000CA at utc time: 12321602 msec
[1634070551.440896][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.441142][3124:3132] CHIP:EM: Retransmit MessageCounter:00000000 Send Cnt 2
[1634070551.696626][3124:3132] CHIP:IN: Sending plaintext msg 0x36f60d78 to 0x0000000000000000 at utc time: 12321858 msec
[1634070551.696701][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.696857][3124:3132] CHIP:EM: Retransmit MessageCounter:A5298C7A Send Cnt 3
[1634070551.696937][3124:3132] CHIP:IN: Sending encrypted msg 0x36f60d90 to 0x00000000000000CA at utc time: 12321858 msec
[1634070551.697051][3124:3132] CHIP:IN: Sending msg on generic transport
[1634070551.697226][3124:3132] CHIP:EM: Retransmit MessageCounter:00000000 Send Cnt 3
[1634070551.952746][3124:3132] CHIP:EM: Failed to Send CHIP MessageCounter:A5298C7A sendCount: 3 max retries: 3
[1634070551.952855][3124:3132] CHIP:EM: Failed to Send CHIP MessageCounter:00000000 sendCount: 3 max retries: 3
[1634070558.203486][3124:3132] CHIP:DMG: Time out! failed to receive invoke command response from Exchange: 6460i
[1634070558.203663][3124:3132] CHIP:ZCL: DefaultResponse:
[1634070558.203723][3124:3132] CHIP:ZCL:   Transaction: 0xffffabaf7390
[1634070558.203771][3124:3132] CHIP:ZCL:   status: EMBER_ZCL_STATUS_FAILURE (0x01)
[1634070558.203832][3124:3132] CHIP:CTL: Received failure response 1

[1634070558.203943][3124:3132] CHIP:CTL: SyncSetKeyValue on StartKeyID
[1634070558.204161][3124:3132] CHIP:CTL: SyncSetKeyValue on PairedDeviceca
Failed to establish secure session to device: 1

OTBR fails to update the last SRP service after receiving SigmaR3. I am not sure if it is related. The following log snipped is from the EFR32 lighting app.

00> <detail> [SC] Received SigmaR3 msg
00> <detail> [DL] Thread packet RCVD: UDP, len 100
00> <detail> [DL]     src  FD11:33::DEA6:32FF:FE72:7537, port 44111
00> <detail> [DL]     dest FD11:22::F387:97AD:2458:B238, port 5540
00> <error > [DL] OnSrpClientNotification: Operation refused for security reasons
00> <error > [DL] OnSrpClientNotification: Operation refused for security reasons
00> <detail> [DL] Thread packet RCVD: UDP, len 645
00> <detail> [DL]     src  FD11:33::DEA6:32FF:FE72:7537, port 44111
00> <detail> [DL]     dest FD11:22::F387:97AD:2458:B238, port 5540

SSL_ERR_Controller#1.txt SSL_ERR_EFR32.txt SSL_ERR_Controller#2.txt

I am attaching all 3 logs for both scenario. (Ctrl 1, Ctrl 2, EFR32_Light_app)

Can you make any sense of either of those failures?

cecille commented 3 years ago

I think I'm running into the same issue now, and I'm noticing some logs that suggest we may be adding opcerts twice. It's hard to tell because the device attestation code re-uses some of the same stuff, but I'm seeing two entries in the fabric table for one commissioning.

msandstedt commented 2 years ago

https://github.com/project-chip/connectedhomeip/issues/12821 will contribute to this.

bzbarsky-apple commented 2 years ago

@cjandhyala Is this still an issue?

woody-apple commented 2 years ago

Issue Review: Closing issues prior to TE8. Please re-file if this is still an issue on newer SDKs