Closed bzbarsky-apple closed 3 years ago
It looks like we init the `Device`, then later reset it on error with a stack like so:
which I suspect corresponds to this part of the log:
[1624924111550] [0x3df13a] CHIP: [CTL] Finding node on operational network
[1624924111550] [0x3df13a] CHIP: [IN] Encrypted message 0x70000c026b70 from 0x000000000001B669 to 0x0000000000BC5C01 of type 16 and protocolId 0 on exchange 44255.
[1624924111550] [0x3df13a] CHIP: [IN] Sending msg 0x70000c026b70 to 0x0000000000BC5C01 at utc time: 404729692 msec
[1624924111550] [0x3df13a] CHIP: [IN] Sending secure msg on generic transport
[1624924111550] [0x3df13a] CHIP: [IN] Secure msg send status No Error
[1624924112163] [0x3df13a] CHIP: [DIS] Node ID resolved for 0x0000000000BC5C01 to fe80::fef5:c4ff:fe30:e4e4
[1624924112163] [0x3df13a] CHIP: [CTL] Calling commissioning complete
[1624924112164] [0x3df13a] CHIP: [IN] Encrypted message 0x7fe41a02fe58 from 0x000000000001B669 to 0x0000000000BC5C01 of type 8 and protocolId 5 on exchange 44256.
[1624924112164] [0x3df13a] CHIP: [IN] Sending msg 0x7fe41a02fe58 to 0x0000000000BC5C01 at utc time: 404730305 msec
[1624924112164] [0x3df13a] CHIP: [IN] Sending secure msg on generic transport
[1624924112164] [0x3df13a] CHIP: [IN] Secure msg send status No Error
Node address has been updated
[1624924112164] [0x3df13a] CHIP: [CTL] OperationalDiscoveryComplete for device ID 12344321
Device temporary node id (**this does not match spec**): 12344321
[1624924112164] [0x3df13a] CHIP: [DMG] Time out! failed to receive invoke command response from Exchange: 44256
[1624924112164] [0x3df13a] CHIP: [ZCL] DefaultResponse:
[1624924112164] [0x3df13a] CHIP: [ZCL] Transaction: 0x115ef71f0
[1624924112164] [0x3df13a] CHIP: [ZCL] status: EMBER_ZCL_STATUS_FAILURE (0x01)
[1624924112164] [0x3df13a] CHIP: [CTL] Received failure response 1
So `DeviceCommissioner::OperationalDiscoveryComplete` calls `device->OperationalCertProvisioned()`, which winds via the callstack above and ends up resetting that very device object (already questionable). But then we call `PersistDevice(device);` with a now-reset device, which is no good.
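
The hazard described above can be shown in isolation. This is a minimal sketch, not the actual controller code: `FakeDevice`, `GuardedCommissioner`, and all of their members are hypothetical stand-ins, and the guard is only one possible way to avoid persisting a device that a callback has just reset.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for Controller::Device; not the real class.
struct FakeDevice
{
    bool active           = true;  // becomes false once Reset() has run
    bool failProvisioning = false; // simulates the error path seen in the log

    void Reset() { active = false; }

    // Models a callback that may tear the device down on error.
    void OperationalCertProvisioned()
    {
        if (failProvisioning)
        {
            Reset();
        }
    }
};

// Hypothetical stand-in for DeviceCommissioner; not the real class.
struct GuardedCommissioner
{
    std::vector<const FakeDevice *> persisted;

    void PersistDevice(const FakeDevice & device)
    {
        // Precondition in this sketch: never persist a reset device.
        assert(device.active);
        persisted.push_back(&device);
    }

    // Returns true if the device was persisted, false if it was skipped.
    bool OperationalDiscoveryComplete(FakeDevice & device)
    {
        device.OperationalCertProvisioned();

        // Guard: the call above may have reset the device, so re-check
        // its state before touching it again.
        if (!device.active)
        {
            return false;
        }

        PersistDevice(device);
        return true;
    }
};
```

With the guard in place, a device whose provisioning callback resets it is skipped rather than serialized. This only hides the symptom; it does not address the ordering problem itself.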
What I don't understand is why this does not bite our other controllers, the ones we test in CI...
So with chip-tool what happens is that we get `DeviceCommissioner::OnOperationalCertificateAddResponse` before we get `OnNodeIdResolved`. That calls into `DeviceCommissioner::OnOperationalCredentialsProvisioningCompletion`, which calls `DeviceCommissioner::RendezvousCleanup`. That's because in that case `CONFIG_USE_CLUSTERS_FOR_IP_COMMISSIONING` is not defined (why?) so we effectively take the `!mIsIPRendezvous` path. In chip-device-ctrl, this is defined by default (at least as built by `./scripts/build_python.sh`), so there we take the `AdvanceCommissioningStage` path instead of doing a `RendezvousCleanup`.
If I compile chip-tool with `chip_use_clusters_for_ip_commissioning=true` I get the same crash there.
So some obvious questions:

1. `CONFIG_USE_CLUSTERS_FOR_IP_COMMISSIONING` define? Or should that just always be true? @cecille
2. `CommissioningComplete` command under `DeviceCommissioner::OnNodeIdResolved` before we call `OperationalDiscoveryComplete` and start tearing things down. It seems to me like we should sort out our switch to the operational network and only then do `OperationalDiscoveryComplete`. That's effectively what we end up doing in the `!CONFIG_USE_CLUSTERS_FOR_IP_COMMISSIONING` case, by doing the `RendezvousCleanup` a lot earlier in the process....
3. `DeviceCommissioner::OperationalDiscoveryComplete` should perhaps check that `device->OperationalCertProvisioned()` does not tear down the device for some other silly reason before trying to persist it? I guess if we had had that we would not have noticed the ordering issue item 2 describes. Also, I still find it really weird to have a `Device` method that ends up destroying the `Device`.....

OK, so I tried the fix from my question 2 combined with compiling chip-tool with `chip_use_clusters_for_ip_commissioning=true`. The result ends up doing this:
[1624927763705] [0x3f101f] CHIP: [CTL] Enabling CASE session establishment for the device
[1624927763705] [0x3f101f] CHIP: [IN] Connection from 'UDP:fe80::fef5:c4ff:fe30:e4e4:11097' expired
[1624927763707] [0x3f101f] CHIP: [SC] Sent SigmaR1 msg
[1624927763707] [0x3f101f] CHIP: [CTL] Calling commissioning complete
[1624927763707] [0x3f101f] CHIP: [DMG] ICR moving to [Initialize]
[1624927763707] [0x3f101f] CHIP: [DMG] ICR moving to [AddCommand]
[1624927763707] [0x3f101f] CHIP: [-] CHIP Error 4072 (0x00000FE8): Not connected at ../../../examples/chip-tool/third_party/connectedhomeip/src/app/CommandSender.cpp:75
[1624927763707] [0x3f101f] CHIP: [DMG] ICR moving to [Uninitiali]
[1624927764111] [0x3f100e] CHIP: [DL] Mdns: OnRegister name: 0000000000000000-0000000000000000, type: _chip._tcp., domain: local., flags: 2
[1624927764363] [0x3f100e] CHIP: [DL] Mdns: OnRegister name: 0000000000000000-0000000000BC5C01, type: _chip._tcp., domain: local., flags: 2
[1624927764363] [0x3f100e] CHIP: [DL] Mdns: OnRegister name: 0000000000000000-0000000000BC5C01, type: _chip._tcp., domain: local., flags: 2
[1624927768676] [0x3f1020] CHIP: [EM] Retransmit MsgId:00000000 Send Cnt 1
[1624927768693] [0x3f100f] CHIP: [DIS] Commissioning errored out with error 4050
which is not surprising. `DeviceCommissioner::OnNodeIdResolved` is calling `AdvanceCommissioningStage` immediately, which tries to send the `CommissioningComplete` command immediately, before waiting for CASE setup.
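
One way to avoid sending before the session exists is to queue the work until session establishment completes. The sketch below illustrates that idea only: `FakeSession`, `DeferredCommissioner`, `WhenEstablished`, and `OnEstablished` are hypothetical names that do not correspond to real controller APIs.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a secure session; not a real CHIP type.
class FakeSession
{
public:
    // Run `work` now if the session is up, otherwise queue it.
    void WhenEstablished(std::function<void()> work)
    {
        if (mEstablished)
        {
            work();
        }
        else
        {
            mPending.push_back(std::move(work));
        }
    }

    // Called when the (simulated) session handshake completes.
    void OnEstablished()
    {
        mEstablished = true;
        // Take the queue first so callbacks can safely schedule more work.
        std::vector<std::function<void()>> pending = std::move(mPending);
        mPending.clear();
        for (auto & work : pending)
        {
            work();
        }
    }

    bool IsEstablished() const { return mEstablished; }

private:
    bool mEstablished = false;
    std::vector<std::function<void()>> mPending;
};

// Hypothetical commissioner fragment; not the real DeviceCommissioner.
struct DeferredCommissioner
{
    FakeSession session;
    std::vector<std::string> sent; // records commands "sent" in this model

    void OnNodeIdResolved()
    {
        // Do not send right away; wait until the session is usable.
        session.WhenEstablished([this] { SendCommissioningComplete(); });
    }

    void SendCommissioningComplete()
    {
        assert(session.IsEstablished());
        sent.push_back("CommissioningComplete");
    }
};
```

In this model, calling `OnNodeIdResolved()` before `OnEstablished()` sends nothing, and the command goes out exactly once when the session comes up.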
I also just checked and if I compile chip-tool with `chip_use_clusters_for_ip_commissioning=false` then it never reaches the "Calling commissioning complete" log line that corresponds to advancing to the `kSendComplete` stage. Which is clearly not a spec-compliant thing to be doing....
Also, I am confused by the semantics of the `DevicePairingDelegate::SecurePairingSuccess` status update. This seems to be sent in the following cases:

- If `CONFIG_USE_CLUSTERS_FOR_IP_COMMISSIONING` is false or if we're doing non-IP rendezvous: right after `OnOperationalCredentialsProvisioningCompletion`, which is way before we're ready to start doing anything operational; at this point the consumer still needs to configure and enable networks (for the non-IP-rendezvous case), etc.
- `kCleanup` stage, which is after we have gotten a response back for the `CommissioningComplete` command.

I don't see how consumers can sanely make sense of that... We really need to have our notifications come in some deterministic order and need our state machine to not look wildly different in these different cases. And ideally have fewer cases.
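
To make "deterministic order" concrete, here is a rough sketch of a single linear state machine in which the success notification is emitted from exactly one place, whichever optional stages a given configuration needs. Apart from `kSendComplete` and `kCleanup`, which are named above, every identifier here (`CommissioningFlow`, `FlowConfig`, the other stages, the `"PairingSuccess"` event string) is made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stage names other than kSendComplete and kCleanup are assumptions.
enum class Stage
{
    kCredentials,
    kNetworkSetup,
    kSendComplete,
    kCleanup,
    kDone
};

// Hypothetical per-run configuration.
struct FlowConfig
{
    bool needsNetworkSetup; // e.g. true for a non-IP rendezvous
};

class CommissioningFlow
{
public:
    explicit CommissioningFlow(FlowConfig config) : mConfig(config) {}

    // Notifications recorded in the order they were emitted.
    std::vector<std::string> events;

    bool Done() const { return mStage == Stage::kDone; }

    // Advance by one stage. Advancing a finished flow is a programming
    // error in this sketch.
    void Advance()
    {
        assert(!Done());
        switch (mStage)
        {
        case Stage::kCredentials:
            // Optional stages are skipped, but the order never changes.
            mStage = mConfig.needsNetworkSetup ? Stage::kNetworkSetup : Stage::kSendComplete;
            break;
        case Stage::kNetworkSetup:
            mStage = Stage::kSendComplete;
            break;
        case Stage::kSendComplete:
            mStage = Stage::kCleanup;
            break;
        case Stage::kCleanup:
            // The only place where success is reported to the consumer.
            events.push_back("PairingSuccess");
            mStage = Stage::kDone;
            break;
        case Stage::kDone:
            break;
        }
    }

private:
    FlowConfig mConfig;
    Stage mStage = Stage::kCredentials;
};
```

Both configurations go through the same code path and differ only in which stages are skipped, so a consumer sees the same notification at the same logical point in either case.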
Problem
Steps to reproduce:

1. Check out `48ed12b9f196e8c384ca102d6663d0766694d65e`.
2. `scripts/examples/gn_build_example.sh examples/all-clusters-app/linux out/debug/standalone chip_config_network_layer_ble=false`
3. `./scripts/build_python.sh`
4. `./out/debug/standalone/chip-all-clusters-app`
5. `source ./out/python_env/bin/activate && chip-device-ctrl`
6. At the `chip-device-ctrl` prompt run `connect -ip ::1 20202021 12344321`

The python controller crashes.
Stack to crash:
```
(lldb) bt
* thread #2, queue = 'com.zigbee.chip.framework.controller.workqueue', stop reason = EXC_BAD_ACCESS (code=1, address=0x30)
  * frame #0: 0x00000001098bb77c _ChipDeviceCtrl.so`chip::Transport::PeerAddress::IsInitialized(this=0x0000000000000020) const at PeerAddress.h:102:56
    frame #1: 0x00000001098bb70c _ChipDeviceCtrl.so`chip::Transport::PeerConnectionState::IsInitialized(this=0x0000000000000020) at PeerConnectionState.h:87:30
    frame #2: 0x00000001098b8ae1 _ChipDeviceCtrl.so`chip::Transport::PeerConnections<16ul, (chip::Time::Source)0>::FindPeerConnectionState(this=0x0000000000000018, nodeId=(mValue = 18446744073709551615, mHasValue = true), peerKeyId=0, begin=0x0000000000000000) at PeerConnections.h:224:24
    frame #3: 0x00000001098b7e03 _ChipDeviceCtrl.so`chip::SecureSessionMgr::GetPeerConnectionState(this=0x0000000000000000, session=(mPeerNodeId = 18446744073709551615, mPeerKeyId = 0, mAdmin = 65535)) at SecureSessionMgr.cpp:550:29
    frame #4: 0x000000010984ce91 _ChipDeviceCtrl.so`chip::Controller::Device::Serialize(this=0x00007f9910088028, output=0x00007000046b4f20) at CHIPDevice.cpp:166:73
    frame #5: 0x000000010984d64e _ChipDeviceCtrl.so`chip::Controller::Device::Persist(this=0x00007f9910088028) at CHIPDevice.cpp:287:9
    frame #6: 0x0000000109852357 _ChipDeviceCtrl.so`chip::Controller::DeviceController::PersistDevice(this=0x00007f9910088000, device=0x00007f9910088028) at CHIPDeviceController.cpp:517:17
    frame #7: 0x00000001098557a5 _ChipDeviceCtrl.so`chip::Controller::DeviceCommissioner::OperationalDiscoveryComplete(this=0x00007f9910088000, remoteDeviceId=12344321) at CHIPDeviceController.cpp:1082:5
    frame #8: 0x0000000109856fd9 _ChipDeviceCtrl.so`chip::Controller::DeviceCommissioner::OnNodeIdResolved(this=0x00007f9910088000, nodeData=0x00007000046b5118) at CHIPDeviceController.cpp:1538:5
    frame #9: 0x000000010987b011 _ChipDeviceCtrl.so`chip::Mdns::DiscoveryImplPlatform::HandleNodeIdResolve(context=0x00000001098faa38, result=0x00007000046b5238, error=0) at Discovery_ImplPlatform.cpp:521:29
    frame #10: 0x0000000109892ce6 _ChipDeviceCtrl.so`chip::Mdns::OnGetAddrInfo(sdRef=0x00007f993bf5d2e0, flags=1073741827, interfaceId=6, err=0, hostname="FCF5C430E4E4.local.", address=0x00007000046b52e0, ttl=120, context=0x00007f993bf04710) at MdnsImpl.cpp:394:5
    frame #11: 0x00007fff674b7032 libsystem_dnssd.dylib`handle_addrinfo_response + 553
    frame #12: 0x00007fff674b6b5d libsystem_dnssd.dylib`DNSServiceProcessResult + 674
    frame #13: 0x00007fff67372658 libdispatch.dylib`_dispatch_client_callout + 8
```

What's happening here is that we land in `DeviceCommissioner::OperationalDiscoveryComplete`, call `Controller::DeviceController::PersistDevice`, end up in `Device::Serialize`, and crash getting the connection state because `mSessionManager` is null.

Bisect shows this is a regression from #7666. @pan-apple
Proposed Solution
Figure out where things are going wrong and fix the crash.