Crash doing onnetwork connection with chip-device-ctrl

bzbarsky-apple commented 3 years ago

Problem

Steps to reproduce:

Check out revision 48ed12b9f196e8c384ca102d6663d0766694d65e.
scripts/examples/gn_build_example.sh examples/all-clusters-app/linux out/debug/standalone chip_config_network_layer_ble=false
./scripts/build_python.sh
In one terminal, run ./out/debug/standalone/chip-all-clusters-app
In a second terminal run source ./out/python_env/bin/activate && chip-device-ctrl
At the chip-device-ctrl prompt run connect -ip ::1 20202021 12344321

The python controller crashes.

Stack to crash:

``` (lldb) bt * thread #2, queue = 'com.zigbee.chip.framework.controller.workqueue', stop reason = EXC_BAD_ACCESS (code=1, address=0x30) * frame #0: 0x00000001098bb77c _ChipDeviceCtrl.so`chip::Transport::PeerAddress::IsInitialized(this=0x0000000000000020) const at PeerAddress.h:102:56 frame #1: 0x00000001098bb70c _ChipDeviceCtrl.so`chip::Transport::PeerConnectionState::IsInitialized(this=0x0000000000000020) at PeerConnectionState.h:87:30 frame #2: 0x00000001098b8ae1 _ChipDeviceCtrl.so`chip::Transport::PeerConnections<16ul, (chip::Time::Source)0>::FindPeerConnectionState(this=0x0000000000000018, nodeId=(mValue = 18446744073709551615, mHasValue = true), peerKeyId=0, begin=0x0000000000000000) at PeerConnections.h:224:24 frame #3: 0x00000001098b7e03 _ChipDeviceCtrl.so`chip::SecureSessionMgr::GetPeerConnectionState(this=0x0000000000000000, session=(mPeerNodeId = 18446744073709551615, mPeerKeyId = 0, mAdmin = 65535)) at SecureSessionMgr.cpp:550:29 frame #4: 0x000000010984ce91 _ChipDeviceCtrl.so`chip::Controller::Device::Serialize(this=0x00007f9910088028, output=0x00007000046b4f20) at CHIPDevice.cpp:166:73 frame #5: 0x000000010984d64e _ChipDeviceCtrl.so`chip::Controller::Device::Persist(this=0x00007f9910088028) at CHIPDevice.cpp:287:9 frame #6: 0x0000000109852357 _ChipDeviceCtrl.so`chip::Controller::DeviceController::PersistDevice(this=0x00007f9910088000, device=0x00007f9910088028) at CHIPDeviceController.cpp:517:17 frame #7: 0x00000001098557a5 _ChipDeviceCtrl.so`chip::Controller::DeviceCommissioner::OperationalDiscoveryComplete(this=0x00007f9910088000, remoteDeviceId=12344321) at CHIPDeviceController.cpp:1082:5 frame #8: 0x0000000109856fd9 _ChipDeviceCtrl.so`chip::Controller::DeviceCommissioner::OnNodeIdResolved(this=0x00007f9910088000, nodeData=0x00007000046b5118) at CHIPDeviceController.cpp:1538:5 frame #9: 0x000000010987b011 _ChipDeviceCtrl.so`chip::Mdns::DiscoveryImplPlatform::HandleNodeIdResolve(context=0x00000001098faa38, result=0x00007000046b5238, error=0) at Discovery_ImplPlatform.cpp:521:29 frame #10: 0x0000000109892ce6 _ChipDeviceCtrl.so`chip::Mdns::OnGetAddrInfo(sdRef=0x00007f993bf5d2e0, flags=1073741827, interfaceId=6, err=0, hostname="FCF5C430E4E4.local.", address=0x00007000046b52e0, ttl=120, context=0x00007f993bf04710) at MdnsImpl.cpp:394:5 frame #11: 0x00007fff674b7032 libsystem_dnssd.dylib`handle_addrinfo_response + 553 frame #12: 0x00007fff674b6b5d libsystem_dnssd.dylib`DNSServiceProcessResult + 674 frame #13: 0x00007fff67372658 libdispatch.dylib`_dispatch_client_callout + 8 ```

What's happening here is that we land in DeviceCommissioner::OperationalDiscoveryComplete, call Controller::DeviceController::PersistDevice, end up in Device::Serialize, and crash getting the connection state because mSessionManager is null.

Bisect shows this is a regression from #7666. @pan-apple

Proposed Solution

Figure out where things are going wrong and fix the crash.

bzbarsky-apple commented 3 years ago

It looks like we init the Device, then later reset it on error with a stack like so:

Stack

``` frame #0: 0x0000000115e4e152 _ChipDeviceCtrl.so`chip::Controller::Device::Reset(this=0x00007fe3f0008028) at CHIPDevice.cpp:425:21 frame #1: 0x0000000115e52b89 _ChipDeviceCtrl.so`chip::Controller::DeviceController::ReleaseDevice(this=0x00007fe3f0008000, device=0x00007fe3f0008028) at CHIPDeviceController.cpp:626:13 frame #2: 0x0000000115e56c60 _ChipDeviceCtrl.so`chip::Controller::DeviceCommissioner::ReleaseDevice(this=0x00007fe3f0008000, device=0x00007fe3f0008028) at CHIPDeviceController.cpp:1418:23 frame #3: 0x0000000115e52be7 _ChipDeviceCtrl.so`chip::Controller::DeviceController::ReleaseDevice(this=0x00007fe3f0008000, index=0) at CHIPDeviceController.cpp:633:9 frame #4: 0x0000000115e55435 _ChipDeviceCtrl.so`chip::Controller::DeviceCommissioner::RendezvousCleanup(this=0x00007fe3f0008000, status=1) at CHIPDeviceController.cpp:1108:27 * frame #5: 0x0000000115e55877 _ChipDeviceCtrl.so`chip::Controller::DeviceCommissioner::OnSessionEstablishmentError(this=0x00007fe3f0008000, err=1) at CHIPDeviceController.cpp:1128:5 frame #6: 0x0000000115e533a6 _ChipDeviceCtrl.so`chip::Controller::BasicFailure(context=0x00007fe3f0008000, status='\x01') at CHIPDeviceController.cpp:1523:19 frame #7: 0x0000000115e0f0da _ChipDeviceCtrl.so`IMDefaultResponseCallback(commandObj=0x0000000115ef71f0, status=EMBER_ZCL_STATUS_FAILURE) at CHIPClientCallbacks.cpp:309:9 frame #8: 0x0000000115e56ef4 _ChipDeviceCtrl.so`chip::Controller::DeviceControllerInteractionModelDelegate::CommandResponseError(this=0x0000000115ef5280, apCommandSender=0x0000000115ef71f0, aError=4050) at CHIPDeviceController.cpp:1501:5 frame #9: 0x0000000115da7984 _ChipDeviceCtrl.so`chip::Controller::PythonInteractionModelDelegate::CommandResponseError(this=0x0000000115ef5280, apCommandSender=0x0000000115ef71f0, aError=4050) at Delegate.cpp:70:51 frame #10: 0x0000000115e2d866 _ChipDeviceCtrl.so`chip::app::CommandSender::OnResponseTimeout(this=0x0000000115ef71f0, apExchangeContext=0x00007fe41a02ff60) at CommandSender.cpp:123:21 frame #11: 0x0000000115e81eec _ChipDeviceCtrl.so`chip::Messaging::ExchangeContext::NotifyResponseTimeout(this=0x00007fe41a02ff60) at ExchangeContext.cpp:368:19 frame #12: 0x0000000115e81ea1 _ChipDeviceCtrl.so`chip::Messaging::ExchangeContext::OnConnectionExpired(this=0x00007fe41a02ff60) at ExchangeContext.cpp:321:5 frame #13: 0x0000000115e85050 _ChipDeviceCtrl.so`auto chip::Messaging::ExchangeManager::OnConnectionExpired(this=0x000070000c026d60, ec=0x00007fe41a02ff60)::$_2::operator()(chip::Messaging::ExchangeContext*) const at ExchangeMgr.cpp:329:17 frame #14: 0x0000000115e84718 _ChipDeviceCtrl.so`bool chip::BitMapObjectPool::ForEachActiveObject(chip::Messaging::ExchangeManager::OnConnectionExpired(chip::SecureSessionHandle)::$_2) at Pool.h:140:26 frame #15: 0x0000000115e8460a _ChipDeviceCtrl.so`chip::Messaging::ExchangeManager::OnConnectionExpired(this=0x00007fe41a02fe00, session=(mPeerNodeId = 12344321, mPeerKeyId = 0, mAdmin = 0)) at ExchangeMgr.cpp:326:18 frame #16: 0x0000000115eb9fcc _ChipDeviceCtrl.so`chip::SecureSessionMgr::HandleConnectionExpired(this=0x00007fe41a040800, state=0x00007fe41a040820) at SecureSessionMgr.cpp:529:14 frame #17: 0x0000000115ebb7f0 _ChipDeviceCtrl.so`chip::SecureSessionMgr::ExpirePairing(this=0x000070000c026f78, state1=0x00007fe41a040820)::$_0::operator()(chip::Transport::PeerConnectionState const&) const at SecureSessionMgr.cpp:223:76 frame #18: 0x0000000115eb84b4 _ChipDeviceCtrl.so`void chip::Transport::PeerConnections<16ul, (chip::Time::Source)0>::MarkConnectionExpired(chip::Transport::PeerConnectionState*, chip::SecureSessionMgr::ExpirePairing(chip::SecureSessionHandle)::$_0) at PeerConnections.h:329:9 frame #19: 0x0000000115eb847d _ChipDeviceCtrl.so`chip::SecureSessionMgr::ExpirePairing(this=0x00007fe41a040800, session=(mPeerNodeId = 12344321, mPeerKeyId = 0, mAdmin = 0)) at SecureSessionMgr.cpp:222:26 frame #20: 0x0000000115e4e56b _ChipDeviceCtrl.so`chip::Controller::Device::OperationalCertProvisioned(this=0x00007fe3f0008028) at CHIPDevice.cpp:517:26 frame #21: 0x0000000115e55795 _ChipDeviceCtrl.so`chip::Controller::DeviceCommissioner::OperationalDiscoveryComplete(this=0x00007fe3f0008000, remoteDeviceId=12344321) at CHIPDeviceController.cpp:1081:13 frame #22: 0x0000000115e56fd9 _ChipDeviceCtrl.so`chip::Controller::DeviceCommissioner::OnNodeIdResolved(this=0x00007fe3f0008000, nodeData=0x000070000c027118) at CHIPDeviceController.cpp:1538:5 frame #23: 0x0000000115e7b011 _ChipDeviceCtrl.so`chip::Mdns::DiscoveryImplPlatform::HandleNodeIdResolve(context=0x0000000115efaa38, result=0x000070000c027238, error=0) at Discovery_ImplPlatform.cpp:521:29 frame #24: 0x0000000115e92ce6 _ChipDeviceCtrl.so`chip::Mdns::OnGetAddrInfo(sdRef=0x00007fe3e9e040f0, flags=2, interfaceId=6, err=0, hostname="FCF5C430E4E4.local.", address=0x000070000c0272e0, ttl=120, context=0x00007fe3e9e04080) at MdnsImpl.cpp:394:5 frame #25: 0x00007fff674b7032 libsystem_dnssd.dylib`handle_addrinfo_response + 553 frame #26: 0x00007fff674b6b5d libsystem_dnssd.dylib`DNSServiceProcessResult + 674 ```

which I suspect corresponds to this part of the log:

[1624924111550] [0x3df13a] CHIP: [CTL] Finding node on operational network
[1624924111550] [0x3df13a] CHIP: [IN] Encrypted message 0x70000c026b70 from 0x000000000001B669 to 0x0000000000BC5C01 of type 16 and protocolId 0 on exchange 44255.
[1624924111550] [0x3df13a] CHIP: [IN] Sending msg 0x70000c026b70 to 0x0000000000BC5C01 at utc time: 404729692 msec
[1624924111550] [0x3df13a] CHIP: [IN] Sending secure msg on generic transport
[1624924111550] [0x3df13a] CHIP: [IN] Secure msg send status No Error
[1624924112163] [0x3df13a] CHIP: [DIS] Node ID resolved for 0x0000000000BC5C01 to fe80::fef5:c4ff:fe30:e4e4
[1624924112163] [0x3df13a] CHIP: [CTL] Calling commissioning complete
[1624924112164] [0x3df13a] CHIP: [IN] Encrypted message 0x7fe41a02fe58 from 0x000000000001B669 to 0x0000000000BC5C01 of type 8 and protocolId 5 on exchange 44256.
[1624924112164] [0x3df13a] CHIP: [IN] Sending msg 0x7fe41a02fe58 to 0x0000000000BC5C01 at utc time: 404730305 msec
[1624924112164] [0x3df13a] CHIP: [IN] Sending secure msg on generic transport
[1624924112164] [0x3df13a] CHIP: [IN] Secure msg send status No Error
Node address has been updated
[1624924112164] [0x3df13a] CHIP: [CTL] OperationalDiscoveryComplete for device ID 12344321
Device temporary node id (**this does not match spec**): 12344321
[1624924112164] [0x3df13a] CHIP: [DMG] Time out! failed to receive invoke command response from Exchange: 44256
[1624924112164] [0x3df13a] CHIP: [ZCL] DefaultResponse:
[1624924112164] [0x3df13a] CHIP: [ZCL]   Transaction: 0x115ef71f0
[1624924112164] [0x3df13a] CHIP: [ZCL]   status: EMBER_ZCL_STATUS_FAILURE (0x01)
[1624924112164] [0x3df13a] CHIP: [CTL] Received failure response 1

So DeviceCommissioner::OperationalDiscoveryComplete calls device->OperationalCertProvisioned() which winds via the callstack above and ends up resetting that very device object (already questionable). But then we call PersistDevice(device); with a now-reset device, which is no good.

What I don't understand is why this does not bite our other controllers, the ones we test in CI...

bzbarsky-apple commented 3 years ago

So with chip-tool what happens is that we get DeviceCommissioner::OnOperationalCertificateAddResponse before we get OnNodeIdResolved. That calls into DeviceCommissioner::OnOperationalCredentialsProvisioningCompletion which calls DeviceCommissioner::RendezvousCleanup. That's because in that case CONFIG_USE_CLUSTERS_FOR_IP_COMMISSIONING is not defined (why?) so we effectively take the !mIsIPRendezvous path. In chip-device-ctrl, this is defined by default (at least as built by ./scripts/build_python.sh, so there we take the AdvanceCommissioningStage path instead of doing a RendezvousCleanup.

If I compile chip-tool with chip_use_clusters_for_ip_commissioning=true I get the same crash there.

So some obvious questions:

Should we still have the CONFIG_USE_CLUSTERS_FOR_IP_COMMISSIONING define? Or should that just always be true? @cecille
It seems like the real issue is that we send the CommissioningComplete command under DeviceCommissioner::OnNodeIdResolved before we call OperationalDiscoveryComplete and start tearing things down. It seems to me like we should sort out our switch to the operational network and only then do OperationalDiscoveryComplete. That's effectively what we end up doing in the !CONFIG_USE_CLUSTERS_FOR_IP_COMMISSIONING case, by doing the RendezvousCleanup a lot earlier in the process....
DeviceCommissioner::OperationalDiscoveryComplete should perhaps check that device->OperationalCertProvisioned() does not tear down the device for some other silly reason before trying to persist it? I guess if we had had that we would not have noticed the ordering issue item 2 describes. Also, I still find it really weird to have a Device method that ends up destroying the Device.....

bzbarsky-apple commented 3 years ago

OK, so I tried the fix from my question 2 combined with compiling chip-tool with chip_use_clusters_for_ip_commissioning=true. The result ends up doing this:

[1624927763705] [0x3f101f] CHIP: [CTL] Enabling CASE session establishment for the device
[1624927763705] [0x3f101f] CHIP: [IN] Connection from 'UDP:fe80::fef5:c4ff:fe30:e4e4:11097' expired
[1624927763707] [0x3f101f] CHIP: [SC] Sent SigmaR1 msg
[1624927763707] [0x3f101f] CHIP: [CTL] Calling commissioning complete
[1624927763707] [0x3f101f] CHIP: [DMG] ICR moving to [Initialize]
[1624927763707] [0x3f101f] CHIP: [DMG] ICR moving to [AddCommand]
[1624927763707] [0x3f101f] CHIP: [-] CHIP Error 4072 (0x00000FE8): Not connected at ../../../examples/chip-tool/third_party/connectedhomeip/src/app/CommandSender.cpp:75
[1624927763707] [0x3f101f] CHIP: [DMG] ICR moving to [Uninitiali]
[1624927764111] [0x3f100e] CHIP: [DL] Mdns: OnRegister name: 0000000000000000-0000000000000000, type: _chip._tcp., domain: local., flags: 2
[1624927764363] [0x3f100e] CHIP: [DL] Mdns: OnRegister name: 0000000000000000-0000000000BC5C01, type: _chip._tcp., domain: local., flags: 2
[1624927764363] [0x3f100e] CHIP: [DL] Mdns: OnRegister name: 0000000000000000-0000000000BC5C01, type: _chip._tcp., domain: local., flags: 2
[1624927768676] [0x3f1020] CHIP: [EM] Retransmit MsgId:00000000 Send Cnt 1
[1624927768693] [0x3f100f] CHIP: [DIS] Commissioning errored out with error 4050

which is not surprising. DeviceCommissioner::OnNodeIdResolved is calling AdvanceCommissioningStage immediately, which tries to send the CommissioningComplete command immediately, before waiting for CASE setup.

I also just checked and if I compile chip-tool with chip_use_clusters_for_ip_commissioning=false then it never reaches the "Calling commissioning complete" log line that corresponds to advancing to the kSendComplete stage. Which is clearly not a spec-compliant thing to be doing....

Also, I am confused by the semantics of the DevicePairingDelegate::SecurePairingSuccess status update. This seems to be sent in the following cases:

If CONFIG_USE_CLUSTERS_FOR_IP_COMMISSIONING is false or if we're doing non-IP rendezvous: right after OnOperationalCredentialsProvisioningCompletion, which is way before we're ready to start doing anything operational; at this point the consumer still needs to configure and enable networks (for the non-IP-rendezvous) case, etc.
If CONFIG_USE_CLUSTERS_FOR_IP_COMMISSIONING is true and we are doing IP rendezvous, when we reach the kCleanup stage, which is after we have gotten a response back for the CommissioningComplete command.

I don't see how consumers can sanely make sense of that... We really need to have our notifications come in some deterministic order and need our state machine to not look wildly different in these different cases. And ideally have fewer cases.

project-chip / connectedhomeip

Crash doing onnetwork connection with chip-device-ctrl #7982

Problem

Proposed Solution