project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0

[BUG] Too Many Network Interfaces Causes Segfault in chip-tool #27915

Open luke-ingle opened 1 year ago

luke-ingle commented 1 year ago

Reproduction steps

  1. Build chip-tool, by building Matter. https://github.com/project-chip/connectedhomeip/blob/master/docs/guides/BUILDING.md
  2. Configure multiple network interfaces (50)
  3. Run chip-tool and give it a command to perform:

     out/host/chip-tool pairing ble-thread 101 hex:0e080000000000010000000300001635060004001fffe002088a4dee6232a203350708fd852c12168e7729051013c250b2ab003e9261ca98546f47739c030f4f70656e5468726561642d623630360102b6060410f2443479159316b89b4dbf67be0e7b230c0402a0f7f8 16544735 1852 --paa-trust-store-path credentials/production/paa-root-certs/

gdb output at the point of the segfault: chip-tool-segfault.txt

Bug prevalence

Every time I have too many network interfaces

GitHub hash of the SDK that was being used

aaa25420f02d353c74677bef6e79abbd46eea463

Platform

other

Platform Version(s)

No response

Anything else?

I was previously working on Android Cuttlefish, and the emulator gave me a whole bunch of extra network interfaces.

I also work remotely so I have to use my company VPN.

If I reboot my laptop (after building chip-tool) and run chip-tool, it works as expected. If I then log on to my VPN and try to run chip-tool, I get a segfault. If I then disconnect from the VPN and try again, it works fine. I then tried removing the majority of the network interfaces, leaving just my laptop's built-in network card, Docker and the company VPN. I then connected to the VPN and ran chip-tool, and it ran fine.

So I am assuming there's an error around the network and finding a valid network connection if a user has multiple options?

The list of interfaces is included in the attached file list_of_network_interfaces.txt
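
For anyone who wants to check how many interfaces a host exposes before running chip-tool, a small POSIX helper can count them. This is a minimal sketch and not part of the Matter SDK: the function name CountUpInterfaces is hypothetical, and it relies only on getifaddrs(3) and the IFF_UP flag.

```cpp
#include <sys/types.h>
#include <ifaddrs.h>
#include <net/if.h>

#include <set>
#include <string>

// Hypothetical helper (not an SDK function).
// Returns the number of distinct interface names that are flagged as up,
// or -1 if getifaddrs() fails.
int CountUpInterfaces()
{
    ifaddrs * list = nullptr;
    if (getifaddrs(&list) != 0)
    {
        return -1;
    }

    // getifaddrs() returns one entry per address, so an interface with both
    // IPv4 and IPv6 addresses shows up several times. Deduplicate by name.
    std::set<std::string> names;
    for (ifaddrs * it = list; it != nullptr; it = it->ifa_next)
    {
        if (it->ifa_name != nullptr && (it->ifa_flags & IFF_UP) != 0)
        {
            names.insert(it->ifa_name);
        }
    }

    freeifaddrs(list);
    return static_cast<int>(names.size());
}
```

Comparing this count with and without the VPN connected gives a quick indication of whether the interface count is what changes between the working and the crashing runs.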

bzbarsky-apple commented 1 year ago

@luke-ingle When it's crashing, what's the value of watch?

bzbarsky-apple commented 1 year ago

@luke-ingle Also, what does the backtrace look like? Who is calling StopWatchingSocket with what I assume is a garbage value (because we do check for null right before that line, so I assume watch is not null).

luke-ingle commented 1 year ago

@bzbarsky-apple I've attached the backtrace (I think; I haven't had a lot of experience using gdb before). It appears watch is not null. gdb_backtrace.txt

bzbarsky-apple commented 1 year ago

@luke-ingle Thank you, that is helpful. Just to make sure we are looking at the same things, what is line 266 of src/lib/dnssd/minimal_mdns/Server.cpp for you?

luke-ingle commented 1 year ago

No problems, this is what I have:

261 #if !CHIP_DEVICE_LAYER_NONE
262             chip::DeviceLayer::ChipDeviceEvent event{};
263             event.Type = chip::DeviceLayer::DeviceEventType::kDnssdInitialized;
264             chip::DeviceLayer::PlatformMgr().PostEventOrDie(&event);
265 #endif
266             mIsInitialized = true;
267         }
268     }
269 
270     return autoShutdown.ReturnSuccess();

bzbarsky-apple commented 1 year ago

OK, thank you.

Does applying this change:

diff --git a/src/inet/UDPEndPointImplSockets.cpp b/src/inet/UDPEndPointImplSockets.cpp
index 5c9748d0cf..98190a0e4d 100644
--- a/src/inet/UDPEndPointImplSockets.cpp
+++ b/src/inet/UDPEndPointImplSockets.cpp
@@ -469,7 +469,13 @@ CHIP_ERROR UDPEndPointImplSockets::GetSocket(IPAddressType addressType)
         {
             return CHIP_ERROR_POSIX(errno);
         }
-        ReturnErrorOnFailure(static_cast<System::LayerSockets *>(&GetSystemLayer())->StartWatchingSocket(mSocket, &mWatch));
+        CHIP_ERROR err = static_cast<System::LayerSockets *>(&GetSystemLayer())->StartWatchingSocket(mSocket, &mWatch);
+        if (err != CHIP_NO_ERROR) {
+            // Our mWatch is not valid; make sure we never use it.
+            close(mSocket);
+            mSocket = kInvalidSocketFd;
+            return err;
+        }

         mAddrType = addressType;

fix the crash for you?
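
The idea behind the change above can be illustrated outside the SDK. The sketch below uses hypothetical WatchTable and Endpoint classes (stand-ins, not Matter types) and assumes that a failed watch registration leaves the watch token unusable. On failure the socket is closed and reset to an invalid descriptor, so later cleanup never hands a stale token to the stop call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#include <sys/socket.h>
#include <unistd.h>

constexpr int kInvalidFd = -1;

// Hypothetical stand-in for a socket watch registry with a fixed number of slots.
class WatchTable
{
public:
    explicit WatchTable(size_t capacity) : mSlots(capacity, kInvalidFd) {}

    // On success stores the slot index in *token. On failure *token is untouched.
    bool Start(int fd, size_t * token)
    {
        for (size_t i = 0; i < mSlots.size(); ++i)
        {
            if (mSlots[i] == kInvalidFd)
            {
                mSlots[i] = fd;
                *token    = i;
                return true;
            }
        }
        return false; // table full
    }

    // A stale or garbage token trips this assertion instead of corrupting memory.
    void Stop(size_t token)
    {
        assert(token < mSlots.size() && mSlots[token] != kInvalidFd);
        mSlots[token] = kInvalidFd;
    }

    size_t Active() const
    {
        size_t count = 0;
        for (int fd : mSlots)
        {
            count += (fd != kInvalidFd) ? 1 : 0;
        }
        return count;
    }

private:
    std::vector<int> mSlots;
};

// Hypothetical endpoint that owns one UDP socket and one watch token.
class Endpoint
{
public:
    ~Endpoint() { Close(); }

    bool Open(WatchTable & table)
    {
        if (mFd != kInvalidFd)
        {
            return true; // already open
        }
        mFd = socket(AF_INET, SOCK_DGRAM, 0);
        if (mFd < 0)
        {
            mFd = kInvalidFd;
            return false;
        }
        if (!table.Start(mFd, &mToken))
        {
            // mToken is not valid; reset the socket so Close() never uses it.
            close(mFd);
            mFd = kInvalidFd;
            return false;
        }
        mTable = &table;
        return true;
    }

    void Close()
    {
        if (mFd == kInvalidFd)
        {
            return; // nothing was registered, so there is nothing to stop
        }
        mTable->Stop(mToken);
        close(mFd);
        mFd = kInvalidFd;
    }

    bool IsOpen() const { return mFd != kInvalidFd; }

private:
    int mFd             = kInvalidFd;
    size_t mToken       = 0;
    WatchTable * mTable = nullptr;
};
```

With a table of capacity 1, a second Open() fails cleanly and a later Close() on that endpoint is a no-op. Without the reset, Close() would call Stop() with a token that was never assigned.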

luke-ingle commented 1 year ago

Thanks for the diff. That change doesn't "fix" the crash as such, but at least it now gives a more informative error message:

...
[1689580747.101217][12010:12010] CHIP:TS: Last Known Good Time: 2023-07-12T10:24:00
[1689580747.101464][12010:12010] CHIP:ZCL: Using ZAP configuration...
[1689580747.144773][12010:12010] CHIP:-: src/system/SystemLayerImplSelect.cpp:358: CHIP Error 0x000000C1: Endpoint pool full at examples/chip-tool/commands/common/CHIPCommand.cpp:128
[1689580747.144783][12010:12010] CHIP:TOO: Run command failure: src/system/SystemLayerImplSelect.cpp:358: CHIP Error 0x000000C1: Endpoint pool full
[1689580747.154974][12010:12010] CHIP:SPT: VerifyOrDie failure at src/lib/support/Pool.h:337: Allocated() == 0
Aborted (core dumped)

bzbarsky-apple commented 1 year ago

@luke-ingle Thank you for checking that!

Sorry for the lag; I might not be able to get to this until Monday. But it sounds like there are multiple issues here...

bzbarsky-apple commented 1 year ago

@luke-ingle So I have been trying to reproduce this by setting a low INET_CONFIG_NUM_UDP_ENDPOINTS (6, in this case) on Mac. I do get a failed startup and asserts about it, as somewhat expected, but I do not get the specific "Allocated() == 0" assertion failure you get. Can you figure out which pool that is, perhaps? What's the stack leading to that VerifyOrDie failure at src/lib/support/Pool.h:337?
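
As a rough illustration of how a fixed-size pool can produce both messages seen in the log, here is a minimal sketch. FixedPool is hypothetical and is not the SDK's Pool.h. It assumes that the "Allocated() == 0" check runs when the pool is torn down, so it fires if any object is still allocated at shutdown.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-capacity pool (illustrative only).
template <typename T, size_t N>
class FixedPool
{
public:
    // Assumption: the teardown check lives in the destructor. Any object that
    // was created but never released makes this assertion fail.
    ~FixedPool() { assert(Allocated() == 0); }

    // Returns nullptr when every slot is taken (the "pool full" case).
    T * Create()
    {
        for (size_t i = 0; i < N; ++i)
        {
            if (!mInUse[i])
            {
                mInUse[i] = true;
                ++mAllocated;
                return &mSlots[i];
            }
        }
        return nullptr;
    }

    // Returns a slot to the pool. Pointers that are null or already released
    // are ignored.
    void Release(T * obj)
    {
        if (obj == nullptr)
        {
            return;
        }
        size_t i = static_cast<size_t>(obj - mSlots.data());
        if (i < N && mInUse[i])
        {
            mInUse[i] = false;
            --mAllocated;
        }
    }

    size_t Allocated() const { return mAllocated; }

private:
    std::array<T, N> mSlots{};
    std::array<bool, N> mInUse{};
    size_t mAllocated = 0;
};
```

In this model, a startup path that allocates some objects, hits the pool-full error, and then exits without releasing what it already allocated would report the pool-full error first and trip the teardown assertion afterwards. Whether that is what happens in chip-tool depends on which pool the assertion belongs to, which is the open question above.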

bzbarsky-apple commented 1 year ago

For now, https://github.com/project-chip/connectedhomeip/pull/28245 applies the above diff, so that we at least do not end up with an invalid memory reference.