project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0

[1.3] "chip-tool interactive server" disconnects clients and is prone to crashing. #31357

Closed by nmtoan91 2 months ago

nmtoan91 commented 6 months ago

Reproduction steps

Hi, I'm using the chip-tool interactive server and I noticed it has two issues.

  1. It automatically disconnects my client after server calls; sometimes it takes 20-30 minutes before the client is disconnected. Here is the log when it disconnects:
    [1704947446.834515][287080:287080] CHIP:TOO: LWS_CALLBACK_SERVER_WRITEABLE
    [1704947447.170357][287080:287080] CHIP:TOO: LWS_CALLBACK_CLOSED
    [1704947447.170378][287080:287080] CHIP:TOO: LWS_CALLBACK_WSI_DESTROY
  2. Sometimes the chip-tool crashes after serving for several hours.

Bug prevalence

all the time

GitHub hash of the SDK that was being used

latest

Platform

darwin

Platform Version(s)

1.3

Type

Core SDK Crash

Anything else?

No response

bzbarsky-apple commented 6 months ago

The websocket connection has a timeout. If nothing happens for a while it auto-closes, I believe...

nmtoan91 commented 6 months ago

I'll check the timeout later. But it does crash; the log is similar to this:

[1705017763.602824][287080:287080] CHIP:TOO: LWS_CALLBACK_CLOSED
[1705017763.602864][287080:287080] CHIP:TOO: LWS_CALLBACK_WSI_DESTROY
Segmentation fault (core dumped)

bzbarsky-apple commented 6 months ago

@nmtoan91 Can you run under a debugger and get a backtrace?

nmtoan91 commented 6 months ago

I'm sorry, are there any instructions for running chip-tool under a debugger?

bzbarsky-apple commented 6 months ago

Like any other program? gdb --args chip-tool interactive server or lldb chip-tool -- interactive server or whatever your debugger of choice is...

nmtoan91 commented 6 months ago

@bzbarsky-apple here is the backtrace from the crash:

Thread 1 "chip-tool" received signal SIGSEGV, Segmentation fault.
0x0000555555c7e9fe in lws_callback_on_writable (wsi=0x0) at ../connectedhomeip/examples/chip-tool/third_party/connectedhomeip/third_party/libwebsockets/repo/lib/core-net/pollfd.c:526
526             if (lwsi_state(wsi) == LRS_SHUTDOWN)
(gdb) bt
#0  0x0000555555c7e9fe in lws_callback_on_writable (wsi=0x0)
    at ../connectedhomeip/examples/chip-tool/third_party/connectedhomeip/third_party/libwebsockets/repo/lib/core-net/pollfd.c:526
#1  0x0000555556077761 in WebSocketServer::Run(chip::Optional<unsigned short>, WebSocketServerDelegate*)
    (this=0x555556190788, port=..., delegate=0x555556190778)
    at ../connectedhomeip/examples/chip-tool/third_party/connectedhomeip/examples/common/websocket-server/WebSocketServer.cpp:192
#2  0x0000555555e03e17 in InteractiveServerCommand::RunCommand() (this=0x555556190260)
    at ../connectedhomeip/examples/chip-tool/commands/interactive/InteractiveCommands.cpp:310
#3  0x0000555555dd6f41 in CHIPCommand::StartWaiting(std::chrono::duration<unsigned int, std::ratio<1l, 1000l> >) (this=0x555556190260, duration=...)
    at ../connectedhomeip/examples/chip-tool/commands/common/CHIPCommand.cpp:594
#4  0x0000555555dd4aaf in CHIPCommand::Run() (this=0x555556190260) at ../connectedhomeip/examples/chip-tool/commands/common/CHIPCommand.cpp:239
#5  0x0000555555dec694 in Commands::RunCommand(int, char**, bool, chip::Optional<char*> const&, bool)
    (this=0x7fffffffdbf0, argc=3, argv=0x7fffffffe2e8, interactive=false, interactiveStorageDirectory=..., interactiveAdvertiseOperational=false)
    at ../connectedhomeip/examples/chip-tool/commands/common/Commands.cpp:331
#6  0x0000555555deb149 in Commands::Run(int, char**) (this=0x7fffffffdbf0, argc=3, argv=0x7fffffffe2e8)
    at ../connectedhomeip/examples/chip-tool/commands/common/Commands.cpp:178
#7  0x00005555556e71f6 in main(int, char**) (argc=3, argv=0x7fffffffe2e8) at ../connectedhomeip/examples/chip-tool/main.cpp:54

bzbarsky-apple commented 5 months ago

OK, so presumably the issue based on that stack trace is that wsi is null, right? And that's presumably because we got a LWS_CALLBACK_WSI_DESTROY message.

@nmtoan91 Does adding a null-check for wsi before calling lws_callback_on_writable there fix the crash?
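
For illustration only, here is a minimal sketch of the kind of null-check being suggested. It does not use the real libwebsockets API or the actual code in WebSocketServer.cpp: FakeWsi, fake_callback_on_writable, and ConnectionState are hypothetical stand-ins, and the assumption that the stored pointer is cleared when the connection is destroyed is inferred from the backtrace (wsi=0x0), not confirmed against the source.

```cpp
// Hypothetical stand-in for libwebsockets' opaque `struct lws`.
// The real type is defined by the library and is not reproduced here.
struct FakeWsi {
    int writable_requests = 0;
};

// Stand-in for lws_callback_on_writable(). Like the real call in the
// backtrace, it dereferences its argument, so a null pointer would crash.
int fake_callback_on_writable(FakeWsi* wsi) {
    wsi->writable_requests++;
    return 1;
}

// Hypothetical holder for the current client connection.
struct ConnectionState {
    FakeWsi* wsi = nullptr;

    // Assumed behaviour: the pointer is cleared when the connection is
    // torn down (the LWS_CALLBACK_WSI_DESTROY case in the log).
    void OnDestroy() { wsi = nullptr; }

    // Guarded call site: skip the request when there is no live
    // connection instead of passing a null pointer into the library.
    // Returns true if the writable callback was actually requested.
    bool RequestWritable() {
        if (wsi == nullptr) {
            return false;
        }
        fake_callback_on_writable(wsi);
        return true;
    }
};
```

With this shape, a request made after the connection has gone away returns false rather than crashing. Whether the real fix should also drop or re-queue the pending message is a separate question that this sketch does not address.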