project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0
7.18k stars 1.9k forks source link

[BUG] LwIP: Incorrect task when not using LWIP_TCPIP_CORE_LOCKING #28590

Open rojer opened 11 months ago

rojer commented 11 months ago

Reproduction steps

LwIP offers two modes to run TCP/IP stack: a dedicated task or in-line. In the former case, access to LwIP's state is serialized by requiring that all APIs be invoked from the dedicated task (throwing a closure on the task's work queue if necessary). In the latter, a mutex is used and it is sufficient to LOCK_TCPIP_CORE();. tcpip_api_call() function is provided by LwIP and will do the right thing, depending on whether LWIP_TCPIP_CORE_LOCKING is set or not. There are advantages and disadvantages to both approaches but the point is, both are supported by the stack and either one can be used by a particular platform. However, CHIP assumes that LWIP_TCPIP_CORE_LOCKING is used while not explicitly requiring it, which results in incorrect locking model being used if LWIP_TCPIP_CORE_LOCKING is not set (basically no serialization), and exposes applications to races within networking stack with unpredictable results.

Notably, ESP32 runs LwIP in dedicated task mode by default (LWIP_TCPIP_CORE_LOCKING is a config option but is not the default).

CHIP should wither assert that LWIP_TCPIP_CORE_LOCKING is enabled or (preferably) wrap LwIP API calls in tcpip_api_call.

Bug prevalence

Always

GitHub hash of the SDK that was being used

4f658f62aeced8b413f0f3bf9f5117897234a120

Platform

esp32

Platform Version(s)

No response

Anything else?

https://github.com/project-chip/connectedhomeip/blob/732f797f9baad2619675536ae5c5fd4913ddb032/src/inet/UDPEndPointImplLwIP.cpp#L70-L71 this assumes core locking is enabled, LOCK_TCPIP_CORE is a no-op otherwise.

tcpip_api_call should be used to synchronously invoke an API function, using locking or sync closure if necessary.

shubhamdp commented 10 months ago

@rojer I have enabled the static assert in https://github.com/project-chip/connectedhomeip/pull/28798 and skipped it for ASR and bouffalolab platforms. Please take a look.

shubhamdp commented 10 months ago

Here is one more PR from @tx2rx for ASR: https://github.com/project-chip/connectedhomeip/pull/28824

rojer commented 10 months ago

@shubhamdp that looks reasonable. if you need help testing non-core-locking case, i'd be happy to.

rojer commented 10 months ago

@shubhamdp i took a stab at it and https://github.com/project-chip/connectedhomeip/pull/29057 is the result. this is pretty straightforward and seems to be working for me so far.

A note on LwIPReceiveUDPMessage: it will almost certainly be invoked in the context of a different thread/task than CHIP's: with LWIP_TCPIP_CORE_LOCKING=0 it will be the TCPIP thread for sure, with LWIP_TCPIP_CORE_LOCKING=1 we don't really know, some other thread handling the interface RX (having a dedicated task for that is a common pattern this is what ESP32 does too, with separate tasks for wifi and eth). Granted, the method doesn't do much before posting the packet to the event queue, but even "not much" is something that could be racy, so I added Lock/UnlockChipStack() calls.

Correction: turns out, i cannot use PlatformManager from Inet due to a dependency cycle. So we should be locking the stack but looks like at present we can't. I'll need guidance on how to break this cycle...