orocos-toolchain / rtt

Orocos Real-Time Toolkit
http://www.orocos.org
Other
72 stars 79 forks source link

[WIP] fix deadlock in concurrent connection disconnection #302

Closed doudou closed 2 years ago

doudou commented 5 years ago

This PR aims at fixing the issue identified in #300. The deadlock happens essentially when disconnecting the same ports from two directions. The easiest way is to disconnect the same connection from both directions at the same time, which leads reliably to deadlocking (associated test in https://github.com/orocos-toolchain/rtt/commit/bb6fbc14306362e1e92c1160c58058bc2492ac34).

So far, the PR fixed the few easy places. However, there are a few very large codepaths still executed under lock, as removing the lock was breaking the tests without an obvious reason from my side. I figured that you guys might have more insight.

In general, my gut feeling is that the general design should be to connect/disconnect a channel from a port under the port lock, but wait until the lock is released to actually dismantle the channel. This would ensure that the large latency part (hitting the transports) is done without affecting the component.

meyerj commented 5 years ago

Thanks for your patch!

The new problem introduced with #114 is that some buffers might only have to be created once per input port or once per output port. The first implementation of the new dataflow semantics was implemented without that port-wide connection lock, but that caused race conditions when connecting concurrently because the new shared buffer or data object instances were eventually created by two connections, such that one of them would end up in a de-facto disconnected state. This was fixed by https://github.com/orocos-toolchain/rtt/pull/283, which introduced the per-port connection locks.

The ConnectionManager actually plays a subordinate now with that patch: It still keeps track of the port's connection, but is not involved at all in the actual dataflow anymore. It is purely for management purposes.

Same issue here: Without that lock it can happen that one thread disconnects the port resulting in the destruction of a shared buffer (ConnInputEndPoint.hpp:91-95), while another thread creates a new connection. That also could lead to race conditions and a channel pipeline not consistent with the port connections stored in the ConnectionManager.

In general, my gut feeling is that the general design should be to connect/disconnect a channel from a port under the port lock, but wait until the lock is released to actually dismantle the channel. This would ensure that the large latency part (hitting the transports) is done without affecting the component.

I agree. Your patch looks good at first sight. I will do some more tests and check why it fails to build on Travis.