ros2 / rclpy

rclpy (ROS Client Library for Python)
Apache License 2.0
291 stars 224 forks source link

`rcl_shutdown` is not thread-safe when used in both `signal_handler` and `Context.__exit__`. #1352

Open YuanYuYuan opened 4 weeks ago

YuanYuYuan commented 4 weeks ago

Bug report

We found this sporadic failure with rclpy (rolling) due to the race condition while calling rcl_shutdown. In this issue, the conflict happens if rmw_shutdown is slow so that rcutils_atomic_store on Thread 1 is set after the check rcl_context_is_valid on Thread 2. Therefore rcl_shutdown would be called twice and cause an error.

Analysis

Required Info:

Barry-Xu-2018 commented 4 weeks ago

After checking code, there is a issue.

Thread 1
Before calling rcl_shutdown(), g_contexts_mutex is locked.

https://github.com/ros2/rclpy/blob/464a357a7d9e39a12aed966fa3ccb44d50bb6816/rclpy/src/rclpy/context.cpp#L42-L53

Thread 2

Before calling rcl_shutdown(), g_contexts_mutex isn't used.

https://github.com/ros2/rclpy/blob/464a357a7d9e39a12aed966fa3ccb44d50bb6816/rclpy/src/rclpy/context.cpp#L150-L165

If thread 2 executes rcl_shutdown first, thread 1 can still call rcl_shutdown again.
So rcl_shutdown should be safeguarded by g_contexts_mutex in Context::shutdown().

I create a fix https://github.com/ros2/rclpy/pull/1353. Could you test in your environment to see if the issue no longer occurs with this PR ?

YuanYuYuan commented 4 weeks ago

Hi @Barry-Xu-2018! Thanks for your attempt. But it seemed not to work...

Barry-Xu-2018 commented 4 weeks ago

My fixing only prevented simultaneous calls to rcl_shutdown(). Another problem is that repeated calls to shutdown() cause an exception.

I have updated fixing. Please try again.

Currently, rclpy is designed to throw an exception if rcl_shutdown() is called multiple times on the same context (there's a specific test case for this). So, the error is expected behavior. However, in your situation, it shouldn't throw an error, but rather ignore the second shutdown call.
I will discuss with other members on how to fix this problem.

YuanYuYuan commented 4 weeks ago

Hi @Barry-Xu-2018, I have confirmed your latest fix resolves the issue (no more duplicated rcl_shutdown). Thanks!