ros2 / rclcpp

rclcpp (ROS Client Library for C++)
Apache License 2.0
514 stars 410 forks source link

fix: Fixed race condition in action server between is_ready and take"… #2531

Closed jmachowinski closed 5 days ago

jmachowinski commented 1 month ago

… (#2495)

Some background information: is_ready, take_data and execute data may be called from different threads in any order. The code in the old state expected them to be called in series, without interruption. This lead to multiple race conditions, as the state of the pimpl objects was altered by the three functions in a non thread safe way.

This is a clean backport of #2495. This should superseed https://github.com/ros2/rclcpp/pull/2530

Note, this patch is not tested.

@firesufer can you try out this patch ?

firesurfer commented 1 month ago

@jmachowinski I just tested it on my local setup and it worked fine so far. I will deploy it on our machine and checkout if it fixes the race condition issue.

jmachowinski commented 1 month ago

Hm, we run into a problem with this backport, as https://github.com/ros2/rclcpp/pull/2109 is one of the causes of this bug, and can not be reverted without API break.

As a workaround, I would propose to remove the exception, if take_data is called multiple times, and make execute() handle this case as well. Before going forward with this, I would like to hear you opinions on this @clalancette @wjwwood @mjcarroll

firesurfer commented 1 month ago

@jmachowinski Apparently this back port also does not fix the issue completely.

1715775009.7834265 [ros2_control_node-14] terminate called after throwing an instance of 'std::runtime_error' 1715775009.7838161 [ros2_control_node-14] what(): ServerBase::take_data() called but no data is ready 1715775009.7845201 [ros2_control_node-14] Stack trace (most recent call last) in thread 2061428: 1715775009.7851567 [ros2_control_node-14] #14 Object "", at 0xffffffffffffffff, in 1715775009.7882259 [ros2_control_node-14] #13 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f8392da484f, in 1715775009.7885315 [ros2_control_node-14] #12 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f8392d12ac2, in 1715775009.7904732 [ros2_control_node-14] #11 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f8392fa5252, in 1715775009.7919304 [ros2_control_node-14] #10 Object "/home/user/workspace/install/rclcpp/lib/librclcpp.so", at 0x7f83932b34d9, in rclcpp::executors::MultiThreadedExecutor::run(unsigned long) 1715775009.7929633 [ros2_control_node-14] #9 Object "/home/user/workspace/install/rclcpp/lib/librclcpp.so", at 0x7f83932a0f8c, in rclcpp::Executor::get_next_executable(rclcpp::AnyExecutable&, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) 1715775009.7959473 [ros2_control_node-14] #8 Object "/home/user/workspace/install/rclcpp/lib/librclcpp.so", at 0x7f8393299066, in rclcpp::Executor::get_next_ready_executable_from_map(rclcpp::AnyExecutable&, std::map<std::weak_ptr, std::weak_ptr, std::owner_less<std::weak_ptr >, std::allocator<std::pair<std::weak_ptr const, std::weak_ptr > > > const&) 1715775009.7967491 [ros2_control_node-14] #7 Object "/home/user/workspace/install/rclcpp_action/lib/librclcpp_action.so", at 0x7f8384326569, in rclcpp_action::ServerBase::take_data() [clone .cold] 1715775009.7971637 [ros2_control_node-14] #6 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f8392f774d7, in __cxa_throw 1715775009.7974977 [ros2_control_node-14] #5 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f8392f77276, in std::terminate() 1715775009.7977948 [ros2_control_node-14] #4 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f8392f7720b, in 1715775009.7980995 [ros2_control_node-14] #3 Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30", at 0x7f8392f6bb9d, in 1715775009.7983689 [ros2_control_node-14] #2 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f8392ca67f2, in abort 1715775009.7984428 [ros2_control_node-14] #1 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f8392cc0475, in raise 1715775009.7987077 [ros2_control_node-14] #0 Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7f8392d149fc, in pthread_kill 1715775009.7987759 [ros2_control_node-14] Aborted (Signal sent by tkill() 2061064 1000)

jmachowinski commented 1 month ago

@jmachowinski Apparently this back port also does not fix the issue completely.

Yes, this is the problem that I was talking about above.

mjcarroll commented 1 month ago

As a workaround, I would propose to remove the exception, if take_data is called multiple times, and make execute() handle this case as well.

Only for the backport or for rolling and jazzy as well?

jmachowinski commented 1 month ago

As a workaround, I would propose to remove the exception, if take_data is called multiple times, and make execute() handle this case as well.

Only for the backport or for rolling and jazzy as well?

Only for backport to Iron, for humble this patch should work as it is right now.

firesurfer commented 2 weeks ago

Whats the state with this PR? @mjcarroll @clalancette This bug is an absolute showstopper for using actions in any kind of industrial application and the fix should definitely be back ported from my point of view.

jmachowinski commented 1 week ago

@firesurfer I updated the patch, to ignore the double calls to take_data, can you test it ?

@clalancette @mjcarroll @wjwwood As this PR has been stale for some time without any response from you the attendees of the Client Working Group Meeting decided, that we will go for just ignoring the double calls to take_data and go on with this PR.

firesurfer commented 1 week ago

@jmachowinski Thanks for the patch. I'm rolling it out to our machine right now and will give you feedback this week.

jmachowinski commented 1 week ago

@firesurfer one second, I missed a part....

jmachowinski commented 1 week ago

Fixed, @firesurfer sorry for the confusion, now the patch should be good to go

firesurfer commented 1 week ago

@jmachowinski I guess the issue is fixed on the server side in rclcpp with this fix. On the client side with rclpy we still get a related error. But as I said my guess is that this is not related wit hthe action server side (ros2control - jtc) in this case.

[WARN] [1719316017.260343581] [.action_client]: Ignoring unexpected goal response. There may be more than one action server for the action '/follow_joint_trajectory'

EDIT: It seems this is also a race condition but in the action implementation in rclpy: https://github.com/ros2/rclpy/issues/1123

jmachowinski commented 1 week ago

@alsora @fujitatomoya I guess we are good to go. Can one of you run the CI ?

alsora commented 1 week ago
jmachowinski commented 1 week ago

@alsora I think you started the CI for the wrong OS / Distro...

fujitatomoya commented 1 week ago

I believe that this can keep the ABI compatibility but applying ABI compliance checker would be nice.

alsora commented 1 week ago

The builds failed due to a strange error in rmw_connextdds_common... All the 380 previous packages built fine.. https://ci.ros2.org/job/ci_linux/21319/consoleFull

The gist was created from https://github.com/ros2/ros2/blob/iron/ros2.repos and it seems correct to me.

jmachowinski commented 6 days ago

The gist was created from https://github.com/ros2/ros2/blob/iron/ros2.repos and it seems correct to me.

Here (https://ci.ros2.org/job/ci_linux/21319/) it says: ubuntu_distro: noble, el_release: 9, ros_distro: rolling,

Therefore I assumed the CI is off, as ubuntu should be Jammy and ros_disto: Iron...

alsora commented 6 days ago

New CI, this should be using Ubuntu Jammy and ROS 2 iron as a base

jmachowinski commented 5 days ago

@alsora @fujitatomoya merge ?

clalancette commented 5 days ago

RHEL shouldn't be failing, but it looks like an infrastructure issue. I'll kick it off again.

fujitatomoya commented 5 days ago

@alsora CI is green, i am okay to merge this. can you merge this with your lgtm?

alsora commented 5 days ago

Yes, the changes look good to me