schornakj opened this issue 1 year ago
There's a gap in the stack trace between rclcpp::Executor::wait_for_work and rclcpp::detail::add_guard_condition_to_rcl_wait_set, so I searched through the rclcpp code to learn more, and I think I found the specific place in rclcpp::Executor::wait_for_work where this originates.
There are a few places where add_guard_condition_to_rcl_wait_set is used, but the one that seems relevant to the executor is this usage in AllocatorMemoryStrategy::add_handles_to_wait_set:
```cpp
for (auto guard_condition : guard_conditions_) {
  detail::add_guard_condition_to_rcl_wait_set(*wait_set, *guard_condition);
}
```
This would end up being called here in Executor::wait_for_work:
```cpp
if (!memory_strategy_->add_handles_to_wait_set(&wait_set_)) {
  throw std::runtime_error("Couldn't fill wait set");
}
```
In this case, if add_guard_condition_to_rcl_wait_set throws an exception, it doesn't get caught in any of the functions I linked, which means it propagates back up to the executor and results in the stack trace I originally posted.
shared_ptr wasn't used here, which is why this problem occurs. I don't know why it wasn't used; maybe there is a reason.
> shared_ptr wasn't used here. So this problem occurs.
I have an update in flight to the executors that should make the handling of the guard conditions more thread-safe. I would be interested if you can reproduce with this PR @schornakj https://github.com/ros2/rclcpp/pull/2142
> shared_ptr wasn't used here. So this problem occurs. I don't know why not use. Maybe there is some reason.
@Barry-Xu-2018 can you elaborate a bit here? I think that get_next_executable is protected by wait_mutex_ in MultiThreadedExecutor. Why (or how) would holding guard_conditions_ as shared pointers solve this problem?
@fujitatomoya
The problem is related to using rclcpp::GuardCondition * for the guard conditions.
The guard condition for a node is defined at
While executing the code below,
the node may be destroyed before add_handles_to_wait_set() is called. The rclcpp::GuardCondition * entries in guard_conditions_ are then invalid (dangling), which leads to an error when this function is called.
In the new executor structures (https://github.com/ros2/rclcpp/pull/2143/files), I have worked around this by making the node return a rclcpp::GuardCondition::SharedPtr, which makes it more consistent with the CallbackGroup API.
These guard conditions are all added to a single waitable and held as weak pointers, which are locked right before adding to the rcl_wait_set so that we can ensure they are still valid: https://github.com/ros2/rclcpp/blob/mjcarroll/executor_structures/rclcpp/src/rclcpp/executors/executor_notify_waitable.cpp#L46-L64
@mjcarroll @Barry-Xu-2018 thank you very much for the explanation 👍
> shared_ptr wasn't used here. So this problem occurs.
>
> I have an update in flight to the executors that should make the handling of the guard conditions more thread-safe. I would be interested if you can reproduce with this PR @schornakj #2142
Cool! That looks like a good set of changes to a very important part of the core ROS 2 functionality.
This will be somewhat tricky for me to replicate on my robots, though, since I'm currently using a specific version of Humble and I'd need to work out some sort of ad-hoc Rolling-based deployment.
I've been working on isolating this issue, and I think it's related to creating or destroying a Subscriber in one thread while waiting on an action result future in a different thread.
I'll try to create a reproducible example now that I have a clearer idea of what's going on.
> I'll try to create a reproducible example now that I have a clearer idea of what's going on.
Yeah, that would be fantastic. It would be good to have a test that shows this problem, and thus shows that @mjcarroll 's fixes actually fix it (hopefully).
Hi,
If I can help, I was doing a tutorial and found this problem. The error is reproducible, which may help solve the problem by providing a testbed for this issue.
https://github.com/fmrico/vector_transmission
In different terminals:

```shell
ros2 run rclcpp_components component_container
ros2 component load /ComponentManager vector_transmission vector_transmission::VectorProducer
ros2 component load /ComponentManager vector_transmission vector_transmission::VectorConsumer
```
The error always happens when unloading the Consumer:

```shell
ros2 component unload /ComponentManager 2
```
I hope this helps!!
@fmrico thank you for sharing that! I've been struggling to come up with a minimal reproducible example that's independent of my project, and your test helps narrow down the problem a lot.
I modified the VectorConsumer node in your example to remove everything except for a single subscriber and the crash still happens. I think that shows that the specific cause of this crash is destroying the subscriber.
Hey folks -- I just tried @fmrico's repo and reproduction steps on the latest Rolling binaries and could still reproduce the exception.
@sea-bass the latest rolling binaries aren't expected to fix this.
I tried @fmrico's steps (https://github.com/ros2/rclcpp/issues/2163#issuecomment-1583036356) to reproduce with the open PR https://github.com/ros2/rclcpp/pull/2142 from @mjcarroll, and this seems to work.
```
~/ros/r/rclcpp$ vcs status
..
=== ./src/rclcpp (git) ===
On branch mjcarroll/rclcpp_waitset_executor
Your branch is up to date with 'origin/mjcarroll/rclcpp_waitset_executor'.
nothing to commit, working tree clean
=== ./src/vector_transmission (git) ===
On branch main
Your branch is up to date with 'origin/main'.
```
I'm also running into this segfault on rclpy on Humble, by the way. This is when using the rosbridge_suite stack to advertise/unadvertise an action server. The stack trace is:
```
[rosbridge_websocket-2] ERROR:tornado.application:Exception in callback <function main.<locals>.<lambda> at 0x7f83e0230b80>
[rosbridge_websocket-2] Traceback (most recent call last):
[rosbridge_websocket-2]   File "/usr/lib/python3/dist-packages/tornado/ioloop.py", line 905, in _run
[rosbridge_websocket-2]     return self.callback()
[rosbridge_websocket-2]   File "/opt/underlay_ws/install/rosbridge_server/lib/rosbridge_server/rosbridge_websocket", line 340, in <lambda>
[rosbridge_websocket-2]     spin_callback = PeriodicCallback(lambda: executor.spin_once(timeout_sec=0.01), 1)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 705, in spin_once
[rosbridge_websocket-2]     handler, entity, node = self.wait_for_ready_callbacks(timeout_sec=timeout_sec)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 691, in wait_for_ready_callbacks
[rosbridge_websocket-2]     return next(self._cb_iter)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 585, in _wait_for_ready_callbacks
[rosbridge_websocket-2]     waitable.add_to_wait_set(wait_set)
[rosbridge_websocket-2]   File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/action/server.py", line 499, in add_to_wait_set
[rosbridge_websocket-2]     self._handle.add_to_waitset(wait_set)
[rosbridge_websocket-2] rclpy._rclpy_pybind11.RCLError: Failed to add 'rcl_action_server_t' to wait set: action server pointer is invalid, at ./src/rcl_action/action_server.c:929
```
This issue has been mentioned on ROS Discourse. There might be relevant details there:
Bug report

Required Info:
- 2023-01-27 sync snapshot

Steps to reproduce issue

Calling rclcpp_action::Client::async_get_result() after rclcpp_action::Client::async_send_goal() sometimes throws rclcpp::exceptions::RCLError with a "failed to add guard condition to wait set: guard condition implementation is invalid" message.

While I don't have a totally-atomic code sample that reproduces this issue independent of my project, the code in question is similar to this:
Expected behavior

If the goal_handle future returned from async_send_goal() is set (indicating that the goal request was accepted), then I should always be able to use the goal_handle to wait for the action result.

Actual behavior
For some small proportion of action goals (on the order of 1 in 100), an exception is thrown:
Additional information
Based on the stack trace, this looks like a problem at the intersection of the rclcpp action client and the MultiThreadedExecutor, which is a place where I've had issues roughly similar to this one in the past.