schornakj opened 3 years ago
I think the problem is that goal_handle hadn't been inserted yet while the feedback message came.
I think the problem is that goal_handle hadn't been inserted yet while the feedback message came.
So do I. This can be checked by changing the logging level to debug (--ros-args --log-level debug; see https://docs.ros.org/en/foxy/Tutorials/Logging-and-logger-configuration.html).
small number of CPU cores (for example, 2 cores for Travis and GH Actions runners)
Probably the server already sends back the feedback right after the goal_response message? It is expected that goal_response would be taken first on the client, but if the feedback message is already delivered before taking goal_response, the feedback message is the one taken first on the client.
How about at least taking goal_response before feedback on the client when an executable is available?
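A toy sketch of that ordering idea (illustrative only, not the actual rclcpp_action waitable; all names are invented): when the client takes data, drain any pending goal_response before any pending feedback, regardless of network arrival order.

#include <iostream>
#include <optional>
#include <queue>
#include <string>

// Both message kinds may already be buffered when the waitable is executed.
struct PendingActionData
{
  std::queue<std::string> goal_responses;
  std::queue<std::string> feedback;

  // Always prefer a goal_response over feedback, whatever arrived first.
  std::optional<std::string> take_next()
  {
    if (!goal_responses.empty()) {
      auto msg = goal_responses.front();
      goal_responses.pop();
      return msg;
    }
    if (!feedback.empty()) {
      auto msg = feedback.front();
      feedback.pop();
      return msg;
    }
    return std::nullopt;
  }
};

int main()
{
  PendingActionData pending;
  pending.feedback.push("feedback #1");         // delivered first by the network
  pending.goal_responses.push("goal_response");
  while (auto msg = pending.take_next()) {
    std::cout << *msg << "\n";  // prints goal_response, then feedback #1
  }
}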
How about at least taking goal_response before feedback on the client when an executable is available?
I think it'd be better.
I have a question: would the receiving order be inconsistent with the sending order? If it really receives a feedback message before the goal_response message, it'd still be an issue, right?
Edit: Recently I met a problem that was similar to my question. I sent a goal request and called spin_until_future_complete to spin, but the result_callback was called before spin_until_future_complete returned. I haven't opened an issue for it yet.
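For context, that pattern looks roughly like this (a minimal sketch against the public rclcpp_action API, using the Fibonacci action from this issue; node and action names are illustrative):

#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp_action/rclcpp_action.hpp"
#include "action_tutorials_interfaces/action/fibonacci.hpp"

using Fibonacci = action_tutorials_interfaces::action::Fibonacci;
using GoalHandle = rclcpp_action::ClientGoalHandle<Fibonacci>;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("fibonacci_client");
  auto client = rclcpp_action::create_client<Fibonacci>(node, "fibonacci");
  client->wait_for_action_server();

  Fibonacci::Goal goal;
  goal.order = 5;

  auto options = rclcpp_action::Client<Fibonacci>::SendGoalOptions();
  options.result_callback = [](const GoalHandle::WrappedResult & /*result*/) {
      // With the race described in this comment, this may run before
      // spin_until_future_complete() below has returned.
    };

  auto goal_handle_future = client->async_send_goal(goal, options);
  // Expectation: this completes (yielding the goal handle) before any
  // result callback fires.
  rclcpp::spin_until_future_complete(node, goal_handle_future);
  rclcpp::shutdown();
  return 0;
}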
If it really receives a feedback message before the goal_response message, it'd still be an issue, right?
At least the server sends goal_response before feedback, but I do not think that guarantees the same order on the client side.
I think the problem is that goal_handle hadn't been inserted yet while the feedback message came.
So do I. This can be checked by changing the logging level to debug (--ros-args --log-level debug; see https://docs.ros.org/en/foxy/Tutorials/Logging-and-logger-configuration.html).
If I set the logging level for the client node to DEBUG, I do see the "Received feedback for unknown goal. Ignoring..." error message. I also see a "Received status for unknown goal. Ignoring..." message -- what does that mean in the context of communication between the action client and server?
[INFO] [launch]: All log files can be found below /home/jschornak/.ros/log/2021-05-24-12-30-52-607989-jschornak-PickNik-ThinkPad-P15-3642104
[INFO] [launch]: Default logging verbosity is set to INFO
[INFO] [fibonacci_action_server-1]: process started with pid [3642106]
[INFO] [test_node-2]: process started with pid [3642108]
[fibonacci_action_server-1] 1621873852.674547 [0] fibonacci_: using network interface wlp0s20f3 (udp/192.168.0.4) selected arbitrarily from: wlp0s20f3, docker0
[test_node-2] [==========] Running 1 test from 1 test suite.
[test_node-2] [----------] Global test environment set-up.
[test_node-2] [----------] 1 test from TestMulticoreCI
[test_node-2] [ RUN ] TestMulticoreCI.require_action_feedback
[test_node-2] 1621873852.678737 [0] test_fixtu: using network interface wlp0s20f3 (udp/192.168.0.4) selected arbitrarily from: wlp0s20f3, docker0
[test_node-2] [INFO] [1621873852.683964386] [test_node]: Setting severity threshold to DEBUG
[fibonacci_action_server-1] [INFO] [1621873855.683938429] [fibonacci_action_server]: Received goal request with order 5
[fibonacci_action_server-1] [INFO] [1621873855.684177294] [fibonacci_action_server]: Executing goal
[fibonacci_action_server-1] [INFO] [1621873855.684247222] [fibonacci_action_server]: Publish feedback
[test_node-2] [DEBUG] [1621873855.684374273] [test_node.rclcpp_action]: Received feedback for unknown goal. Ignoring...
[test_node-2] [DEBUG] [1621873855.684404149] [test_node.rclcpp_action]: Received status for unknown goal. Ignoring...
[fibonacci_action_server-1] [INFO] [1621873856.684338523] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621873856.684414795] [test_node]: Have 1 state feedback messages
[fibonacci_action_server-1] [INFO] [1621873857.684864388] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621873857.684915198] [test_node]: Have 2 state feedback messages
[fibonacci_action_server-1] [INFO] [1621873858.684417924] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621873858.684452759] [test_node]: Have 3 state feedback messages
[fibonacci_action_server-1] [INFO] [1621873859.685108832] [fibonacci_action_server]: Goal succeeded
[test_node-2] [DEBUG] [1621873859.685558882] [test_node.rclcpp_action]: Received status for unknown goal. Ignoring...
[test_node-2] [ERROR] [1621873865.685394212] [test_node]: Timed out waiting for feedback vector to be populated
[test_node-2] terminate called without an active exception
[test_node-2] /home/jschornak/workspaces/test_ws/src/rclcpp_multicore_ci_test/src/test_fixture_node.cpp:114: Failure
[test_node-2] Value of: node->testActionFeebackCount()
[test_node-2] Actual: false
[test_node-2] Expected: true
[ERROR] [test_node-2]: process has died [pid 3642108, exit code -6, cmd 'build/rclcpp_multicore_ci_test/test_fixture_node'].
[INFO] [fibonacci_action_server-1]: sending signal 'SIGINT' to process[fibonacci_action_server-1]
[fibonacci_action_server-1] [INFO] [1621873865.837863207] [rclcpp]: signal_handler(signal_value=2)
[INFO] [fibonacci_action_server-1]: process has finished cleanly [pid 3642106]
@schornakj thanks for checking on this. That means the client takes feedback and status messages before goal_response. Is it possible to try https://github.com/ros2/rclcpp/pull/1684 to see if you can still meet the issue?
@fujitatomoya I need to set up a workspace to build the ROS2 main branches from source, but once I can do that I'll test your branch.
@fujitatomoya Sorry, I've gotten tied up in other work and ran into some problems building ROS2 from source, so I want to make sure I'm not a blocker to resolving this issue.
If anyone who can conveniently build the branch with the fix is able to try out my minimal example package and see if the tests succeed, I'm happy to use your results to see if this issue has been fixed.
@schornakj thanks for checking on this, and no pressure.
BTW, I tried to reproduce the issue in my environment (AMD Ryzen 9 3900X 12-Core Processor) with a source build, but so far I am unable to reproduce the problem.
root@134f29e8f25f:~/ros2_ws/colcon_ws# python3 -u src/ament/ament_cmake/ament_cmake_test/cmake/run_test.py /tmp/asd --package-name rclcpp_multicore_ci_test --env TEST_LAUNCH_DIR=src/ros2/rclcpp_multicore_ci_test/test TEST_EXECUTABLE=build/rclcpp_multicore_ci_test/test_fixture_node --command src/ros2/rclcpp_multicore_ci_test/test/test_nodes.py
...
[test_node-2] [INFO] [1622773201.325687275] [test_node]: Have 1 state feedback messages
[fibonacci_action_server-1] [INFO] [1622773202.325839068] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1622773202.326052630] [test_node]: Have 2 state feedback messages
[fibonacci_action_server-1] [INFO] [1622773203.325802212] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1622773203.326088555] [test_node]: Have 3 state feedback messages
[fibonacci_action_server-1] [INFO] [1622773204.325817075] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1622773204.326089248] [test_node]: Have 4 state feedback messages
[fibonacci_action_server-1] [INFO] [1622773205.326234247] [fibonacci_action_server]: Goal succeeded
[test_node-2] [ OK ] TestMulticoreCI.require_action_feedback (7217 ms)
[test_node-2] [----------] 1 test from TestMulticoreCI (7217 ms total)
[test_node-2]
[test_node-2] [----------] Global test environment tear-down
[test_node-2] [==========] 1 test from 1 test suite ran. (7217 ms total)
[test_node-2] [ PASSED ] 1 test.
@fujitatomoya What happens if you limit the number of cores you use to run the test? I did this by prepending taskset -c 0-1 when running the test to constrain it to only 2 cores. Based on your comment this command should be:
taskset -c 0-1 python3 -u src/ament/ament_cmake/ament_cmake_test/cmake/run_test.py /tmp/asd --package-name rclcpp_multicore_ci_test --env TEST_LAUNCH_DIR=src/ros2/rclcpp_multicore_ci_test/test TEST_EXECUTABLE=build/rclcpp_multicore_ci_test/test_fixture_node --command src/ros2/rclcpp_multicore_ci_test/test/test_nodes.py
In general I've encountered fewer test failures when using more cores, which I think is because there's a lower chance of the feedback callback handling being blocked. I also see few (if any) failures when using exactly one core, which I think is because the callbacks are always handled in the right order in that case.
It's a very interesting issue.
Even if the action server's rcl_action_send_goal_response is called inside rclcpp_action before the user calls goal_handle->publish_feedback(feedback); in the application, there are three possible cases for which packet's data arrives first.
case 1 [normal case, expectation]: the goal_response data arrives, the executor calls ClientBase::handle_goal_response, and then the feedback data comes after.
case 2: the goal_response and feedback data are both already stored in the DDS buffer (the socket layer is faster than the ROS 2 executor) when the executor deals with this waitable entity; the bug happens here, and the PR from @fujitatomoya can solve this case.
case 3: ROS 2 can't guarantee which packet's data arrives first. As we know, UDP is "unordered": there is no GUARANTEE that when sending packet A and then packet B to a client, the client will see packet A first and then packet B. It may see B first, then A. The feedback data arrives and is stored in the DDS buffer (at this time, the goal_response has not arrived), the executor still deals with this waitable entity and then calls ClientBase::handle_feedback_message, so the bug still happens.
I can't find an easy way (using the existing variables) to resolve it. The following steps might be a possible solution:
1. Add a default rclcpp::Waitable::can_take_data() { return true; } to the base class.
2. Add a flag in rclcpp_action::ClientBase (updated in ClientBase::take_data) and override rclcpp_action::ClientBase::can_take_data() { // use the flag }.
3. In the executor, check if any_executable.waitable->can_take_data() before any_executable.data = any_executable.waitable->take_data();.
Mixing multiple entities into a Waitable object while the business logic cares about the processing order of those entities -- that's the problem.
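A sketch of what those steps could look like (hypothetical API, not the actual rclcpp/rclcpp_action classes; the flag handling is simplified):

#include <memory>
#include <vector>

// Step 1: a base-class hook that defaults to "data may always be taken".
class Waitable
{
public:
  virtual ~Waitable() = default;
  virtual bool can_take_data() { return true; }
  virtual std::shared_ptr<void> take_data() = 0;
};

// Step 2: the action-client waitable overrides the hook with a flag that is
// updated inside take_data().
class ActionClientWaitable : public Waitable
{
public:
  bool can_take_data() override
  {
    // Feedback may only be taken once the goal response has been handled.
    return goal_response_taken_ || goal_response_ready_;
  }

  std::shared_ptr<void> take_data() override
  {
    if (goal_response_ready_) {
      goal_response_ready_ = false;
      goal_response_taken_ = true;
      // ... take and return the goal_response data here ...
    }
    // ... otherwise take and return feedback/status data here ...
    return nullptr;
  }

  bool goal_response_ready_ = false;
  bool goal_response_taken_ = false;
};

// Step 3: the executor checks the hook before taking data.
void process_ready_waitables(std::vector<std::shared_ptr<Waitable>> & ready)
{
  for (auto & waitable : ready) {
    if (!waitable->can_take_data()) {
      continue;  // leave the data buffered; retry on a later iteration
    }
    auto data = waitable->take_data();
    // ... execute the waitable with data ...
  }
}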
@iuhilnehc-ynos thanks for checking on this. Can you make this problem happen? So far I just cannot see this problem, but in theory there are a few race conditions we could have, as you mentioned.
@fujitatomoya
Can you make this problem happen?
No, I can't. (I was just thinking in theory; sorry for the noise.)
I think the main question is: should every feedback_message be received safely by the client? If not, perhaps this is not a real problem. :flushed:
I think the main question is: should every feedback_message be received safely by the client? If not, perhaps this is not a real problem. :flushed:
If I publish a message I expect it to be successfully received on the other end, rather than being mysteriously consumed by the client waitable handling implementation.
In my project I'm intermittently publishing action feedback messages to give info about the state of action execution, and it's important that the entities monitoring those messages on the client side don't miss any feedback updates so they still get an accurate depiction of the action status. It's possible to refactor this to not use action feedback, but I think it's more important to make sure that these aspects of the action client actually do always work as expected.
but I think it's more important to make sure that these aspects of the action client actually do always work as expected.
@fujitatomoya Is there any design document to clarify the expected behaviors of the feedback_message?
static const rmw_qos_profile_t rmw_qos_profile_default =
{
  RMW_QOS_POLICY_HISTORY_KEEP_LAST,           // history
  10,                                         // depth
  RMW_QOS_POLICY_RELIABILITY_RELIABLE,        // reliability
  RMW_QOS_POLICY_DURABILITY_VOLATILE,         // durability: late joiners get no old samples
  RMW_QOS_DEADLINE_DEFAULT,                   // deadline
  RMW_QOS_LIFESPAN_DEFAULT,                   // lifespan
  RMW_QOS_POLICY_LIVELINESS_SYSTEM_DEFAULT,   // liveliness
  RMW_QOS_LIVELINESS_LEASE_DURATION_DEFAULT,  // liveliness lease duration
  false                                       // avoid_ros_namespace_conventions
};
It is currently using rmw_qos_profile_default as the QoS of the feedback_message. So with #1684, I think in most cases it would work as you expected.
@iuhilnehc-ynos's analysis (https://github.com/ros2/rclcpp/issues/1679#issuecomment-854401456) is pretty good, and case 3 is complicated. In that case, you might miss the first few feedback messages or the last few, or receive a few out-of-order feedback messages. I don't know the probability of its appearance, but I have not met this issue yet.
Is there any design document to clarify the expected behaviors of the feedback_message?
No, I do not think so.
@ivanpauno @wjwwood would you share your thoughts on this?
There are some examples here:
https://design.ros2.org/articles/actions.html#clientserver-interaction-examples
But all of them indicate feedback is sent after the result request is received. I wonder if the intention is that you should not send feedback for a goal until the result request is received. This would ensure that feedback is never sent before the client has received the "goal requested" response.
CC @jacobperron, he might have more insight.
I think the problem is that goal_handle hadn't been inserted yet while the feedback message came.
Maybe the GoalHandle should be inserted when sending the goal request and not when receiving the goal response to avoid the race. I'm not completely sure if that's an okay solution or not.
Maybe the GoalHandle should be inserted when sending the goal request and not when receiving the goal response to avoid the race.
We should be careful with this, since I think the intention is to only track "accepted" goal handles. IIRC, if a goal is rejected, then we don't even create a GoalHandle object.
There are some examples here:
https://design.ros2.org/articles/actions.html#clientserver-interaction-examples
But all of them indicate feedback is sent after the result request is received.
I don't think we need to have sent a result request to get feedback.
As described earlier in this thread, I'm pretty sure the issue is the race between getting the GoalResponse from the action server and the first feedback message(s). Since we don't create a goal handle until after receiving a GoalResponse, we can't update any feedback for it prior.
In my mind, I never considered losing a couple feedback messages to be a problem. But I can see how this may depend on the application.
A possible workaround would be to delay the publish of feedback messages from the action server until we can confirm that the goal response was received (I don't know if this is technically possible).
As @ivanpauno proposes, we could try to track goals before getting a GoalResponse, though I think this was originally avoided to keep the implementation "simple". I think the code architecture would have to change significantly to handle this case, though I haven't investigated.
I don't think we need to have sent a result request to get feedback.
As described earlier in this thread, I'm pretty sure the issue is the race between getting the GoalResponse from the action server and the first feedback message(s).
Ah, I see, there's no requirement that the client request a response immediately. I was just assuming they would, but that isn't the case.
A possible workaround would be to delay the publish of feedback messages from the action server until we can confirm that the goal response was received (I don't know if this is technically possible).
I was thinking that if you ensure the client immediately requests a response then that could tell you for sure the client has received the goal response, but that might not be a good solution if the client needs to be doing other things in the meantime.
As @ivanpauno proposes, we could try to track goals before getting a GoalResponse, though I think this was originally avoided to keep the implementation "simple". I think the code architecture would have to change significantly to handle this case, though I haven't investigated.
Right, this gets into queuing and figuring out how many feedback messages to store per goal and how long to keep them before deciding that the feedback for that goal should be discarded because the GoalResponse is never coming (either the server died or the service response was otherwise lost).
I was thinking that if you ensure the client immediately requests a response then that could tell you for sure the client has received the goal response, but that might not be a good solution if the client needs to be doing other things in the meantime.
I think you're suggesting that we use the result request as a signal that it's safe to start sending feedback. I think this would work, but assumes that the action client will send a result request right away. I'm pretty sure if a result callback is registered, then a result request is sent right away. Though sending a result request right away is probably the most common use case, it's possible for the action client to defer sending the result request until much later or not at all (e.g. I think it's possible to send a goal, get feedback, and just ignore the result).
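To illustrate, continuing the minimal client sketch from earlier in this thread (and relying on the behavior described in this comment, namely that a registered result callback triggers an immediate result request):

auto options = rclcpp_action::Client<Fibonacci>::SendGoalOptions();
// Per the comment above: because a result_callback is registered, the client
// sends the result request right away once the goal is accepted.
options.result_callback = [](const GoalHandle::WrappedResult & /*result*/) {};
// Feedback delivery does not depend on the result request, so a client can
// consume feedback without ever requesting the result -- which is why the
// result request can't serve as a universal "safe to send feedback" signal.
options.feedback_callback =
  [](GoalHandle::SharedPtr,
     const std::shared_ptr<const Fibonacci::Feedback> /*feedback*/) {};
auto future_goal_handle = client->async_send_goal(goal, options);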
In my mind, I never considered losing a couple feedback messages to be a problem.
Neither did I.
Right, this gets into queuing and figuring out how many feedback messages to store per goal and how long to keep them before deciding that the feedback for that goal should be discarded because the GoalResponse is never coming (either the server died or the service response was otherwise lost).
totally agree.
At least, I think https://github.com/ros2/rclcpp/pull/1684 should be applied as an improvement (though not a guarantee), and we would probably want to reconsider the order in which data is taken on the client.
Really, we need to answer: should users be able to get guaranteed delivery of all feedback messages?
If the answer is yes, then I think a reasonable solution is to queue feedback messages (and maybe status messages) on the action client until it receives a goal response.
Queuing could result in a burst of feedback messages if there's a delay in receiving the goal response message or increased memory usage if the goal response is lost. I think the feedback queue size should be configurable by the user so that they have the option to keep the current behaviour (queue size 0) or tune the size to their application.
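A minimal sketch of that queueing idea (a hypothetical helper, not existing rclcpp_action API; the class and callback types are invented for illustration):

#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Buffers feedback that arrives before the goal response, bounded by a
// user-configurable size. A size of 0 keeps the current behavior (drop).
template<typename FeedbackT>
class FeedbackBuffer
{
public:
  explicit FeedbackBuffer(std::size_t max_size)
  : max_size_(max_size) {}

  // Called for feedback received before the goal handle exists.
  void buffer(FeedbackT msg)
  {
    if (max_size_ == 0) {
      return;  // queue size 0: drop, as today
    }
    if (queue_.size() == max_size_) {
      queue_.pop_front();  // bound memory by discarding the oldest message
    }
    queue_.push_back(std::move(msg));
  }

  // Called once the goal response arrives and the goal handle is created;
  // this is where the burst of queued feedback would be delivered.
  void flush(const std::function<void(const FeedbackT &)> & feedback_cb)
  {
    for (const auto & msg : queue_) {
      feedback_cb(msg);
    }
    queue_.clear();
  }

private:
  std::size_t max_size_;
  std::deque<FeedbackT> queue_;
};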
If the answer is yes, then I think a reasonable solution is to queue feedback messages (and maybe status messages) on the action client until it receives a goal response.
What if this is a race between getting the GoalResult from the action server and the last feedback message(s)? Perhaps it's an unsolvable issue caused by unordered messages.
What if this is a race between getting the GoalResult from the action server and the last feedback message(s)? Perhaps it's an unsolvable issue caused by unordered messages.
For these kinds of things, I would like to have a way to indicate that order has to be respected (i.e. that the action server's response to the goal request and the feedback topic writes should maintain order on the receiver side).
This is possible in DDS by using the same DDS Publisher for both data writers (the goal request response writer and the feedback writer) and using the GROUP PRESENTATION QoS (https://community.rti.com/static/documentation/connext-dds/5.2.0/doc/manuals/connext_dds/html_files/RTI_ConnextDDS_CoreLibraries_UsersManual/Content/UsersManual/PRESENTATION_QosPolicy.htm).
Having an equivalent in the ROS 2 API would be nice, but it isn't obvious. "ROS Publishers" are sort of equivalent to "DDS Data Writers", and we generally don't control the "DDS Publishers" the rmw implementation uses (most of them use only one for all the data writers now; I think some used to create a publisher per data writer).
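Roughly, at the DDS layer that would look like the following (a sketch in the classic DDS C++ API; constant and type names follow the OMG spec and vary by vendor, the participant and topics are assumed to already exist, and the subscriber side must request a compatible PRESENTATION QoS for the ordering to take effect):

// One DDS Publisher with GROUP access scope and ordered access; readers then
// observe writes from all of its data writers in the order they were made.
DDS::PublisherQos pub_qos;
participant->get_default_publisher_qos(pub_qos);
pub_qos.presentation.access_scope = DDS::GROUP_PRESENTATION_QOS;
pub_qos.presentation.ordered_access = true;

DDS::Publisher * publisher =
  participant->create_publisher(pub_qos, nullptr, DDS::STATUS_MASK_NONE);

// Create both writers from the same publisher so the goal-response write and
// the subsequent feedback writes keep their relative order on the reader side.
DDS::DataWriter * goal_response_writer = publisher->create_datawriter(
  goal_response_topic, DDS::DATAWRITER_QOS_DEFAULT, nullptr, DDS::STATUS_MASK_NONE);
DDS::DataWriter * feedback_writer = publisher->create_datawriter(
  feedback_topic, DDS::DATAWRITER_QOS_DEFAULT, nullptr, DDS::STATUS_MASK_NONE);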
Hi, I don't have a repeatable example for this, but want to highlight that I've seen this issue in an rclpy executor as well (using Humble and Cyclone DDS):
EDIT: I now believe this to have been caused by a bug in our rclpy code where the node was being spun in multiple threads by mistake - it can probably be disregarded for the purposes of this conversation, but I'll leave the comment for posterity.
Exception in thread Thread-1 (spin):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/atlassian/pipelines/agent/build/install/test_control/lib/python3.10/site-packages/test_control/test_robotgui.py", line 1334, in spin
self.executor.spin()
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 279, in spin
self.spin_once()
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 764, in spin_once
self._spin_once_impl(timeout_sec)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 761, in _spin_once_impl
future.result()
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 94, in result
raise self.exception()
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
self._handler.send(None)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 418, in handler
await call_coroutine(entity, arg)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 372, in _execute_service
response = await await_or_execute(srv.callback, request, srv.srv_type.Response())
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 107, in await_or_execute
return callback(*args)
File "/opt/atlassian/pipelines/agent/build/install/test_control/lib/python3.10/site-packages/test_control/test_robotgui.py", line 627, in move_to_testing_pose_service_callback
self.move_robot_to_point(vel_scaling=vel_scaling,
File "/opt/atlassian/pipelines/agent/build/install/test_control/lib/python3.10/site-packages/test_control/test_robotgui.py", line 888, in move_robot_to_point
rclpy.spin_once(self.action_client_node)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/__init__.py", line 202, in spin_once
executor.spin_once(timeout_sec=timeout_sec)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 705, in spin_once
handler, entity, node = self.wait_for_ready_callbacks(timeout_sec=timeout_sec)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 691, in wait_for_ready_callbacks
return next(self._cb_iter)
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 610, in _wait_for_ready_callbacks
File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/action/client.py", line 231, in is_ready
ready_entities = self._client_handle.is_ready(wait_set)
rclpy._rclpy_pybind11.RCLError: Failed to get number of ready entities for action client: wait set index for feedback subscription is out of bounds, at ./src/rcl_action/action_client.c:619
It happened while waiting for the status (also in CI, but on Bitbucket Pipelines, which I've read uses 4 cores instead of 2, though I haven't confirmed). I'll keep an eye out for future occurrences... hopefully it's contained to CI.
Bug report
Required Info:
Expected behavior
My packages' integration tests should run with the same behavior both locally and in CI.
Actual behavior
When running in an environment with a comparatively small number of CPU cores (for example, 2 cores for Travis and GH Actions runners), my tests sometimes fail because action feedback that I require to pass the test was never received by my action client.
To test this, I'm using the fibonacci_action_server node from the action_tutorials_cpp package, which includes an action server of type action_tutorials_interfaces::action::Fibonacci.
My test node sends a goal to the Fibonacci action with order = 5. I expect the action server to publish 4 feedback messages. In the log output from my tests, I see that the server does publish 4 feedback messages, but the feedback callback registered with my action client is only called for the first 3 messages. Since my test needs to receive 4 feedback messages to succeed, it times out waiting for the 4th message and fails.
I think issues similar to this are causing problems in a variety of ROS2 packages, such as during this PR for MoveIt2.
Steps to reproduce issue
I made a minimal package to reproduce this issue: https://github.com/schornakj/rclcpp_multicore_ci_test/
This mirrors how similar integration tests run in my CI pipelines.
More detailed instructions are in the repo's README.
Additional information
My dev workstation is a Lenovo ThinkPad P15 with an Intel i7-10750H CPU, which has 6 cores (12 threads), and 16 GB RAM.
Source code for my test executables is below (and also in the repo I linked previously):
test_fixture_node.hpp
test_fixture_node.cpp
test_nodes.py