schornakj commented 3 years ago

Bug report

Required Info:

Operating System:
- Ubuntu 20.04 LTS
Installation type:
- Binaries
Version or commit hash:
- Foxy, Rolling
DDS implementation:
- CycloneDDS
Client library (if applicable):
- rclcpp

Expected behavior

My packages' integration tests should run with the same behavior both locally and in CI.

Actual behavior

When running in an environment with a comparatively small number of CPU cores (for example, 2 cores for Travis and GH Actions runners), my tests sometimes fail because action feedback that I require to pass the test was never received by my action client.

To test this, I'm using the fibonacci_action_server node from the action_tutorials_cpp package, which includes an action server of type action_tutorials_interfaces::action::Fibonacci.

My test node sends a goal to the Fibonacci action with order = 5. I expect the action server to publish 4 feedback messages. In the log output from my tests, I see that the server does publish 4 feedback messages, but the feedback callback registered with my action client is only called for the first 3 messages. Since my test needs to receive 4 feedback messages to succeed, it times out waiting for the 4th message and fails.

taskset -c 0-1 python3 -u /opt/ros/rolling/share/ament_cmake_test/cmake/run_test.py /tmp/asd \
  --package-name rclcpp_multicore_ci_test \
  --env TEST_LAUNCH_DIR=src/rclcpp_multicore_ci_test/test TEST_EXECUTABLE=build/rclcpp_multicore_ci_test/test_fixture_node \
  --command src/rclcpp_multicore_ci_test/test/test_nodes.py

[INFO] [fibonacci_action_server-1]: process started with pid [2053673]
[INFO] [test_node-2]: process started with pid [2053675]
[test_node-2] [==========] Running 1 test from 1 test suite.
[test_node-2] [----------] Global test environment set-up.
[test_node-2] [----------] 1 test from TaskSequenceInterface
[test_node-2] [ RUN      ] TaskSequenceInterface.setup_plan_exec_through_ros
[fibonacci_action_server-1] 1621445668.874821 [0] fibonacci_: using network interface wlp0s20f3 (udp/192.168.0.4) selected arbitrarily from: wlp0s20f3, docker0
[test_node-2] 1621445668.875303 [0] test_fixtu: using network interface wlp0s20f3 (udp/192.168.0.4) selected arbitrarily from: wlp0s20f3, docker0
[fibonacci_action_server-1] [INFO] [1621445668.891735102] [fibonacci_action_server]: Received goal request with order 5
[fibonacci_action_server-1] [INFO] [1621445668.892005547] [fibonacci_action_server]: Executing goal
[fibonacci_action_server-1] [INFO] [1621445668.892090501] [fibonacci_action_server]: Publish feedback
[fibonacci_action_server-1] [INFO] [1621445669.892521092] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621445669.892611049] [test_node]: Have 1 state feedback messages
[fibonacci_action_server-1] [INFO] [1621445670.892371325] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621445670.892701509] [test_node]: Have 2 state feedback messages
[fibonacci_action_server-1] [INFO] [1621445671.892378535] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621445671.892631492] [test_node]: Have 3 state feedback messages
[fibonacci_action_server-1] [INFO] [1621445672.892589940] [fibonacci_action_server]: Goal succeeded
[test_node-2] [ERROR] [1621445678.893078911] [test_node]: Timed out waiting for feedback vector to be populated
[test_node-2] terminate called without an active exception
[test_node-2] /home/jschornak/workspaces/test_ws/src/rclcpp_multicore_ci_test/src/test_fixture_node.cpp:99: Failure
[test_node-2] Value of: node->testActionFeebackCount()
[test_node-2]   Actual: false
[test_node-2] Expected: true
[ERROR] [test_node-2]: process has died [pid 2053675, exit code -6, cmd 'build/rclcpp_multicore_ci_test/test_fixture_node'].
[INFO] [fibonacci_action_server-1]: sending signal 'SIGINT' to process[fibonacci_action_server-1]
[fibonacci_action_server-1] [INFO] [1621445679.071037031] [rclcpp]: signal_handler(signal_value=2)
[INFO] [fibonacci_action_server-1]: process has finished cleanly [pid 2053673]

I think issues similar to this are causing problems in a variety of ROS2 packages, such as during this PR for MoveIt2.

Steps to reproduce issue

I made a minimal package to reproduce this issue: https://github.com/schornakj/rclcpp_multicore_ci_test/

This mirrors how similar integration tests run in my CI pipelines.

More detailed instructions are in the repo's README.

Additional information

My dev workstation is a Lenovo Thinkpad P15 with an Intel i7-10750H CPU, which has 12 cores, and 16 GB RAM.

Source code for my test executables is below (and also in the repo I linked previously):

test_fixture_node.hpp

#pragma once

#include <rclcpp/rclcpp.hpp>
#include <rclcpp_action/rclcpp_action.hpp>
#include <action_tutorials_interfaces/action/fibonacci.hpp>

namespace rclcpp_multicore_ci_test
{
using action_tutorials_interfaces::action::Fibonacci;
using GoalHandleFibonacci = rclcpp_action::ClientGoalHandle<Fibonacci>;

class TestFixtureNode : public rclcpp::Node
{
public:
  TestFixtureNode();

  bool testActionFeebackCount();

private:
  rclcpp::CallbackGroup::SharedPtr cb_group_me_;
  rclcpp_action::Client<Fibonacci>::SharedPtr fib_client_;
};
}  // namespace rclcpp_multicore_ci_test

test_fixture_node.cpp

#include <rclcpp_thread_test/test_fixture_node.hpp>

#include <gtest/gtest.h>

static const std::size_t ORDER = 5;

namespace rclcpp_multicore_ci_test
{

TestFixtureNode::TestFixtureNode()
  : Node("test_node")
  , cb_group_me_(create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive))
  , fib_client_(rclcpp_action::create_client<Fibonacci>(this, "/fibonacci", cb_group_me_))
{
}

bool TestFixtureNode::testActionFeebackCount()
{ 
  if(!fib_client_->wait_for_action_server(std::chrono::seconds(3)))
  {
    RCLCPP_ERROR_STREAM(get_logger(), "Timed out waiting for action server to be available");
    return false;
  }

  auto options = rclcpp_action::Client<Fibonacci>::SendGoalOptions();
  options.goal_response_callback = [](std::shared_future<GoalHandleFibonacci::SharedPtr> future) -> void {
    (void)future;
  };

  std::mutex lk;
  std::vector<Fibonacci::Feedback> feedback;
  std::promise<void> promise;

  options.feedback_callback = [=, &feedback, &lk,
                               &promise](GoalHandleFibonacci::SharedPtr gh,
                                         std::shared_ptr<const Fibonacci::Feedback> fb) -> void {
    (void)gh;
    std::scoped_lock _(lk);
    feedback.push_back(*fb);
    RCLCPP_INFO_STREAM(get_logger(), "Have " << feedback.size() << " state feedback messages");
    if (feedback.size() == ORDER - 1)
    {
      promise.set_value();
    }
  };

  options.result_callback = [](const GoalHandleFibonacci::WrappedResult& result) -> void { (void)result; };

  auto goal = Fibonacci::Goal();
  goal.order = ORDER;

  auto fut = fib_client_->async_send_goal(goal, options);

  if (fut.wait_for(std::chrono::seconds(1)) == std::future_status::timeout)
  {
    RCLCPP_ERROR_STREAM(get_logger(), "Timed out waiting for goal response callback");
    return false;
  }

  auto gh = fut.get();
  auto res_fut = fib_client_->async_get_result(gh);
  if (res_fut.wait_for(std::chrono::seconds(ORDER + 1)) == std::future_status::timeout)
  {
    RCLCPP_ERROR_STREAM(get_logger(), "Timed out waiting for result callback");
    return false;
  }

  auto received_state_fut = promise.get_future();
  if (received_state_fut.wait_for(std::chrono::seconds(ORDER + 1)) == std::future_status::timeout)
  {
    RCLCPP_ERROR_STREAM(get_logger(), "Timed out waiting for feedback vector to be populated");
    return false;
  }

  if (feedback.size() != ORDER - 1)
  {
    RCLCPP_ERROR_STREAM(get_logger(), "Wrong number of feedback messages: expected " << ORDER - 1 <<  ", got " << feedback.size() << ".");
    return false;
  }

  return true;
}

}  // namespace rclcpp_multicore_ci_test

TEST(TaskSequenceInterface, setup_plan_exec_through_ros)
{
  auto node = std::make_shared<rclcpp_multicore_ci_test::TestFixtureNode>();

  rclcpp::executors::MultiThreadedExecutor exec;

  auto exec_thread = std::make_unique<std::thread>([&exec, &node]() {
    exec.add_node(node);
    exec.spin();
    exec.remove_node(node);
  });

  ASSERT_TRUE(node->testActionFeebackCount());

  exec.cancel();
  exec_thread->join();
}

int main(int argc, char* argv[])
{
  rclcpp::init(argc, argv);

  testing::InitGoogleTest(&argc, argv);

  rclcpp::NodeOptions options;

  bool all_successful = RUN_ALL_TESTS();

  rclcpp::shutdown();

  return all_successful;
}

test_nodes.py

#!/usr/bin/env python3

import os
import sys
from launch import LaunchService, LaunchDescription
from launch.actions import EmitEvent, ExecuteProcess, RegisterEventHandler
from launch.events import Shutdown
from launch.event_handlers import OnProcessExit
from launch_ros.actions import Node
from launch_testing.legacy import LaunchTestService

def main(argv=sys.argv[1:]):
    test_executable = os.getenv("TEST_EXECUTABLE")

    test_action = ExecuteProcess(
        cmd=[test_executable], name="test_node", output="screen"
    )

    fib_server_node = Node(
        package="action_tutorials_cpp",
        executable="fibonacci_action_server",
        name="fibonacci_action_server",
        output="screen",
    )

    ld = LaunchDescription([])
    ld.add_action(fib_server_node)

    lts = LaunchTestService()
    lts.add_test_action(ld, test_action)

    ls = LaunchService(argv=argv)
    ls.include_launch_description(ld)
    return lts.run(ls)

if __name__ == "__main__":
    sys.exit(main())

KavenYau commented 3 years ago

https://github.com/ros2/rclcpp/blob/1c617515403db16b8cbabd834073eaca0f9a452d/rclcpp_action/include/rclcpp_action/client.hpp#L468-L473

https://github.com/ros2/rclcpp/blob/1c617515403db16b8cbabd834073eaca0f9a452d/rclcpp_action/include/rclcpp_action/client.hpp#L664-L670

I think the problem is that goal_handle hadn't been inserted yet while the feedback message came.

fujitatomoya commented 3 years ago

I think the problem is that goal_handle hadn't been inserted yet while the feedback message came.

so do i. this can be checked by changing logging level into debug. (--ros-args --log-level debug, https://docs.ros.org/en/foxy/Tutorials/Logging-and-logger-configuration.html)

small number of CPU cores (for example, 2 cores for Travis and GH Actions runners)

probably, server already sends back the feedback after goal_response message? it is expected that goal_response would be taken 1st on client but if feeback message is already delivered before taking goal_response, feedback message is the one to be taken at 1st on client.

how about at least taking goal_response before feedback on client when executable is available?

KavenYau commented 3 years ago

how about at least taking goal_response before feedback on client when executable is available?

I think it'd be better.

I have a question, would the receiving order be inconsistent with the sending order? If it really receives a feedback message before goal_response message, it'd still be a issue, right?

Edit: Recently i met a problem that was similar to my question. I sent a goal request, and called spin_until_future_complete to spin, but the result_callback was called before spin_until_future_complete returned. I haven't opened a issue for it yet.

fujitatomoya commented 3 years ago

If it really receives a feedback message before goal_response message, it'd still be a issue, right?

at least server sends goal_response before feedback, but i do not think it guarantees the same order on client side.

schornakj commented 3 years ago

I think the problem is that goal_handle hadn't been inserted yet while the feedback message came.

so do i. this can be checked by changing logging level into debug. (--ros-args --log-level debug, https://docs.ros.org/en/foxy/Tutorials/Logging-and-logger-configuration.html)

If I set the logging level for the client node to DEBUG, I do see the Received feedback for unknown goal. Ignoring... error message. I also see a Received status for unknown goal. Ignoring... message -- what does that mean in the context of communication between the action client and server?

[INFO] [launch]: All log files can be found below /home/jschornak/.ros/log/2021-05-24-12-30-52-607989-jschornak-PickNik-ThinkPad-P15-3642104
[INFO] [launch]: Default logging verbosity is set to INFO
[INFO] [fibonacci_action_server-1]: process started with pid [3642106]
[INFO] [test_node-2]: process started with pid [3642108]
[fibonacci_action_server-1] 1621873852.674547 [0] fibonacci_: using network interface wlp0s20f3 (udp/192.168.0.4) selected arbitrarily from: wlp0s20f3, docker0
[test_node-2] [==========] Running 1 test from 1 test suite.
[test_node-2] [----------] Global test environment set-up.
[test_node-2] [----------] 1 test from TestMulticoreCI
[test_node-2] [ RUN      ] TestMulticoreCI.require_action_feedback
[test_node-2] 1621873852.678737 [0] test_fixtu: using network interface wlp0s20f3 (udp/192.168.0.4) selected arbitrarily from: wlp0s20f3, docker0
[test_node-2] [INFO] [1621873852.683964386] [test_node]: Setting severity threshold to DEBUG
[fibonacci_action_server-1] [INFO] [1621873855.683938429] [fibonacci_action_server]: Received goal request with order 5
[fibonacci_action_server-1] [INFO] [1621873855.684177294] [fibonacci_action_server]: Executing goal
[fibonacci_action_server-1] [INFO] [1621873855.684247222] [fibonacci_action_server]: Publish feedback
[test_node-2] [DEBUG] [1621873855.684374273] [test_node.rclcpp_action]: Received feedback for unknown goal. Ignoring...
[test_node-2] [DEBUG] [1621873855.684404149] [test_node.rclcpp_action]: Received status for unknown goal. Ignoring...
[fibonacci_action_server-1] [INFO] [1621873856.684338523] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621873856.684414795] [test_node]: Have 1 state feedback messages
[fibonacci_action_server-1] [INFO] [1621873857.684864388] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621873857.684915198] [test_node]: Have 2 state feedback messages
[fibonacci_action_server-1] [INFO] [1621873858.684417924] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1621873858.684452759] [test_node]: Have 3 state feedback messages
[fibonacci_action_server-1] [INFO] [1621873859.685108832] [fibonacci_action_server]: Goal succeeded
[test_node-2] [DEBUG] [1621873859.685558882] [test_node.rclcpp_action]: Received status for unknown goal. Ignoring...
[test_node-2] [ERROR] [1621873865.685394212] [test_node]: Timed out waiting for feedback vector to be populated
[test_node-2] terminate called without an active exception
[test_node-2] /home/jschornak/workspaces/test_ws/src/rclcpp_multicore_ci_test/src/test_fixture_node.cpp:114: Failure
[test_node-2] Value of: node->testActionFeebackCount()
[test_node-2]   Actual: false
[test_node-2] Expected: true
[ERROR] [test_node-2]: process has died [pid 3642108, exit code -6, cmd 'build/rclcpp_multicore_ci_test/test_fixture_node'].
[INFO] [fibonacci_action_server-1]: sending signal 'SIGINT' to process[fibonacci_action_server-1]
[fibonacci_action_server-1] [INFO] [1621873865.837863207] [rclcpp]: signal_handler(signal_value=2)
[INFO] [fibonacci_action_server-1]: process has finished cleanly [pid 3642106]

fujitatomoya commented 3 years ago

@schornakj thanks for checking on this. that means client takes feedback and status messages before goal_response. Is that possible to try https://github.com/ros2/rclcpp/pull/1684 to see if you can still meet the issue?

schornakj commented 3 years ago

@fujitatomoya I need to set up a workspace to build the ROS2 main branches from source, but once I can do that I'll test your branch.

schornakj commented 3 years ago

@fujitatomoya Sorry, I've gotten tied up in other work and ran into some problems building ROS2 from source, so I want to make sure I'm not a blocker to resolving this issue.

If anyone who can conveniently build the branch with the fix is able to try out my minimal example package and see if the tests succeed, I'm happy to use your results to see if this issue has been fixed.

fujitatomoya commented 3 years ago

@schornakj thanks for checking on this, and no pressure.

btw, i tried to reproduce the issue in my environment (AMD Ryzen 9 3900X 12-Core Processor) with source build, but so far i am unable to reproduce the problem.

root@134f29e8f25f:~/ros2_ws/colcon_ws# python3 -u src/ament/ament_cmake/ament_cmake_test/cmake/run_test.py /tmp/asd --package-name rclcpp_multicore_ci_test --env TEST_LAUNCH_DIR=src/ros2/rclcpp_multicore_ci_test/test TEST_EXECUTABLE=build/rclcpp_multicore_ci_test/test_fixture_node --command src/ros2/rclcpp_multicore_ci_test/test/test_nodes.py
...
[test_node-2] [INFO] [1622773201.325687275] [test_node]: Have 1 state feedback messages
[fibonacci_action_server-1] [INFO] [1622773202.325839068] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1622773202.326052630] [test_node]: Have 2 state feedback messages
[fibonacci_action_server-1] [INFO] [1622773203.325802212] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1622773203.326088555] [test_node]: Have 3 state feedback messages
[fibonacci_action_server-1] [INFO] [1622773204.325817075] [fibonacci_action_server]: Publish feedback
[test_node-2] [INFO] [1622773204.326089248] [test_node]: Have 4 state feedback messages
[fibonacci_action_server-1] [INFO] [1622773205.326234247] [fibonacci_action_server]: Goal succeeded
[test_node-2] [       OK ] TestMulticoreCI.require_action_feedback (7217 ms)
[test_node-2] [----------] 1 test from TestMulticoreCI (7217 ms total)
[test_node-2]
[test_node-2] [----------] Global test environment tear-down
[test_node-2] [==========] 1 test from 1 test suite ran. (7217 ms total)
[test_node-2] [  PASSED  ] 1 test.

schornakj commented 3 years ago

@fujitatomoya What happens if you limit the number of cores you use to run the test? I did this by prepending taskset -c 0-1 when running the test to constrain it to only 2 cores. Based on your comment this command should be:

taskset -c 0-1 python3 -u src/ament/ament_cmake/ament_cmake_test/cmake/run_test.py /tmp/asd --package-name rclcpp_multicore_ci_test --env TEST_LAUNCH_DIR=src/ros2/rclcpp_multicore_ci_test/test TEST_EXECUTABLE=build/rclcpp_multicore_ci_test/test_fixture_node --command src/ros2/rclcpp_multicore_ci_test/test/test_nodes.py

In general I've encountered fewer test failures when using more and more cores, which I think is because there's a lower chance of the feedback callback handling being blocked. I also see few (if any) failures when using exactly one core, which I think is because the callbacks are always handled in the right order in this case.

iuhilnehc-ynos commented 3 years ago

It's a very interesting issue.

Even if aciton server rcl_action_send_goal_response is called inside rclcpp_action before user call goal_handle->publish_feedback(feedback); at application, there might be exist three cases about which packet data come first.

case 1

[normal case, expectation] goal_response data arrived, and then the executor call ClientBase::handle_goal_response; then the feedback data come after.
case 2

goal_response and feedback data are all stored into DDS buffer(socket function is faster than ros2 executor), the bug happened when the executor to deal with this waitable entity, the PR of @fujitatomoya can solve this case.
case 3

ROS2 can't guarantee which packet data come first. As we know, UDP is "unordered". There is no GUARANTEE that when sending packet A and then packet B to a client, the client will see first packet A and then packet B. It may see first B then A. feedback data arrived and is stored into DDS buffer, (at this time, the goal_response not coming), the executor still to deal with this waitable entity, and then to call ClientBase::handle_feedback_message, so the bug still happened.

I can't find an easy way (using the existing variable) to resolve it. The following steps might be a possible solution,

add a method into rclcpp::Waitable::can_take_data() { return true;}
add a flag in rclcpp_action::ClientBase and override rclcpp_action::ClientBase::can_take_data() { // use the flag; }
the flag to control ClientBase::take_data
add if any_executable.waitable->can_take_data() before any_executable.data = any_executable.waitable->take_data(); in the executor.

Mixing multiple entities into a Waitable object but the business logic caring about processing the order of entities, that's the problem.

fujitatomoya commented 3 years ago

@iuhilnehc-ynos thanks for checking on this, can you make this problem happen? so far, i just cannot see this problem, but in theory there is a few race condition we could have as you mentioned.

iuhilnehc-ynos commented 3 years ago

@fujitatomoya

can you make this problem happen?

No, I can't. (Just thinking in theory, sorry for making the noise.)

KavenYau commented 3 years ago

I think the main question is should every feedback_message be received safely by the client? If not, perhaps this is not a real problem.:flushed:

schornakj commented 3 years ago

I think the main question is should every feedback_message be received safely by the client? If not, perhaps this is not a real problem.:flushed:

If I publish a message I expect it to be successfully received on the other end, rather than being mysteriously consumed by the client waitable handling implementation.

In my project I'm intermittently publishing action feedback messages to give info about the state of action execution, and it's important that the entities monitoring those messages on the client side don't miss any feedback updates so they still get an accurate depiction of the action status. It's possible to refactor this to not use action feedback, but I think it's more important to make sure that these aspects of the action client actually do always work as expected.

KavenYau commented 3 years ago

but I think it's more important to make sure that these aspects of the action client actually do always work as expected.

@fujitatomoya Is there any design document to clarify the expected behaviors of the feedback_message?

static const rmw_qos_profile_t rmw_qos_profile_default =
{
  RMW_QOS_POLICY_HISTORY_KEEP_LAST,
  10,
  RMW_QOS_POLICY_RELIABILITY_RELIABLE,
  RMW_QOS_POLICY_DURABILITY_VOLATILE,
  RMW_QOS_DEADLINE_DEFAULT,
  RMW_QOS_LIFESPAN_DEFAULT,
  RMW_QOS_POLICY_LIVELINESS_SYSTEM_DEFAULT,
  RMW_QOS_LIVELINESS_LEASE_DURATION_DEFAULT,
  false
};

It currently using rmw_qos_profile_default as the qos of feedback_message. So with #1684, I think in most cases it would work as you expected.

@iuhilnehc-ynos 's analysis (https://github.com/ros2/rclcpp/issues/1679#issuecomment-854401456) is pretty good, and the case 3 is complicated. In this case, you might miss the first few feedbacks or the last few feedbacks, or receive few disorder feedbacks. Actually I don't know the probability of its appearance, but I have not met this issue yet.

fujitatomoya commented 3 years ago

Is there any design document to clarify the expected behaviors of the feedback_message?

No i do not think so.

@ivanpauno @wjwwood would you share your thoughts on this?

wjwwood commented 3 years ago

There are some examples here:

https://design.ros2.org/articles/actions.html#clientserver-interaction-examples

But all of them indicate feedback is sent after the result request is received. I wonder if the intention is that you should not send feedback for a goal until the result request is received. This would ensure that feedback is never sent before the client has received the "goal requested" response.

CC @jacobperron, he might have more insight.

ivanpauno commented 3 years ago

I think the problem is that goal_handle hadn't been inserted yet while the feedback message came.

Maybe the GoalHandle should be inserted when sending the goal request and not when receiving the goal response to avoid the race. I'm not completely sure if that's an okay solution or not.

jacobperron commented 3 years ago

Maybe the GoalHandle should be inserted when sending the goal request and not when receiving the goal response to avoid the race.

We should be careful with this, since I think the intention is to only track "accepted" goal handles. IIRC, if a goal is rejected, then we don't even create a GoalHandle object.

jacobperron commented 3 years ago

There are some examples here:

https://design.ros2.org/articles/actions.html#clientserver-interaction-examples

But all of them indicate feedback is sent after the result request is received.

I don't think we need to have sent a result request to get feedback.

As described earlier in this thread, I'm pretty sure the issue is the race between getting the GoalResponse from the action server and the first feedback message(s). Since a we don't create a goal handle until after receiving a GoalResponse, we can't update any feedback for it prior:

https://github.com/ros2/rclcpp/blob/1c617515403db16b8cbabd834073eaca0f9a452d/rclcpp_action/include/rclcpp_action/client.hpp#L468-L473

In my mind, I never considered losing a couple feedback messages to be a problem. But I can see how this may depend on the application.

A possible workaround would be to delay the publish of feedback messages from the action server until we can confirm that the goal response was received (I don't know if this is technically possible).

As @ivanpauno proposes, we could try to track goals before getting a GoalResponse, though I think this was originally avoided to keep the implementation "simple". I think the code architecture would have to change significantly to handle this case, thought I haven't investigated.

wjwwood commented 3 years ago

I don't think we need to have sent a result request to get feedback.

As described earlier in this thread, I'm pretty sure the issue is the race between getting the GoalResponse from the action server and the first feedback message(s).

Ah, I see, there's no requirement that the client request a response immediately. I was just assuming they would, but that isn't the case.

A possible workaround would be to delay the publish of feedback messages from the action server until we can confirm that the goal response was received (I don't know if this is technically possible).

I was thinking that if you ensure the client immediately requests a response then that could tell you for sure the client has received the goal response, but that might not be a good solution if the client needs to be doing other things in the meantime.

As @ivanpauno proposes, we could try to track goals before getting a GoalResponse, though I think this was originally avoided to keep the implementation "simple". I think the code architecture would have to change significantly to handle this case, thought I haven't investigated.

Right, this gets into queuing and figuring out how many feedback to store per goal and how long to keep them before deciding that the feedback for that goal should be discarded because the GoalResponse is never coming (either the server died or the service response was otherwise lost).

jacobperron commented 3 years ago

I was thinking that if you ensure the client immediately requests a response then that could tell you for sure the client has received the goal response, but that might not be a good solution if the client needs to be doing other things in the meantime.

I think you're suggesting that we use the result request as a signal that it's safe to start sending feedback. I think this would work, but assumes that the action client will send a result request right away. I'm pretty sure if a result callback is registered, then a result request is sent right away. Though sending a result request right away is probably the most common use case, it's possible for the action client to defer sending the result request until much later or not at all (e.g. I think it's possible to send a goal, get feedback, and just ignore the result).

fujitatomoya commented 3 years ago

In my mind, I never considered losing a couple feedback messages to be a problem.

So did I.

Right, this gets into queuing and figuring out how many feedback to store per goal and how long to keep them before deciding that the feedback for that goal should be discarded because the GoalResponse is never coming (either the server died or the service response was otherwise lost).

totally agree.

at least, i think https://github.com/ros2/rclcpp/pull/1684 should be applied to improve (but not guarantee), and probably we would want to reconsider the order to take data on client.

jacobperron commented 3 years ago

1684 looks okay to me, but I'm not convinced it will actually improve the situation since there's no guarantee on the arrival order of messages.

Really, we need to answer: should users be able to get guaranteed delivery of all feedback messages?

If the answer is yes, then I think a reasonable solution is to queue feedback messages (and maybe status messages) on the action client until it receives a goal response.

Queuing could result in a burst of feedback messages if there's a delay in receiving the goal response message or increased memory usage if the goal response is lost. I think the feedback queue size should be configurable by the user so that they have the option to keep the current behaviour (queue size 0) or tune the size to their application.

KavenYau commented 3 years ago

If the answer is yes, then I think a reasonable solution is to queue feedback messages (and maybe status messages) on the action client until it receives a goal response.

What if this is a race between getting the GoalResult from the action server and the last feedback message(s)? Perhaps it's a unsolvable issue which caused by unordered messages.

ivanpauno commented 3 years ago

What if this is a race between getting the GoalResult from the action server and the last feedback message(s)? Perhaps it's a unsolvable issue which caused by unordered messages.

For these kinds of things, I would like to have a way to indicate that order has to be respected (i.e. that the action server response to the goal request and the feedback topic writes should maintain order in the receiver side). This is possible in DDS, by using the same DDS Publisher in both data writers (the goal request response writer and the feedback writer) and using [GROUP PRESENTATION QOS] (https://community.rti.com/static/documentation/connext-dds/5.2.0/doc/manuals/connext_dds/html_files/RTI_ConnextDDS_CoreLibraries_UsersManual/Content/UsersManual/PRESENTATION_QosPolicy.htm).

Having an equivalent in ROS 2 API would be nice, but it isn't obvious. "ROS Publishers" are sort of equivalent to "DDS Data Writers" and we generally don't control the "DDS publishers" the rmw implementation uses (most of them are using only one for all the writers now, I think that some used to create a publisher per data writer).

dhood commented 1 year ago

Hi, I don't have a repeatable example for this, but want to highlight that I've seen this issue in an rclpy executor as well (using Humble and Cyclone DDS):

EDIT: I now believe this to have been caused by a bug on our rclpy code where the node was being spun in multiple threads by mistake - can probably be disregarded for the purposes of this conversation, but I'll leave the comment for posterity.

[Exception in thread Thread-1 (spin):
Traceback (most recent call last):
 File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
   self.run()
 File "/usr/lib/python3.10/threading.py", line 953, in run
   self._target(*self._args, **self._kwargs)
 File "/opt/atlassian/pipelines/agent/build/install/test_control/lib/python3.10/site-packages/test_control/test_robotgui.py", line 1334, in spin
   self.executor.spin()
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 279, in spin
   self.spin_once()
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 764, in spin_once
   self._spin_once_impl(timeout_sec)
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 761, in _spin_once_impl
   future.result()
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 94, in result
   raise self.exception()
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
   self._handler.send(None)
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 418, in handler
   await call_coroutine(entity, arg)
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 372, in _execute_service
   response = await await_or_execute(srv.callback, request, srv.srv_type.Response())
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 107, in await_or_execute
   return callback(*args)
 File "/opt/atlassian/pipelines/agent/build/install/test_control/lib/python3.10/site-packages/test_control/test_robotgui.py", line 627, in move_to_testing_pose_service_callback
   self.move_robot_to_point(vel_scaling=vel_scaling,
 File "/opt/atlassian/pipelines/agent/build/install/test_control/lib/python3.10/site-packages/test_control/test_robotgui.py", line 888, in move_robot_to_point
   rclpy.spin_once(self.action_client_node)
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/__init__.py", line 202, in spin_once
   executor.spin_once(timeout_sec=timeout_sec)
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 705, in spin_once
   handler, entity, node = self.wait_for_ready_callbacks(timeout_sec=timeout_sec)
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 691, in wait_for_ready_callbacks
   return next(self._cb_iter)
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 610, in _wait_for_ready_callbacks
 File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/action/client.py", line 231, in is_ready
   ready_entities = self._client_handle.is_ready(wait_set)
clpy._rclpy_pybind11.RCLError: Failed to get number of ready entities for action client: wait set index for feedback subscription is out of bounds, at ./src/rcl_action/action_client.c:619

It happened while waiting for the status (also in CI, but Bitbucket Pipelines which I've read uses 4 cores instead of 2 but haven't confirmed). I'll keep an eye out for future occurrences.. hopefully it's contained to CI.

            rclpy.spin_once(self.action_client_node)

ros2 / rclcpp

Action client feedback callback does not reliably trigger #1679

Bug report

Expected behavior

Actual behavior

Steps to reproduce issue

Additional information

1684 looks okay to me, but I'm not convinced it will actually improve the situation since there's no guarantee on the arrival order of messages.