ros2 / rclcpp

rclcpp (ROS Client Library for C++)
Apache License 2.0
548 stars 417 forks source link

Double spin required since 28.2.0 #2589

Open SteveMacenski opened 3 months ago

SteveMacenski commented 3 months ago

Bug report

Required Info:

Expected behavior

Spin All, Spin Some work as previously expected

Actual behavior

Spinning some or all requires a warmup spin that doesn't do anything before processing useful work.

Additional information

Nav2's CI broke recently with a unit test on a behavior tree node that I got to looking into yesterday. I was chatting with @clalancette about it in a comment thread and we realized that this is a rclcpp regression.

goal_updater_node spins an internal executor to process a callback function to get some data that has always worked until recently: https://github.com/ros-navigation/navigation2/blob/main/nav2_behavior_tree/plugins/decorator/goal_updater_node.cpp#L60. In debugging, I found that it wasn't getting the data the unit test was sending.

Blaming Executor::spin_some_impl(std::chrono::nanoseconds max_duration, bool exhaustive), I see there's been alot of poking around there recently in the last few months, and a PR to fix some other regression in May, which makes me think maybe there's another regression lurking in that corner.

Test A:

  callback_group_executor_.spin_all(std::chrono::milliseconds(200));

Pushing the 50ms up to 200ms still doesn't work

Test B:

  callback_group_executor_.spin_all(std::chrono::milliseconds(50));
  callback_group_executor_.spin_all(std::chrono::milliseconds(1));

This works

Test C:

   callback_group_executor_.spin_some(std::chrono::milliseconds(50));

Attempting to use spin some instead of spin all also fails

Test D:

Lets try ticking the node multiple times while having the single spin_all(50ms) (which would cause the spin to be called multiple times)

  tree_->rootNode()->executeTick();
  tree_->rootNode()->executeTick();
  tree_->rootNode()->executeTick();

This works

This experiment proves to me that something is odd with spinning behavior right now requiring multiple calls (whether sequentially or as part of the broader application loop) to process data. The first run is having no effect. I see this running in Nav2's CI that is running Cyclone and I can replicate this locally in a Docker container with Fast-DDS, so I think its a ROS 2 side issue, not DDS.

mjcarroll commented 3 months ago

Thanks for reporting, will take a look. Definitely sounds fishy.

alsora commented 3 months ago

I wonder if this issue is related to the use of the notifier waitable? If you add entities to a node that has already been added to an executor:

so with the new approach, when you call spin_* the only item of work ready is the waitable to refresh the entities (even if those entities already have work to do), calling spin a second time will now act on the updated waitset (which includes the new entities and will thus see their items of work)

The purpose of this change is that the previous behavior was a huge CPU bottleneck.

alsora commented 3 months ago

IMO this "bug" is expected behavior with the current code(although it's a breaking change and we didn't advertise it...).

See how in the unit-tests we have a spin_all or multiple spin_once to "consume waitable events" before starting the actual test: https://github.com/ros2/rclcpp/blob/rolling/rclcpp/test/rclcpp/executors/test_events_executor.cpp#L142-L171

The only ways that I can think of to fix it, consist of treating the notifier waitable as a "special entity", so that spin_some and spin_all go back to spinning if that's triggered, but I don't really like it.

SteveMacenski commented 3 months ago

is expected behavior with the current code

That seems like potentially an oversight. For subscribers on regular data, its not a huge deal IMO to drop the first message while the system is 'warming up' (i.e. first lidar scan), but many subscriptions are the result of infrequent or irregular events (like button pushes) and missing the first of an event class is comparatively bad to be the standard ROS 2 out-of-the-box policy.

See how in the unit-tests we have a spin_all or multiple spin_once to "consume waitable events" before starting the actual test

I don't think that anything about these need to change, but it would be good to have a unit test dedicated to regressions on 'first-spinning' events. In effect, that's what we caught in Nav2

alsora commented 3 months ago

No events are lost. Indeed if using spin you won't notice any difference.

The only difference is that if you call spin_once or the other variations that collect events only once, the events may be processed in the following iteration if they are originated from a newly added entity (this assuming that you are calling spin_once in a loop).

The "warm up" spin is needed only for entities that are added after the node / cb group is added to the executor. If multiple entities are added, a single spin iteration will setup them all.

In your example, if you create the subscription before adding the callback group to the executor you should solve the issue (for example moving this line to the end of the constructor https://github.com/ros-navigation/navigation2/blob/main/nav2_behavior_tree/plugins/decorator/goal_updater_node.cpp#L40)

fujitatomoya commented 3 months ago

I am not sure how to fix this... but wanted to post the comment for the direction we take...

The "warm up" spin is needed only for entities that are added after the node / cb group is added to the executor. If multiple entities are added, a single spin iteration will setup them all.

IMO this breaks the user application, huge behavior change... i really do not think requiring warmup spin before actual data processing with spin_xxx would be good user experience or API even with well-documented.

alsora commented 3 months ago

This only happens when spin_xxx is not called in a loop AND entities have been created after the node / cb group was added to the executor AND the user has an expectation on some events to be processed. IMO this is not a common situation, as spin_xxx should usually be called in a loop and the user doesn't have such expectations.

The problem affect mostly unit-tests where developers are trying to synchronize some operations: I do X, I spin and expect to get the message. And as mentioned in the previous message, this is easily solvable in unit-tests such as the one posted by @SteveMacenski by simply adding the cb group to the executor AFTER the entity has been created.

The "issue" is that executors have internal events and calling spin_xxx only once may execute those events rather than the user-defined ones.

alsora commented 3 months ago

After writing that message, I went to read the documentation of the spin_xxx functions, and maybe there is actually a bug.

SteveMacenski commented 3 months ago

No events are lost.

That is my verbal mistake. Events are not triggered when attempted to process the first time :+1:

In your example, if you create the subscription before adding the callback group to the executor you should solve the issue (for example moving this line to the end of the constructor https://github.com/ros-navigation/navigation2/blob/main/nav2_behavior_tree/plugins/decorator/goal_updater_node.cpp#L40)

I could to do that for this case in Nav2, but I agree with the above comment from @fujitatomoya - this is not a good UX for ROS 2 users that would cause some head scratching and frustration (and feed the 'ROS sucks' trolls). That's a pretty subtle detail that doesn't seem like from an application developer's perspective should matter, but I can ack from a developer side why that could happen:

IMO this breaks the user application, huge behavior change... i really do not think requiring warmup spin before actual data processing with spin_xxx would be good user experience or API even with well-documented.

alsora commented 3 months ago

I wrote a unit-test to exercise this behavior and run it across all executors... However:

The only way in which I made it fail was to do spin_all(std::chrono::microseconds(1)) which obviously runs out of time after executing the internal event so it won't even try to collect more work, so it's kind of equivalent to a spin_some

ros-discourse commented 2 months ago

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/next-client-library-wg-meeting-friday-2nd-august-2024/38881/1

jmachowinski commented 2 months ago

I guess we need to modify wait_for_work in the following way :

Alternative: At start of wait for work:

Second version sounds cleaner to me.

jmachowinski commented 2 months ago

Something like this :

  if(notify_waitable_)
  {
    // rcl::Wait
    executors::ExecutorNotifyWaitable::SharedPtr cpy = std::make_shared<executors::ExecutorNotifyWaitable>(*notify_waitable_);

    rclcpp::WaitSet tmp_wait_set_;
    tmp_wait_set_.add_waitable(cpy);
    auto res = tmp_wait_set_.wait(std::chrono::nanoseconds(0));
    if(res.kind() == WaitResultKind::Ready)
    {
        auto & rcl_wait_set = res.get_wait_set().get_rcl_wait_set();
        if (cpy->is_ready(rcl_wait_set)) {
          cpy->execute(cpy->take_data());
        }
    }
  }

I just tested this. This indeed fixes the bug, but some other tests fail because of it...

alsora commented 2 months ago

I digged into this issue and I found the root cause. I added a second test in https://github.com/ros2/rclcpp/pull/2591 that can successfully reproduce the problem. The PR https://github.com/ros2/rclcpp/pull/2595 fixes the problem in the unit-test.

This bug affects the SingleThreadedExecutor and the MultiThreadedExecutor. The EventsExecutor passes the test.

The current usage of the NotifyWaitable in wait-set based executors requires to fully handle this waitable in a special way as it can't ever be executed in normal ways (besides the aforementioned check, we also enforce elsewhere that every executable must have a callback group https://github.com/ros2/rclcpp/blob/rolling/rclcpp/src/rclcpp/executor.cpp#L457-L459)

This problem can be mitigated if you have multiple entities in your executor, but this is not the case in the Nav2 unit-test, where the notify waitable is the only entity managed by the executor when the test starts (there's also the goal updated subscription but due to the ordering of operations this is not immediately known to the executor). If the notify waitable is not the only entity with some work to do, then the "exhaustive" condition would have triggered a new wait_for_work.

This implies that the notify waitable is only executed explicitly here https://github.com/ros2/rclcpp/blob/rolling/rclcpp/src/rclcpp/executor.cpp#L744-L748, and never as part of a

get_next_executable()
execute_any_executable()

This lines happens at the end of a wait_for_work, but the update of the entity collection happens only at the beginning of wait_for_work. That's why in these scenarios we need to call it twice. For this reason, I think that @jmachowinski proposed fix makes sense, as it fixes the problem by fully handling the notify waitable in a single wait for work iteration.

alsora commented 2 months ago

The TL;DR is that:

@jmachowinski PR fixes both points for the waitset-based executors: https://github.com/ros2/rclcpp/pull/2595 The events executor does not have the spin_all bug, but has the spin_some limitation. I opened a draft PR to fix that: https://github.com/ros2/rclcpp/pull/2597.

Nothing should be done for spin_once since this does not guarantee that the single entity that will be executed is the one expected by the user.

We should add a disclaimer about the spin_some limitation in the StaticSingleThreadedExecutor. We already fixed this thing in two different ways for two executors, I would like to not have to do it a third time, especially since this executor should be deprecated.

ros-discourse commented 2 months ago

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/next-client-library-wg-meeting-friday-16th-august-2024/39130/1

SteveMacenski commented 1 month ago

Is there progress here?

alsora commented 1 month ago

@SteveMacenski I discussed this issue further with @jmachowinski .

The events executor PR has some minor issues to fix before merging it https://github.com/ros2/rclcpp/pull/2591

We understood the root cause of the problem for the waitset-baser executors. The PR https://github.com/ros2/rclcpp/pull/2595 requires more work as the approach currently proposed in it , is not the correct way to fix.

I should have some time to work on both PRs next week.

SteveMacenski commented 1 month ago

Ok, thanks!

doisyg commented 1 month ago

IMO this breaks the user application, huge behavior change... i really do not think requiring warmup spin before actual data processing with spin_xxx would be good user experience or API even with well-documented.

Fully agree here @fujitatomoya

I should have some time to work on both PRs next week.

Thanks, following closely as I am attempting to migrate our huge code base to Jazzy and we have a lot of spin_some outside loops. I don't want to start thinking about the implications if the behavior changes from Iron to Jazzy, so I will wait for this to be resolved.