ros2 / rmw_cyclonedds

ROS 2 RMW layer for Eclipse Cyclone DDS
Apache License 2.0
107 stars 89 forks source link

Set received_timestamp to system_clock::now() in message_info #491

Closed MichaelOrlov closed 1 month ago

MichaelOrlov commented 1 month ago
MichaelOrlov commented 1 month ago

cc: @clalancette

MichaelOrlov commented 1 month ago

Pulls: ros2/rmw_cyclonedds#491 Gist: https://gist.githubusercontent.com/MichaelOrlov/ff54ec8c3c5bc15af4efb81f790fa4e3/raw/10e2986bf8ca550de03356ae87e0366ae5fcea20/ros2.repos BUILD args: --packages-above-and-dependencies rmw_cyclonedds ros2bag rosbag2_transport rosbag2_tests TEST args: --packages-above rmw_cyclonedds ros2bag rosbag2_transport rosbag2_tests ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/13833

MichaelOrlov commented 1 month ago

Pulls: ros2/rmw_cyclonedds#491 Gist: https://gist.githubusercontent.com/MichaelOrlov/b4eb11ed0ac1fce5e8295432d7428ffe/raw/10e2986bf8ca550de03356ae87e0366ae5fcea20/ros2.repos BUILD args: --packages-above-and-dependencies rmw_cyclonedds_cpp ros2bag rosbag2_transport rosbag2_tests TEST args: --packages-above rmw_cyclonedds_cpp ros2bag rosbag2_transport rosbag2_tests ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/13834

clalancette commented 1 month ago

Ah, so the tests failing in CI are related.

The problem is that now that we have timestamp support, we need to update https://github.com/ros2/rcl/blob/f19067726b9bc2d0b90ac0d0f246d1d7821d41ee/rcl/test/CMakeLists.txt#L298-L316 to enable RMW_TIMESTAMPS_SUPPORTED for rmw_cyclonedds_cpp. I can open a PR for that in a little while here.

clalancette commented 1 month ago

Here is new CI with both this and https://github.com/ros2/rcl/pull/1156:

clalancette commented 1 month ago

CI revealed that we also have to update the receive_timestamp for services. That should fix some of the failing tests in CI, but @MichaelOrlov there are quite a number of failing tests for rosbag2 as well. Can you take a look at those?

MichaelOrlov commented 1 month ago

@clalancette I will take a look

MichaelOrlov commented 1 month ago

@clalancette It is a bit unexpected error in rosbag2 player

[ERROR] [1715177950.046068203] [rosbag2_player]: Failed to publish message on 'loop_test_topic' topic. 
9: Error: failed to borrow loaned msg: error not set
9: [ERROR] [1715177950.046081143] [rosbag2_player]: Failed to publish message on 'loop_test_topic' topic. 
9: Error: failed to borrow loaned msg: error not set
MichaelOrlov commented 1 month ago

In rosbag2 Player, we use rclcpp::GenericPublisher::publish_as_loaned_msg if the publisher allows to loan messages https://github.com/ros2/rosbag2/blob/91468782eccd6b11a99e2da80f313fb7ea722d40/rosbag2_transport/src/rosbag2_transport/player.cpp#L242-L253

    explicit PlayerPublisher(
      std::shared_ptr<rclcpp::GenericPublisher> pub,
      bool disable_loan_message)
    : publisher_(std::move(pub))
    {
      using std::placeholders::_1;
      if (disable_loan_message || !publisher_->can_loan_messages()) {
        publish_func_ = std::bind(&rclcpp::GenericPublisher::publish, publisher_, _1);
      } else {
        publish_func_ = std::bind(&rclcpp::GenericPublisher::publish_as_loaned_msg, publisher_, _1);
      }
    }

It seems, for some reason, the .borrow_loaned_message() inside rclcpp::GenericPublisher::publish_as_loaned_msg fails. However, I have no idea why for a while.

MichaelOrlov commented 1 month ago

@clalancette I've tried to checkout changes from this PR and from dependent https://github.com/ros2/rcl/pull/1156 and build everything from scratch on my local machine. All rosbag2 tests passing on my local setup either with rmw_cyclonddds and rmw_fastrps. May be we are missing some flags for CI run?

clalancette commented 1 month ago

@clalancette I've tried to checkout changes from this PR and from dependent ros2/rcl#1156 and build everything from scratch on my local machine. All rosbag2 tests passing on my local setup either with rmw_cyclonddds and rmw_fastrps. May be we are missing some flags for CI run?

Hm, that's weird. On my local setup, I can reproduce the errors with no problem.

That said, we definitely have a bug in rclcpp_lifecycle caused by https://github.com/ros2/rclcpp/pull/2450 . In particular, this is causing errors to happen in CI ever since it has been merged:

      Start  7: test_lifecycle_publisher

7: Test command: /usr/bin/python3 "-u" "/home/jenkins-agent/workspace/nightly_linux_release/ws/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/test_results/rclcpp_lifecycle/test_lifecycle_publisher.gtest.xml" "--package-name" "rclcpp_lifecycle" "--output-file" "/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/ament_cmake_gtest/test_lifecycle_publisher.txt" "--command" "/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/test_lifecycle_publisher" "--gtest_output=xml:/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/test_results/rclcpp_lifecycle/test_lifecycle_publisher.gtest.xml"
7: Working Directory: /home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle
7: Test timeout computed to be: 60
7: -- run_test.py: invoking following command in '/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle':
7:  - /home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/test_lifecycle_publisher --gtest_output=xml:/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/test_results/rclcpp_lifecycle/test_lifecycle_publisher.gtest.xml
7: Running main() from /home/jenkins-agent/workspace/nightly_linux_release/ws/install/gtest_vendor/src/gtest_vendor/src/gtest_main.cc
7: [==========] Running 4 tests from 1 test suite.
7: [----------] Global test environment set-up.
7: [----------] 4 tests from PerTimerType/TestLifecyclePublisher
7: [ RUN      ] PerTimerType/TestLifecyclePublisher.publish_managed_by_node/wall_timer
7: [WARN] [1715154999.943244122] [LifecyclePublisher]: Trying to publish message on the topic '/topic', but the publisher is not activated
7: [ERROR] [1715154999.943921662] [node]: Unable to start transition 6 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at /home/jenkins-agent/workspace/nightly_linux_release/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:423, at /home/jenkins-agent/workspace/nightly_linux_release/ws/src/ros2/rcl/rcl_lifecycle/src/rcl_lifecycle.c:368
7: [WARN] [1715154999.943982879] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(inactive)

I think this happens because the LifecycleNode destructor cannot assume that the publisher is still valid at this point. @fujitatomoya FYI, I may propose reverting that change for now.

clalancette commented 1 month ago

In any case, since we are getting really weird behavior here, I'm going to try another CI to see what happens with my latest change:

MichaelOrlov commented 1 month ago

@clalancette I figured out that I have a bit outdated versions of the other packages (about 3 weeks old). It might explain why test passes on my local setup.

I am trying to update the rclcpp package and reproduce the failure again. Perhaps I will be able to dissect with which PR we interfere if the issue starts reproducing on my local setup.

clalancette commented 1 month ago

I'll also point out https://github.com/eProsima/Fast-DDS/pull/4681 , which is suspiciously in this same area. I suspect maybe that one has a problem; if we fail again in CI, I'm going to try going earlier than that commit and seeing if that is the problem.

MichaelOrlov commented 1 month ago

No luck in reproducing failures with the latest rclcpp on my local setup. @clalancette BTW I think that https://github.com/eProsima/Fast-DDS/pull/4681 is unrelated because you have a pinned 2.14.x version of the Fast-DDS in your gist

 eProsima/Fast-CDR:
    type: git
    url: https://github.com/eProsima/Fast-CDR.git
    version: 2.2.x
  eProsima/Fast-DDS:
    type: git
    url: https://github.com/eProsima/Fast-DDS.git
    version: 2.14.x

And i have the same version in my local setup.

MichaelOrlov commented 1 month ago

Hm, from another hand, the rosbag2 tests failed on CI only with rmw_fastrtps_cpp

clalancette commented 1 month ago

@clalancette BTW I think that eProsima/Fast-DDS#4681 is unrelated because you have a pinned 2.14.x version of the Fast-DDS in your gist

That's not pinned; it takes the latest from the 2.14.x branch.

MichaelOrlov commented 1 month ago

@clalancette Ok this is certainly the Fast-DDS package. I've just updated it locally to the latest master and got the same errors as on CI.

clalancette commented 1 month ago

All right, I'm going to try backing up to an earlier version of Fast-DDS in my repos file. Here's another CI try with that:

MichaelOrlov commented 1 month ago

@clalancette It seems you were right about https://github.com/eProsima/Fast-DDS/pull/4681. I've just checked-out locally revision c2a4523a21a94302298bc995057719546b72df2d which is right before https://github.com/eProsima/Fast-DDS/pull/4681 and issues gone on my lcoal setup. Now building with https://github.com/eProsima/Fast-DDS/pull/4681 to verify that it is a source of the failures.

MichaelOrlov commented 1 month ago

@clalancette I confirm that https://github.com/eProsima/Fast-DDS/pull/4681 is a source of errors in rosbag2 Player with messages:

 [ERROR] [1715197871.773078660] [rosbag2_player]: Failed to publish message on 'topic1' topic. 
2: Error: failed to borrow loaned msg: error not set
2: [ERROR] [1715197871.872948366] [rosbag2_player]: Failed to publish message on 'topic1' topic. 
2: Error: failed to borrow loaned msg: error not set

We can start preparing reverting PR. Don't need to wait until CI will finish.

clalancette commented 1 month ago

Here's a PR to pin us back to a known working version of Fast-DDS: https://github.com/ros2/ros2/pull/1560

clalancette commented 1 month ago

All right, CI in https://github.com/ros2/rmw_cyclonedds/pull/491#issuecomment-2101288484 was successful, and we've now pinned that version in both Rolling and Jazzy. With that, this is approved alongside https://github.com/ros2/rcl/pull/1156 , so I'm going to merge both of them and backport both to jazzy. Thanks everyone for the work and reviews here.

clalancette commented 1 month ago

@mergifyio backport jazzy

mergify[bot] commented 1 month ago

backport jazzy

✅ Backports have been created

* [#493 Set received_timestamp to system_clock::now() in message_info (backport #491)](https://github.com/ros2/rmw_cyclonedds/pull/493) has been created for branch `jazzy`