Closed MichaelOrlov closed 1 month ago
cc: @clalancette
Pulls: ros2/rmw_cyclonedds#491 Gist: https://gist.githubusercontent.com/MichaelOrlov/ff54ec8c3c5bc15af4efb81f790fa4e3/raw/10e2986bf8ca550de03356ae87e0366ae5fcea20/ros2.repos BUILD args: --packages-above-and-dependencies rmw_cyclonedds ros2bag rosbag2_transport rosbag2_tests TEST args: --packages-above rmw_cyclonedds ros2bag rosbag2_transport rosbag2_tests ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/13833
Pulls: ros2/rmw_cyclonedds#491 Gist: https://gist.githubusercontent.com/MichaelOrlov/b4eb11ed0ac1fce5e8295432d7428ffe/raw/10e2986bf8ca550de03356ae87e0366ae5fcea20/ros2.repos BUILD args: --packages-above-and-dependencies rmw_cyclonedds_cpp ros2bag rosbag2_transport rosbag2_tests TEST args: --packages-above rmw_cyclonedds_cpp ros2bag rosbag2_transport rosbag2_tests ROS Distro: rolling Job: ci_launcher ci_launcher ran: https://ci.ros2.org/job/ci_launcher/13834
Ah, so the tests failing in CI are related.
The problem is that now that we have timestamp support, we need to update https://github.com/ros2/rcl/blob/f19067726b9bc2d0b90ac0d0f246d1d7821d41ee/rcl/test/CMakeLists.txt#L298-L316 to enable RMW_TIMESTAMPS_SUPPORTED for rmw_cyclonedds_cpp
. I can open a PR for that in a little while here.
Here is new CI with both this and https://github.com/ros2/rcl/pull/1156:
CI revealed that we also have to update the receive_timestamp for services. That should fix some of the failing tests in CI, but @MichaelOrlov there are quite a number of failing tests for rosbag2 as well. Can you take a look at those?
@clalancette I will take a look
@clalancette It is a bit unexpected error in rosbag2 player
[ERROR] [1715177950.046068203] [rosbag2_player]: Failed to publish message on 'loop_test_topic' topic.
9: Error: failed to borrow loaned msg: error not set
9: [ERROR] [1715177950.046081143] [rosbag2_player]: Failed to publish message on 'loop_test_topic' topic.
9: Error: failed to borrow loaned msg: error not set
In rosbag2 Player, we use rclcpp::GenericPublisher::publish_as_loaned_msg
if the publisher allows to loan messages
https://github.com/ros2/rosbag2/blob/91468782eccd6b11a99e2da80f313fb7ea722d40/rosbag2_transport/src/rosbag2_transport/player.cpp#L242-L253
explicit PlayerPublisher(
std::shared_ptr<rclcpp::GenericPublisher> pub,
bool disable_loan_message)
: publisher_(std::move(pub))
{
using std::placeholders::_1;
if (disable_loan_message || !publisher_->can_loan_messages()) {
publish_func_ = std::bind(&rclcpp::GenericPublisher::publish, publisher_, _1);
} else {
publish_func_ = std::bind(&rclcpp::GenericPublisher::publish_as_loaned_msg, publisher_, _1);
}
}
It seems, for some reason, the .borrow_loaned_message()
inside rclcpp::GenericPublisher::publish_as_loaned_msg
fails. However, I have no idea why for a while.
@clalancette I've tried to checkout changes from this PR and from dependent https://github.com/ros2/rcl/pull/1156 and build everything from scratch on my local machine. All rosbag2 tests passing on my local setup either with rmw_cyclonddds
and rmw_fastrps
.
May be we are missing some flags for CI run?
@clalancette I've tried to checkout changes from this PR and from dependent ros2/rcl#1156 and build everything from scratch on my local machine. All rosbag2 tests passing on my local setup either with
rmw_cyclonddds
andrmw_fastrps
. May be we are missing some flags for CI run?
Hm, that's weird. On my local setup, I can reproduce the errors with no problem.
That said, we definitely have a bug in rclcpp_lifecycle caused by https://github.com/ros2/rclcpp/pull/2450 . In particular, this is causing errors to happen in CI ever since it has been merged:
Start 7: test_lifecycle_publisher
7: Test command: /usr/bin/python3 "-u" "/home/jenkins-agent/workspace/nightly_linux_release/ws/install/ament_cmake_test/share/ament_cmake_test/cmake/run_test.py" "/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/test_results/rclcpp_lifecycle/test_lifecycle_publisher.gtest.xml" "--package-name" "rclcpp_lifecycle" "--output-file" "/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/ament_cmake_gtest/test_lifecycle_publisher.txt" "--command" "/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/test_lifecycle_publisher" "--gtest_output=xml:/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/test_results/rclcpp_lifecycle/test_lifecycle_publisher.gtest.xml"
7: Working Directory: /home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle
7: Test timeout computed to be: 60
7: -- run_test.py: invoking following command in '/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle':
7: - /home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/test_lifecycle_publisher --gtest_output=xml:/home/jenkins-agent/workspace/nightly_linux_release/ws/build/rclcpp_lifecycle/test_results/rclcpp_lifecycle/test_lifecycle_publisher.gtest.xml
7: Running main() from /home/jenkins-agent/workspace/nightly_linux_release/ws/install/gtest_vendor/src/gtest_vendor/src/gtest_main.cc
7: [==========] Running 4 tests from 1 test suite.
7: [----------] Global test environment set-up.
7: [----------] 4 tests from PerTimerType/TestLifecyclePublisher
7: [ RUN ] PerTimerType/TestLifecyclePublisher.publish_managed_by_node/wall_timer
7: [WARN] [1715154999.943244122] [LifecyclePublisher]: Trying to publish message on the topic '/topic', but the publisher is not activated
7: [ERROR] [1715154999.943921662] [node]: Unable to start transition 6 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at /home/jenkins-agent/workspace/nightly_linux_release/ws/src/ros2/rcl/rcl/src/rcl/publisher.c:423, at /home/jenkins-agent/workspace/nightly_linux_release/ws/src/ros2/rcl/rcl_lifecycle/src/rcl_lifecycle.c:368
7: [WARN] [1715154999.943982879] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(inactive)
I think this happens because the LifecycleNode
destructor cannot assume that the publisher is still valid at this point. @fujitatomoya FYI, I may propose reverting that change for now.
@clalancette I figured out that I have a bit outdated versions of the other packages (about 3 weeks old). It might explain why test passes on my local setup.
I am trying to update the rclcpp
package and reproduce the failure again. Perhaps I will be able to dissect with which PR we interfere if the issue starts reproducing on my local setup.
I'll also point out https://github.com/eProsima/Fast-DDS/pull/4681 , which is suspiciously in this same area. I suspect maybe that one has a problem; if we fail again in CI, I'm going to try going earlier than that commit and seeing if that is the problem.
No luck in reproducing failures with the latest rclcpp on my local setup. @clalancette BTW I think that https://github.com/eProsima/Fast-DDS/pull/4681 is unrelated because you have a pinned 2.14.x version of the Fast-DDS in your gist
eProsima/Fast-CDR:
type: git
url: https://github.com/eProsima/Fast-CDR.git
version: 2.2.x
eProsima/Fast-DDS:
type: git
url: https://github.com/eProsima/Fast-DDS.git
version: 2.14.x
And i have the same version in my local setup.
Hm, from another hand, the rosbag2 tests failed on CI only with rmw_fastrtps_cpp
@clalancette BTW I think that eProsima/Fast-DDS#4681 is unrelated because you have a pinned 2.14.x version of the Fast-DDS in your gist
That's not pinned; it takes the latest from the 2.14.x branch.
@clalancette Ok this is certainly the Fast-DDS
package. I've just updated it locally to the latest master and got the same errors as on CI.
@clalancette It seems you were right about https://github.com/eProsima/Fast-DDS/pull/4681.
I've just checked-out locally revision c2a4523a21a94302298bc995057719546b72df2d
which is right before https://github.com/eProsima/Fast-DDS/pull/4681 and issues gone on my lcoal setup.
Now building with https://github.com/eProsima/Fast-DDS/pull/4681 to verify that it is a source of the failures.
@clalancette I confirm that https://github.com/eProsima/Fast-DDS/pull/4681 is a source of errors in rosbag2 Player with messages:
[ERROR] [1715197871.773078660] [rosbag2_player]: Failed to publish message on 'topic1' topic.
2: Error: failed to borrow loaned msg: error not set
2: [ERROR] [1715197871.872948366] [rosbag2_player]: Failed to publish message on 'topic1' topic.
2: Error: failed to borrow loaned msg: error not set
We can start preparing reverting PR. Don't need to wait until CI will finish.
Here's a PR to pin us back to a known working version of Fast-DDS: https://github.com/ros2/ros2/pull/1560
All right, CI in https://github.com/ros2/rmw_cyclonedds/pull/491#issuecomment-2101288484 was successful, and we've now pinned that version in both Rolling and Jazzy. With that, this is approved alongside https://github.com/ros2/rcl/pull/1156 , so I'm going to merge both of them and backport both to jazzy
. Thanks everyone for the work and reviews here.
@mergifyio backport jazzy
backport jazzy