Fix fuse optimizer unit test

svwilliams commented 6 months ago

Fix fuse optimizer unit test. The rclcpp::wait_for_message call was throwing an exception 'subscription already associated with a wait set'. I have no idea why. My instinct is that the node in this unit test is using a single-threaded executor with a dedicated spin thread, and that maybe the rclcpp::wait_for_message is also trying to spin that same subscriber? :shrug:

I switched the test to use a normal subscriber + callback function. It's not the prettiest thing, but it's not terrible either. I also needed to add a delay between publishing pose_msg1 and pose_msg2. Again, I have no idea why this is needed. I have the publisher queue set to 5, and the pose subscriber has a default queue size of 10. I'm not sure why the first message is getting lost.

ayrton04 commented 6 months ago

Looks like you're not the only one:

https://github.com/ros-controls/ros2_controllers/issues/1101

Slightly different use case, but the net result is the same. Looking at other issues linked there, I see a comment from William Woodall:

https://github.com/ros2/rclcpp/pull/2142#issuecomment-2078297424

It feels like there has to be some kind of dangling reference to the sub somewhere. You are using this overload:

https://github.com/ros2/rclcpp/blob/2cd8900dd20713f54a2a82d189a5e3372dca70be/rclcpp/include/rclcpp/wait_for_message.hpp#L96

The sub should go out of scope, but if something else is keeping a copy of the shared pointer, then it will obviously persist.

Your solution is obviously find, but I wonder if you used the other overload, which forces you to create the sub yourself, then you could manually call reset() on the sub shared pointer within the loop every time. Shouldn't be necessary, but maybe it would work? If you did use that overload, you could also call use_count on the sub pointer to see if someone is maintaining a reference to it somewhere.

ayrton04 commented 6 months ago

Maybe? It's hard to know if that test is throwing a fault on the first call to wait_for_message or a subsequent call. It's in a loop, so I am wondering if it's the second call that results in the error.

svwilliams commented 6 months ago

Just for fun I added some tracing print statements. It looks like the error throws on the very first call to wait_for_message().

    [test_fixed_lag_ignition-1] Waiting for message (0)...
    [test_fixed_lag_ignition-1] terminate called after throwing an instance of 'std::runtime_error'
    [test_fixed_lag_ignition-1]   what():  subscription already associated with a wait set
    [test_fixed_lag_ignition-1] Aborted (core dumped)

svwilliams commented 6 months ago

But as Tom/Carlos suggest:

Use the other overload that forces me to create the subscriber manually

Create that subscriber with the automatically_add_to_executor_with_node parameter set to false (which is undocumented)

while ((odom_msg.header.stamp != rclcpp::Time(3, 0,
RCL_ROS_TIME)) && (node->now() < result_timeout))
{
rclcpp::SubscriptionOptions opts;
opts.callback_group =
  node->create_callback_group(rclcpp::CallbackGroupType::MutuallyExclusive, false);
auto sub = node->create_subscription<nav_msgs::msg::Odometry>(
  "/odom", 1, [](const std::shared_ptr<const nav_msgs::msg::Odometry>) {}, opts);
rclcpp::wait_for_message(odom_msg, sub, node->get_node_options().context(),
  std::chrono::seconds(1));
}

This does work.

However, now that I see what is happening under the hood, I'm not in love with the loop creating a subscriber every time. That seems like it could lead to missed/dropped messages? If the external node published a message at just the wrong time?

svwilliams commented 6 months ago

I think I'm going to leave the unit test as-is -- create a normal subscriber and wait for the expected message. This feels like it requires far less explanation for how the test works.

But I did realize I introduced a threading problem. The last commit adds a mutex to guard access to the stored odometry message.

svwilliams commented 6 months ago

Checking if adding a comment will trigger the upstream CI build to run

svwilliams commented 6 months ago

I don't understand. The last PR magically triggered CI jobs after I added a comment. This one refuses. Merging anyway.

locusrobotics / fuse

Fix fuse optimizer unit test #369