open-rmf / rmf_ros2

Internal ROS infrastructure for RMF
Apache License 2.0
74 stars 60 forks source link

Fix race condition for ingesting/dispensing and disable uncrustify tests by default #362

Closed Yadunund closed 5 months ago

Yadunund commented 5 months ago

359 is blocked by uncrustify tests and #361 attempted to lint the code to fix the style checks however it became clear that having a single codebase pass uncrustify on Jammy and Noble is not feasible. Since we want to keep CI on main compatible with Humble and Rolling, the best option is to disable uncrustify checks in testing.

Yadunund commented 5 months ago

test_rmf_task_ros2 is failling on Rolling.

luca-della-vedova commented 5 months ago

I am digging at this, results so far:

This happens in Rolling but no on Jazzy. The error is actually an rclcpp exception:

FAILED:
cannot store a negative time point in rclcpp::Time
at /usr/local/google/home/lucadv/noble_ws/src/rmf/rmf_ros2/rmf_task_ros2/test/bidding/test_SelfBid.cpp:152

I'll keep probing to see where this is done

luca-della-vedova commented 5 months ago

Addressed in https://github.com/open-rmf/rmf_ros2/pull/362/commits/c2a519db25cf5fa2e8475c378df926a34e60e0ba (at least locally, hopefully CI doesn't prove me wrong). It seems that using std::chrono::steady_clock::time_point::min() created the issue so I replaced min and max with "small" and "large" values, at the end of the day if I understand correctly for the estimator to work and the test to be successful it should be OK as long as one is larger than the other by some amount for QuickestFinishEvaluator to return the correct result. Now whether this masks some other issue, that I can't tell so I'll leave it up to you @Yadunund to comment on whether you are OK with this fix or you want to dig into why we were creating negative time points when the cost was min()

mxgrey commented 5 months ago

@luca-della-vedova I was also looking into the rmf_task_ros2 failure, but you narrowed down the cause faster than me. Your fix makes perfect sense as long as the user doesn't have their system clock set to a time before the year 1970.

To be honest I find it extremely questionable that rclcpp is throwing exceptions for negative time values. Even if they want to believe that no one intends to simulate a time point before the Unix epoch, throwing an exception is an awfully blunt instrument to use for data sanitization.

luca-della-vedova commented 5 months ago

More background on the choice is here

luca-della-vedova commented 5 months ago

Ah well now rmf_fleet_adapter fails

mxgrey commented 5 months ago

It turns out there's been an extremely subtle race condition bug in the ingesting and dispensing phase implementations. I think I've seen this show up as a flaky test in the past, but it used to be so extremely rare that it didn't seem worth investigating. Something about jazzy seems to be causing the race condition to be triggered much more easily.

The logic is fixed now by 6d1e2c9. The key change is, we should always read ingestor/dispenser responses that match our request, no matter what time they are coming in relative to other messages. But we should still account for the timing of that response message when considering state updates.

The change also now allows messages with the exact same time values to be considered as a pair, whereas previously it would have forced the phase to wait for a follow-up message.

Yadunund commented 5 months ago

@mxgrey your push did the trick ✨