ros2 / rclcpp

rclcpp (ROS Client Library for C++)
Apache License 2.0
536 stars 417 forks source link

Timer callbacks can be delayed when using simulation time #2535

Open thomasmoore-torc opened 4 months ago

thomasmoore-torc commented 4 months ago

This is a breakout of the issue described in https://github.com/ros2/rclcpp/issues/2532#issuecomment-2125015383.

Timer processing when using simulation time is a two step process (e.g. requires two waits of the wait set), so it's possible that subscription callbacks can get called before the timer callback when using simulation time. As timers have higher priority than subscriptions (as documented in the Scheduling semantics), one may expect that timers driven via the /clock topic would also have that same higher-priority and expect that timer callbacks driven by the /clock topic would execute before subscription callbacks for messages that were transmitted after the /clock message. Because of the two-step process, this is not necessarily the case.

image

A potential change to consider would be to have the clock jump handler registered by the timer call the timer callback directly instead of triggering the guard condition. This would only be applicable if the clock thread has been disabled or the default callback group has been configured to be reentrant (though that may be confusing when using the SingleThreadedExecutor, but would actually be consistent with the execute_timers_separate_thread option to the EventsExecutor).

image

It should be noted that due to the issue described in https://github.com/ros2/rclcpp/issues/2532, an implementation of the above suggestion would likely require that the executor have an explicit processing step for the /clock topic subscription before processing other subscriptions to ensure that the timers are executed before the other subscriptions, Prior to Jazzy, the /clock subscription is always handled first because it was internally registered by the TimeSource before any other application subscriptions.

wjwwood commented 4 months ago

I was thinking about this use case and I believe relying on the processing order in the executors is fragile. Perhaps that's something we could ensure more in the future and therefore make this more reliable, but there are just so many implementation details here that must not change to keep this working, e.g. if the clock subscription was ever recreated sometime after start up, or some other seemingly innocuous change related to the clock topic.

Instead, I wonder if a customized executor that waits for a ROS time update before each wait-execute loop with the clock topic thread enabled might result in a better solution. That might even make sense as an executor option for the standard executors.

The reason I think so is that if you ever need to sleep_until or sleep_for in a callback you'll get into a dead lock without the clock topic thread, and I'm also worried about the executor waking up, handling the clock subscription first (as you desire in your above apporach), but then there are hundreds of clock messages which all get processed first and only after that do you start processing new messages from other subscriptions, all of which will "appear" to occur at the same time according to ros time (because while handling these the clock messages that are streaming in are not being handled). Or some other complex interaction with the clock topic.

thomasmoore-torc commented 4 months ago

I was thinking about this use case and I believe relying on the processing order in the executors is fragile.

I would agree with this statement for the standard executors. However, I think executing the timer callbacks in response to receipt of the /clock message within the EventsExecutor could be much more feasible.

Instead, I wonder if a customized executor that waits for a ROS time update before each wait-execute loop with the clock topic thread enabled might result in a better solution. That might even make sense as an executor option for the standard executors.

A potential concern with this is how only one message per subscription is processed per spin cycle. If there are multiple publishers to a topic and multiple messages arrive on that topic between /clock messages, then those messages could be lost if only one is processed in the spin cycle and then spinning is delayed until a clock update occurs.

The reason I think so is that if you ever need to sleep_until or sleep_for in a callback you'll get into a dead lock without the clock topic thread

This is a valid concern. My thought was that this shortcut wouldn't work if the clock thread was explicitly disabled because having a clock thread inherently ensures that there will always be a race condition between timer and subscription callbacks.

To your point that the standard executors have no guarantee of ordering, it's seeming like the EventsExecutor is definitely a better choice for this use case so I'll take a look at the TimersManager to see if the shortcut approach could make sense there.

wjwwood commented 3 months ago

A potential concern with this is how only one message per subscription is processed per spin cycle. If there are multiple publishers to a topic and multiple messages arrive on that topic between /clock messages, then those messages could be lost if only one is processed in the spin cycle and then spinning is delayed until a clock update occurs.

That's true, but if you made a custom executor then you could change that behavior. Though it's not as easy as it could be to do that. For example, for each subscription you can call take until it fails to take something if you want to "drain" a subscription before waiting again. But that has all kinds of potential issues, e.g. if you take longer to process a message than the period between messages then you'll end up never waiting again for the next clock message, among other subtle issues.

The style of the EventsExecutor can help with that because it lets you know how many times you need to take a message from a subscription for a given clock message period, but it has to contend with queue management (what happens when events occur faster than you can handle them, i.e. unbounded event queue growth). I can see three-ish approaches to deal with it, if you cannot ensure you don't get into a situation where events are coming faster than you can handle them (which is not easy to ensure generally, but might be possible in specific applications):

There are a bunch of ways I can imagine to manage these issues for a specific usecase, but you should be wary of just plugging in the EventsExecutor (or any other one) and assuming it has addressed the issue you have without introducing more. It's just a tool like the other executors, and you have to be careful on how you use it, especially when you are sensitive to the ordering of things.


To your point that the standard executors have no guarantee of ordering, it's seeming like the EventsExecutor is definitely a better choice for this use case so I'll take a look at the TimersManager to see if the shortcut approach could make sense there.

Depends on what you mean by "guarantee of ordering". Order of what? And with respect to what? (receipt time vs registration order vs something else, etc.)

Also, just to be clear, the EventsExecutor also does not guarantee the order amongst entities (e.g. order of execution between two subscriptions based on their order of registration or something else), instead the order is driven by order of (sometimes asynchronous) events and can be affected by things like network traffic or races between processes which are scheduled at the OS level, at least for things like inter-process subscriptions and services. It is also influenced by the order in which the underlying rmw implementation decides to order the events. This means the order could be different run-to-run. This is also why I disabled my new execution order tests for the subscription-EventsExecutor pair, because it's execution order is inherently non-deterministic. It was consistent (most of the time), that is to say it came in the same order run to run, but it wasn't deterministic because I had no way to guess what that order would be from the rclcpp level. And if I added more subscriptions, then the order would change in a non-predictable way, from what I could tell. But again that's only in the specific scenario I set up in my test.

However, with a custom version of the EventsExecutor you could try to ensure all the events for a given time period are executed in groups by entity, e.g. execute all events from S1 during t0...t1 and then execute all events from S2 between t0...t1 even though the events were originally interleaved. And that might be your best option. But if you always execute the events in order, then it will not be deterministic (it may appear regular, like FIFO for events or something, depending on the time between events, but it can still theoretically be impacted by the asynchronous nature of the events).

The SingleThreadedExecutor will always (after my patches or before Jazzy) execute exactly one message from the ready subscriptions in the order then were registered, regardless of the order in which they received the messages. Which I'm pretty sure you're clear on, but I wanted to state it again for completeness and others that might read this later.


One more thing to be wary of is the issue of stale events, which can occur if your history cache gets full and you have to replace old data and/or if you have lifespan enabled such that messages can expire after being received and thus removed from the history cache.

Consider an example with a subscription that uses KEEP_LAST with depth=10 (typical default setting), and that between spins you receive a burst of 11 messages but then no more (to keep it simple). That means you'll have 11 events in the event queue but only 10 messages in the history cache. Then lets say you get a new clock message (on the clock thread) and decide its time to execute some of the messages for those events, and you use the timestamps of the events to know that the first 10 messages of the 11 events are in the period you should execute, so you execute 10 times. Again, assuming no new messages have come in, you have 1 event in the queue and 0 messages in the history cache, so when you go to execute the next time period, you try to take a message for the one event, but there isn't one, and worse you've execute a message from this time period in the last one mistakenly.

Now that's an overly contrived example, but I think something like that could come up in a real life scenario and only gets more complicated when you have messages coming in while your handling messages etc...

So just be care to not assume that messages you are taking are the ones you expected based on the events. This is a complexity that the "wait set based" executors avoid by not trying to handle the situation at all, and instead only tells you that a subscription was ready in the past and that if you try to take a message from it then you're likely to get one. It specifically does not try to guarantee there will be a message (we gracefully handle a failed take, see: https://github.com/ros2/rclcpp/blob/8230d15ef7ffa9e824d7e0f9c623537ae331dc73/rclcpp/src/rclcpp/subscription_base.cpp#L213-L226) and it does not try to associate a waking wait set with a specific message (it could be woken by one message and replaced with another before you try to take anything). So just keep that in mind.


All that being said, I hope you figure out a way to make this work for your usecase. Any hopefully some of the above rambling can help you avoid pitfalls. 🙃

thomasmoore-torc commented 3 months ago

@wjwwood - thanks for taking the time to provide your thoughts. With much digging into ROS and rclcpp, I've come to many of the same conclusions you have presented.

However, with a custom version of the EventsExecutor you could try to ensure all the events for a given time period are executed in groups by entity, e.g. execute all events from S1 during t0...t1 and then execute all events from S2 between t0...t1 even though the events were originally interleaved.

The issue I'm running into with creating a custom executor that will provide any form of guaranteed ordering is that it currently can't work with simulation time because the time stamps in rmw_message_info_t are always system time. As such, if we're using ROS time for the timers, then it's difficult (if not impossible) to determine the correlation between the timer intervals and the message publish time.

One option is to require that every message have a header with a time stamp, but I'm not sure that it would be practical to handle that in an executor for several reasons. First, the executor deals with type-erased data, making it difficult for it to introspect the contents of the message. Perhaps a custom Subscription subclass could be created which the executor could interact with to obtain the time stamp. Second, the intra-process shortcut essentially bypasses the subscription class and uses its own mechanism to trigger a guard condition and call the subscription callback. This even further obscures the received messages from the executor.

Our current thinking is to build something between the subscription and the application, in a similar manner to message_filters. At that layer, we receive all of the subscribed messages and can do the necessary buffering to allow for providing the messages to the application in a deterministic fashion independent of any non-deterministic behavior in the layers below (e.g. network ordering, wait-set/queuing effects, etc.).

Consider an example with a subscription that uses KEEP_LAST with depth=10 (typical default setting), and that between spins you receive a burst of 11 messages but then no more (to keep it simple).

This is actually something I've been wondering about with the EventsExecutor but haven't had an opportunity to evaluate. I had assumed that the event queue could end up with extraneous events if the RMW message buffers overflowed, so it's good to know that's the expected behavior.

It specifically does not try to guarantee there will be a message

Since you mentioned this, it raises a question I have related to https://github.com/ros2/rclcpp/pull/1276. If understand correctly, take_data was added to Waitable because the intra-process buffer doesn't follow the same lack of guarantee as the Subscription class. Would it not have been an option to allow the intra-process buffer to follow the same pattern as Subscription and just not return a message of one wasn't available? I ask mostly because the added requirement to allocate a buffer so it can be passed back to execute on every message/event is kind of a bummer and we've been looking into ways to reduce/eliminate dynamic memory allocations.