Document/Clarify executor processing order

tobiasblass commented 5 years ago

We recently investigated the callback execution order of the default (single-threaded) executor, and found some very surprising behavior which might or might not be intended. The execution order of the executor is very interesting for real-time and control software, as you would like to control or at least predict the latency between the topic update and the completion of the associated callback.

None of the properties described below is necessarily bad, but they are surprising and (as far as I can see) undocumented. I think a good first step here is to document this behavior (I would be happy to send a pull request documenting this in the source if you are interested), and then to discuss whether this is the intended behavior or a bug.

There are three properties of the execution order we were surprised by: (Note: for brevity, I'll only mention subscriptions and topics below. This equally holds for clients and services).

The rmw layer is not considered unless the executor is currently idle.

If you take a look at get_next_subscription, for example, it only considers subscriptions that are in the this->subscription_handles_ list. This list is filled by collect_entities, which is in turn called by Executor::wait_for_work. As the name suggests, wait_for_work is only called if no work is available, and it is the only function that communicates with rmw and thus the DDS backend.

In practice, this means that newly arrived messages cannot be processed until all messages picked up during the last wait_for_work have been processed. If one wait_for_work call found a large number of long-running callbacks, anything arriving later (even only shortly later) has to wait until all of these are completed. (Note that this does not hold for timers, since the memory_strategy is not involved there. They are able to run as soon as they trigger.)
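To make the consequence concrete, here is a minimal toy model of this behavior. It is not rclcpp code: ToyExecutor, middleware, ready and run_snapshot_scenario are names I made up for illustration, and callbacks are represented only by their names. The only thing it tries to capture is that the ready set is refreshed solely when it is empty.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Toy model, NOT rclcpp code. `middleware` stands in for rmw/DDS and
// `ready` for the handles collected by the last wait_for_work call.
struct ToyExecutor {
  std::deque<std::string> middleware;  // arrived, but not yet seen by the executor
  std::deque<std::string> ready;       // snapshot taken by the last wait_for_work

  // Only called when `ready` is empty, mirroring the behavior described above.
  void wait_for_work() {
    assert(ready.empty());
    ready.swap(middleware);
  }

  // Returns the name of the next callback to run, or "" if nothing is pending.
  std::string spin_once() {
    if (ready.empty()) {
      wait_for_work();
    }
    if (ready.empty()) {
      return "";
    }
    std::string next = ready.front();
    ready.pop_front();
    return next;
  }
};

// Runs a small scenario and returns the order in which callbacks ran.
std::vector<std::string> run_snapshot_scenario() {
  ToyExecutor exec;
  exec.middleware = {"slow_a", "slow_b", "slow_c"};
  std::vector<std::string> order;
  order.push_back(exec.spin_once());    // snapshot is taken here, runs slow_a
  exec.middleware.push_back("urgent");  // arrives while slow_a is still running
  for (std::string cb = exec.spin_once(); !cb.empty(); cb = exec.spin_once()) {
    order.push_back(cb);
  }
  return order;
}
```

In this sketch, "urgent" arrives right after the first callback starts, yet it runs last, because the middleware is not consulted again until slow_b and slow_c from the earlier snapshot are done.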

Messages are prioritized by message type

The next property is already described by my colleague in this issue. In short, the updates picked up during a wait_for_work call are implicitly prioritized first by type (timers, then subscriptions, then services, then clients, then waitables) and then by registration order.
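As a rough illustration of the resulting order, the sketch below sorts a set of ready entities by (type, registration order). This is only a model of the observable outcome: as far as I can tell the executor does not sort anything, it simply checks each entity type in turn. Kind, ReadyEntity and execution_order are hypothetical names, not rclcpp types.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <tuple>
#include <vector>

// Entity types in the implicit priority order described above.
enum class Kind { Timer = 0, Subscription, Service, Client, Waitable };

// Hypothetical description of an entity found ready by one wait_for_work call.
struct ReadyEntity {
  Kind kind;
  int registration_index;  // position in which the entity was registered
  std::string name;
};

// Returns the names in the order the callbacks would run under this model:
// first by entity type, then by registration order within a type.
std::vector<std::string> execution_order(std::vector<ReadyEntity> ready) {
  std::stable_sort(ready.begin(), ready.end(),
                   [](const ReadyEntity& a, const ReadyEntity& b) {
                     return std::tie(a.kind, a.registration_index) <
                            std::tie(b.kind, b.registration_index);
                   });
  std::vector<std::string> names;
  for (const ReadyEntity& e : ready) {
    names.push_back(e.name);
  }
  assert(names.size() == ready.size());
  return names;
}
```

For example, if a client response, two subscriptions and a timer are all ready in the same wait set, this model runs the timer first and the client last, regardless of the order in which their events actually arrived.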

Only one message instance per topic can be processed per wait_for_work

rmw_wait (and thus rcl_wait and the executor) only returns whether a topic has new messages, not how many. This means that only one update per topic is processed per wait_for_work call. You can see this nicely here, where the detected subscription is removed from the subscription_handles_ list without checking whether there are more updates pending.

In practice, this means that if you have multiple updates for the same topic in quick succession, you require that many wait_for_work calls to process them all. In combination with the previous property, this means that the later updates in a burst might have to wait for updates on multiple other topics, even if these arrive after the burst.
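The interaction can be sketched with another toy model (again not rclcpp code; ToyTopic, wait_for_work, process_wait_set and run_burst_scenario are made-up names). Each wait set only records which topics have data, and each ready topic gets exactly one callback per wait set. The topic "other" is assumed to be registered before "burst".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a subscription and its middleware queue.
struct ToyTopic {
  std::string name;
  int queued;  // messages waiting in the middleware
  int taken;   // messages already handed to the callback
};

// Returns the indices of topics with at least one queued message.
// Like rmw_wait in the description above, it reports a flag, not a count.
std::vector<std::size_t> wait_for_work(const std::vector<ToyTopic>& topics) {
  std::vector<std::size_t> ready;
  for (std::size_t i = 0; i < topics.size(); ++i) {
    if (topics[i].queued > 0) {
      ready.push_back(i);
    }
  }
  return ready;
}

// Processes one wait set: one message per ready topic, in registration order.
void process_wait_set(std::vector<ToyTopic>& topics, std::vector<std::string>& log) {
  for (std::size_t i : wait_for_work(topics)) {
    ToyTopic& t = topics[i];
    assert(t.queued > 0);
    --t.queued;
    ++t.taken;
    log.push_back(t.name + "#" + std::to_string(t.taken));
  }
}

// A burst of three messages on "burst", with "other" publishing afterwards.
std::vector<std::string> run_burst_scenario() {
  std::vector<ToyTopic> topics = {{"other", 0, 0}, {"burst", 3, 0}};
  std::vector<std::string> log;
  process_wait_set(topics, log);  // only "burst" is ready
  ++topics[0].queued;             // "other" publishes after the burst
  process_wait_set(topics, log);
  ++topics[0].queued;             // "other" publishes again
  process_wait_set(topics, log);
  return log;
}
```

Under these assumptions the third message of the burst is handled only after two "other" messages that were published later, which is the effect described above.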

nabeelsherazi commented 3 years ago

Has this been written anywhere? Our team is interested in this as well, and would love to see any documentation available on the executors.

clalancette commented 3 years ago

As it stands, we don't have anything written down that substantially documents this. It's on our long-term list to get done, but if you'd like to dig into it we'd be happy to review documentation for it.

ros2 / rclcpp