ros2 / rmw_zenoh

RMW for ROS 2 using Zenoh as the middleware
Apache License 2.0
184 stars 34 forks source link

Fix a race condition in rmw_wait. #211

Closed clalancette closed 3 months ago

clalancette commented 3 months ago

In the current code, there is a race condition in rmw_wait. This happens because of the following sequence:

  1. Without taking a lock, we check if any of the entities in the wait_set are ready, and if not attach the condition_variable to it.
  2. Then we take the lock, and go to sleep on the condition_variable.

However, doing step 1 takes time, and we check in a particular order: guard_conditions, events, subscriptions, services, clients. The race happens because a subscription may come in after we've checked subscriptions in the above list (while we are checking services and clients). In that case, we'll unnecessarily go to sleep on the condition_variable, even though something is ready. This increases the latency.

To solve the issue, we still do all of the checking and attaching unlocked. However, we update the "notify" method of each entity so that it takes the condition_variable lock, and in addition to kicking the condition_variable it sets a boolean in the wait_set structure to "true". Then, in rmw_wait(), after we have finished attaching, we take the lock, and set a predicate on the condition_variable so that it will quit if wait_set_data->triggered is true. Thus, if one of the entities became ready after we checked, we'll notice it and not go to sleep.

This also allows us to deal with "spurious" wakeups, where the condition_variable was woken up even though nothing in this wait_set became ready.

clalancette commented 3 months ago

I should also mention the invaluable thoughts of @wjwwood and @mjcarroll when coming up with this.

clalancette commented 3 months ago

Ready for another round of review!