Add a timeout before automatically releasing lift

mxgrey commented 5 months ago

It's been discovered that a race condition can cause a deadlock when lift session usage is combined with lane mutexes.

If a robot locks the lane mutexes for the lift and begins a lift session, but a replan occurs before the robot enters the lift, there is a narrow window in which the robot might automatically drop the lift session. If another robot is summoning the lift at the same time, that other robot could take over the lift session. However, the first robot will still be holding the lane mutexes. At that point, the second robot holds the lift session and waits to lock the lane mutexes, while the first robot holds the lane mutexes and waits to lock the lift session, so neither can proceed.
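The circular wait described above can be made concrete with a toy wait-for graph. This is only an illustrative sketch: the names `robot_A`, `robot_B`, `holders`, `waiting_for`, and `find_deadlock` are hypothetical and are not taken from rmf_ros2.

```python
# Toy model of the deadlock (all names are hypothetical, not rmf_ros2 APIs).
# Each resource has exactly one holder, and each blocked robot waits on one resource.
holders = {"lane_mutex": "robot_A", "lift_session": "robot_B"}
waiting_for = {"robot_A": "lift_session", "robot_B": "lane_mutex"}


def find_deadlock(holders, waiting_for, start):
    """Follow robot -> wanted resource -> holder of that resource.

    Returns the list of robots forming a cycle, or None if the chain ends.
    """
    path = []
    robot = start
    while robot is not None and robot not in path:
        path.append(robot)
        resource = waiting_for.get(robot)  # None if this robot is not blocked
        robot = holders.get(resource)      # None if nobody holds that resource
    if robot is None:
        return None
    return path[path.index(robot):]


cycle = find_deadlock(holders, waiting_for, "robot_A")
print(cycle)
```

If `robot_A` were not waiting on anything (for example because it never lost the lift session), the chain would terminate and `find_deadlock` would return `None`.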

The automatic dropping of the lift session happens because of a blunt-force mechanism that tries to identify when a robot is holding onto a lift session without really needing it. When a replan occurs while the robot is outside of the lift, it creates a very narrow window in which that mechanism will pick up a false positive and trigger the release. This PR attempts to soften that mechanism by requiring a 30-second window to pass before doing the release automatically, so that a quick replan will not trigger it.
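As a rough illustration of the idea (not the actual change in this PR), the delay can be thought of as a debounce: the release only fires once the "session looks unneeded" condition has held continuously for 30 seconds. The class name `DelayedLiftRelease`, its method, and the choice to reset the timer whenever the condition clears are all assumptions made for this sketch.

```python
class DelayedLiftRelease:
    """Hypothetical debounce helper, not the rmf_ros2 implementation.

    Reports that a release is due only after the session has looked
    unneeded continuously for `delay_s` seconds.
    """

    def __init__(self, delay_s=30.0):
        self.delay_s = delay_s
        self._unneeded_since = None  # time at which the condition first held

    def should_release(self, looks_unneeded, now):
        if not looks_unneeded:
            # Condition cleared (e.g. the replan finished): restart the timer.
            self._unneeded_since = None
            return False
        if self._unneeded_since is None:
            self._unneeded_since = now
        return now - self._unneeded_since >= self.delay_s


guard = DelayedLiftRelease()
print(guard.should_release(True, now=0.0))    # a replan just made the session look unneeded
print(guard.should_release(False, now=2.0))   # the new plan needs the lift again
print(guard.should_release(True, now=40.0))   # condition holds again, timer restarts
print(guard.should_release(True, now=70.0))   # 30 s of continuous "unneeded"
```

Under these assumptions, a quick replan that resolves within a couple of seconds never reaches the threshold, while a session that is genuinely abandoned is still released eventually.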

Remaining Issues / Rationale:

- In theory, if something else blocks the fleet adapter for more than 30 seconds, the issue can still happen. However, if the fleet adapter is blocked for 30 seconds, then something much worse than this is happening and the system will likely need intervention anyway.
- We do not generally rely on this automatic release mechanism to end lift sessions. There are two other, more finely tuned mechanisms for knowing when a lift should be released, and the mechanism affected by this PR is only meant to be a desperate last resort. Adding a 30-second delay to mitigate harmful unintended effects seems like a reasonable measure before resorting to this mechanism.
- I've considered whether we can introduce a blunt-force mechanism to catch and resolve this deadlock situation in general, but every approach I can think of carries a risk of introducing other deadlocks due to race conditions in other situations.

This case will be taken into serious consideration as we work on the next generation traffic + resource locking mechanisms.

cwrx777 commented 3 months ago

Is this PR good to merge to main?

mxgrey commented 3 months ago

Thanks for the reminder @cwrx777

open-rmf / rmf_ros2

Add a timeout before automatically releasing lift #369