osrf / rmf_core

Provides the centralized functions of RMF: scheduling, etc.
Apache License 2.0

Address bug with task queue #265

Closed Yadunund closed 3 years ago

Yadunund commented 3 years ago

The current TaskManager implementation has a timer which periodically (every 1 second) calls _begin_next_task(). This function checks whether the deployment time for the front element of the task queue has passed and, if so, begins executing that task and then pops it from the queue.
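Roughly, the polling pattern looks like this minimal sketch (the class, member, and type names are illustrative stand-ins for the description above, not the actual rmf_core sources):

```cpp
// Minimal sketch of the polling behaviour; illustrative only.
#include <chrono>
#include <deque>
#include <functional>
#include <mutex>

struct PendingTask
{
  std::chrono::steady_clock::time_point deployment_time;
  std::function<void()> begin;  // moves the task to its "active" state
};

class TaskManagerSketch
{
public:
  // In the real implementation this is invoked by a 1-second wall timer.
  void _begin_next_task()
  {
    std::lock_guard<std::mutex> lock(_mutex);
    if (_queue.empty())
      return;

    // Only start the front task once its deployment time has passed.
    if (std::chrono::steady_clock::now() >= _queue.front().deployment_time)
    {
      _queue.front().begin();  // task becomes active...
      _queue.pop_front();      // ...and leaves the pending queue
    }
  }

private:
  std::mutex _mutex;
  std::deque<PendingTask> _queue;
};
```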

However, when a list of tasks with a start_time of "now" is submitted, a race condition can arise. Consider a fleet with two robots, RobotA and RobotB:

1. A BidProposal for TaskA is received and eventually queued in RobotA's TaskManager. The task should technically begin execution right away, but does not because the timer callback has not fired yet (we are still within the 1 s window).
2. A BidProposal for TaskB is received. Task assignments are replanned for a list comprising the new task, TaskB, plus any tasks still in the task queues of each robot in the fleet, so the set {TaskA, TaskB} is planned for.
3. Assume the planner returns [{RobotB: {TaskB, TaskA}}, {RobotA: {}}], i.e. RobotB is assigned to perform TaskB followed by TaskA, and RobotA is not assigned any task.
4. Before the task queues of each robot are overwritten with the new assignments (i.e. while waiting for the DispatchResult message from the Dispatcher node), RobotA's timer fires (it has been 1 s since "now"), so RobotA begins TaskA, which has by now technically been reassigned to RobotB.
5. When the DispatchResult message arrives, the task queues are finally overwritten, but too late.

As a result, TaskA will be performed twice: once by RobotA right now, and again by RobotB after it finishes TaskB.

This PR introduces a stopgap fix for the problem described above.

Yadunund commented 3 years ago

@mxgrey,

The only change, as you pointed out, is to call _begin_next_task() at the end of the set_queue() function. I indented the rest of the set_queue() function into its own scope so that _mutex is released before _begin_next_task(), which locks the same mutex, is called.
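In terms of the minimal sketch from the PR description above (illustrative names, not the actual diff), the shape of the change is roughly:

```cpp
// Member of the TaskManagerSketch class from the earlier sketch; illustrates
// the shape of the fix rather than the actual rmf_core code.
void set_queue(std::deque<PendingTask> new_queue)
{
  {
    // Overwrite the pending queue while holding the lock...
    std::lock_guard<std::mutex> lock(_mutex);
    _queue = std::move(new_queue);
  }  // ...and release it here, because _begin_next_task() re-locks _mutex.

  // New in this PR: try to start the front task immediately instead of
  // waiting up to one second for the next timer tick.
  _begin_next_task();
}
```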

Regarding how this solves the problem: it ensures that RobotA will immediately begin executing TaskA when it is added to the queue, instead of waiting for the timer callback. Once this task is moved to the "active" state, it is no longer in RobotA's queue. Thus, when a new TaskB request is received before the timer callback executes, the fleet adapter will solve the task allocation problem for the set of tasks {TaskB} as opposed to {TaskB, TaskA}.

The method of checking each item in the assignments, as proposed, will not work because each robot has its own TaskManager with its own task queue. RobotA can be assigned TaskA, and when the reshuffle happens RobotB gets assigned TaskA and TaskB. But RobotB's TaskManager has never encountered TaskA before (only RobotA's TaskManager has), so it will still treat it as a new assignment.

mxgrey commented 3 years ago

It ensures that RobotA will immediately begin executing TaskA when it is added to the queue.

This helps for this specific version of the race condition (which I suppose may be its most common manifestation), but it doesn't help with other versions, for example if TaskA won't be ready to start until later and gets reassigned because the adapter is awarded a bid that it had submitted before TaskA started to execute.

The method of checking each item in the assignments as proposed will not work as each robot has its own TaskManager with a task queue.

Right, so that filtering needs to be done before attempting to assign the tasks. In fact, the ideal system design would be to check whether any of the tasks mentioned in the bid have been started (or finished, although that's admittedly improbable), and do a replan if they have. If the replan is materially different from the original bid, then the dispatcher should be notified so it can reassign the task if necessary. But that's a much more complex system, so for now I'd suggest just doing a replan without notifying the dispatcher.
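A sketch of the kind of pre-assignment check being suggested here (the Assignment struct and the set of started/finished task IDs are assumptions for illustration, not existing rmf_core types):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical minimal representation of one task assigned to one robot.
struct Assignment
{
  std::string robot_name;
  std::string task_id;
};

// Returns true if none of the assigned tasks have already been started or
// finished, i.e. the assignment set is still safe to apply as-is.
bool assignments_valid(
  const std::vector<Assignment>& assignments,
  const std::unordered_set<std::string>& started_or_finished_task_ids)
{
  for (const auto& a : assignments)
  {
    if (started_or_finished_task_ids.count(a.task_id) > 0)
      return false;  // this task started/finished while the bid was open
  }
  return true;
}
```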

Yadunund commented 3 years ago

Thanks for explaining that @mxgrey. I see how there are other situations where the race condition can arise.

I'll make the change to replan right before assigning tasks if any of them has already started executing. However, wouldn't there still be a chance of a race condition during the time it takes to replan and re-assign? 🤔

mxgrey commented 3 years ago

However, wouldn't there still be a chance for a race condition during the time it takes to replan and re-assign?

If task starts and task assignments are processed sequentially (which I believe they should be, since we're using a single-threaded executor), then there shouldn't be a problem. In the worst case, one of the tasks included in the replanning is started while the replan is still being calculated. To deal with that case, when we get the replanned assignments we should once again check that none of the tasks in them have been started or finished (and trigger yet another replan if one has, now with the started task removed).

Since the task allocation planning should be able to finish on a relatively small timescale (seconds at worst?), the probability of repeatedly needing to replan because of tasks being started is very low. Additionally, each time that edge case occurs there are fewer tasks left to replan, so the replanning would get faster and faster until (in the most extreme case) only one task remains.

What I'm less clear on is how the behavior might play out if a new bid award comes in during the replanning. Does the dispatcher guarantee that each bid has to finish before the next bid can begin? E.g., if the dispatcher receives TaskRequestA and TaskRequestB (almost) simultaneously, is it guaranteed to finish assigning TaskRequestA before it begins the bid for TaskRequestB, or will it open bids for TaskRequestA and TaskRequestB in parallel? If it's the latter, then I'll need to look much more carefully at how the bid awards are being handled in general.

Yadunund commented 3 years ago

Gotcha. Will implement some form of recursive checking of task assignments.

Does the dispatcher guarantee that each bid has to finish before the next bid can begin?

Correct me if I'm wrong @tanyouliang95, but I believe the bid for TaskRequestB will only be sent after TaskRequestA has been awarded.

youliangtan commented 3 years ago

@Yadunund Yes, the dispatcher does issue a DispatchRequest before publishing a new BidNotice. But since they are different topics, I'm not quite sure the message ordering is always guaranteed.

mxgrey commented 3 years ago

But since they are different topics, I'm not quite sure the message ordering is always guaranteed.

Taking this into account, the most robust design possible would be to wait for an acknowledgment from the award recipient before beginning the next task bid.

Yadunund commented 3 years ago

Sure thing. We can have the fleet adapter publish a DispatchAcknowledgement message after it updates its assignments for an awarded bid. This message can contain a TaskID and a boolean field indicating whether the operation was successful.
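As a sketch, the proposed message could be as small as the following (the field names are placeholders; in practice this would be a ROS 2 .msg definition rather than a C++ struct):

```cpp
#include <string>

// Hypothetical shape of the proposed acknowledgment published by the fleet
// adapter after it has updated its assignments for an awarded bid.
struct DispatchAcknowledgement
{
  std::string task_id;   // the task whose award is being acknowledged
  bool success = false;  // true if the assignments were updated successfully
};
```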

Alternatively, the dispatcher can wait for the TaskSummary message with state Queued for the awarded task before beginning the bid for the next task.

Yadunund commented 3 years ago

@mxgrey, I've updated the bug fix to check the validity of the assignments before overwriting the queues in the TaskManagers. If the assignment set is no longer valid (one of the tasks has started executing), the assignments will be replanned and revalidated. Queues are overwritten only when revalidation passes; otherwise the process of replanning and revalidating is repeated until the assignments are valid.
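The control flow described here looks roughly like the following sketch (started_or_finished_task_ids(), replan_without(), and overwrite_task_manager_queues() are hypothetical stand-ins for the real planning and queue-update steps, not rmf_fleet_adapter APIs):

```cpp
// Illustrative control flow only; not the actual rmf_fleet_adapter code.
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Same hypothetical Assignment shape as in the earlier sketch.
struct Assignment
{
  std::string robot_name;
  std::string task_id;
};

// Hypothetical hooks into the rest of the fleet adapter.
std::unordered_set<std::string> started_or_finished_task_ids();
std::vector<Assignment> replan_without(
  const std::unordered_set<std::string>& exclude_task_ids);
void overwrite_task_manager_queues(const std::vector<Assignment>& assignments);

void handle_awarded_assignments(std::vector<Assignment> assignments)
{
  for (;;)
  {
    const auto done = started_or_finished_task_ids();

    // Valid only if no assigned task has already started or finished.
    const bool valid = std::none_of(
      assignments.begin(), assignments.end(),
      [&done](const Assignment& a) { return done.count(a.task_id) > 0; });

    if (valid)
      break;

    // Replan without the tasks that slipped through, then validate again,
    // since more tasks may have started while the replan was running.
    assignments = replan_without(done);
  }

  // Only now overwrite the TaskManager queues with the validated set.
  overwrite_task_manager_queues(assignments);
}
```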