Closed Yadunund closed 3 years ago
@mxgrey,
The only change as you pointed out, is to call _begin_next_task()
at the end of the set_queue()
function. I indented the rest of the set_queue()
function to unlock the _mutex
that is locked again by the _begin_next_task()
function.
Regarding how this solves the problem:
It ensures that RobotA
will immediately begin executing TaskA
when it is added to the queue. (And not wait for the timer callback). Once this task is moved to "active" state, it will no longer be in the queue for RobotA
. Thus, when a new TaskB
request is received before the timer callback executes, the fleet adapter will solve the task allocation problem for the set of tasks {TaskB}
as opposed to {TaskB, TaskA}
.
The method of checking each item in the assignments as proposed will not work as each robot has its own TaskManager
with a task queue. So RobotA
can be assigned TaskA
and when the reshuffle happens, RobotB
gets assigned TaskA
and TaskB
. But RobotB
's TaskManager
has never encountered TaskA
before (only RobotA's
TaskManager
has) so it will still treat it as a new assignment.
It ensures that RobotA will immediately begin executing TaskA when it is added to the queue.
This helps for this specific version of the race condition (which I suppose may be the most common manifestation of the race condition), but it doesn't help with other versions of the race condition, for example if TaskA
won't be ready to start until later and gets reassigned because the adapter is awarded a bid that it had submitted before TaskA
started to execute.
The method of checking each item in the assignments as proposed will not work as each robot has its own TaskManager with a task queue.
Right, so that filtering needs to be done before attempting to assign the tasks. In fact, the ideal system design would be to check if any of the tasks mentioned in the bid have been started (or finished, although that's admittedly improbable), and do a replan if they have. If the replan is materially different than the original bid, then the dispatcher should be notified so it can reassign the task if necessary. But that's a much more complex system, so for now I'd suggest just doing a replan without notifying the dispatcher.
Thanks for explaining that @mxgrey. I see how there are other situations where the race condition can arise.
I'll make the changes to replan right before assigning tasks if there is a task that has started execution. However, wouldn't there still be a chance for a race condition during the time it takes to replan and re-assign? 🤔
However, wouldn't there still be a chance for a race condition during the time it takes to replan and re-assign?
If task starts and task assignments are processed sequentially (which I believe they should be since we're using a single-threaded executor) then there shouldn't be a problem. In the worst case scenario, maybe one of the tasks included in the replanning was started while the replanning was still being calculated. To deal with that case, when we get the replanned assignments, we should once again check that none of the tasks in the assignment have been started or finished (and trigger yet another replanning if one has, now with the started task removed).
Since the task allocation planning should be able to finish on a relatively small timescale (seconds at worst?) the probability of repeatedly needing to replan because of tasks being started is very very low. Additionally, each time that edge case occurs, there are fewer tasks that need to be replanned, so the replanning would get faster and faster until (in the most extreme case) only one task remains.
What I'm less clear on is how the behavior might play out if a new bid award comes in during the replanning. Does the dispatcher guarantee that each bid has to finish before the next bid can begin. E.g. if the dispatcher receives TaskRequestA
and TaskRequestB
(almost) simultaneously, is it guaranteed to finish assigning TaskRequestA
before it begins the bid for TaskRequestB
, or will it open bids for TaskRequestA
and TaskRequestB
in parallel? If it's the latter, then I'll need to look much more carefully at how the bid awards are being handled in general.
Gotcha. Will implement some form of recursive checking of task assignments.
Does the dispatcher guarantee that each bid has to finish before the next bid can begin.
Correct me if i'm wrong @tanyouliang95 , but I believe the bid for TaskRequestB
will only be sent after the TaskRequestA
has been awarded.
@Yadunund Yes, the dispatcher does issue a DispatchRequest
before publishing a new BidNotice
. But since they are different topics, I'm not quite sure the msgs sequence is always guaranteed.
But since they are different topics, I'm not quite sure the msgs sequence is always guaranteed.
Taking this into account, the most robust design possible would be to wait for an acknowledgment from the award recipient before beginning the next task bid.
Sure thing. We can have the fleet adapter publish a DispatchAcknowledgement
message after it updates its assignments for an awarded bid. This message can contain a TaskID
and a boolean
field indicating whether the operation was successful.
Alternatively, the dispatcher can wait for the TaskSummary
message with state Queued
for the awarded task before beginning the bid for the next task.
@mxgrey ,
I've updated the bug fix to check for validity of assignments before overwriting the queues in the TaskManagers
. If the assignment set is no longer valid (one of the tasks has started execution), the assignments will be re-planned and re-validated. Queues are overwritten only when re-validation passes. If not, the process of re-planning and re-validating will be repeated until the assignments are valid.
The current
TaskManager
implementation has a timer which periodically (every 1 second) calls_begin_next_task()
. This function checks whether thedeployment time
for the front element of the task queue has passed and if so begins execution of the same (and then pops it from the queue).However when a list of tasks with
start_time
of "now" is submitted, a potential race condition may arise: Consider a fleet with two robots RobotA & RobotB. ABidProposal
forTaskA
is received and is eventually queued in theTaskManager
ofRobotA
. The task should technically begin execution but does not as the timer callback has not been called yet. (Still within 1s window) A secondBidProposal
forTaskB
is received. Here the task assignments will be replanned for a list comprising of the new task,TaskB
, and any tasks still in the task queues for each robot in the fleet. So in this case the set of tasks{TaskA, TaskB}
will be planned for. Assume the assignment returns such that[{RobotB: {TaskB, TaskA}}, {RobotA:{}}]
, ie, RobotB is assigned to perform TaskB followed by TaskA and RobotA is not assigned any of the tasks. Now before the task queues of each robot are overwritten with the new assignments (ie, while waiting for theDispatchResult
message from the Dispatcher node), the timer for RobotA executes (it has been 1s since "now") and so RobotA begins TaskA. (which has now technically been reassigned to RobotB). Then when theDispatchResult
message arrives, the tasks queues are finally overwritten (too late). As a resultTaskA
will be performed twice: Once byRobotA
right now, and again byRobotB
after it finishesTaskB
.This PR introduces a stopgap fix for the problem described above.