Closed escapewindow closed 4 years ago
(assigning to self so I take a look and respond.. apologies for missing this!)
We definitely want to figure this out. We're also working to replace a lot of the core functionality of the Queue, which will have a profound impact on what is possible and practical. From what I remember of the original PR, what works now fulfills the needs of CoT, but the problem was about the level of support for the rerunTask endpoint going forward.
What I'd like to propose is that we commit to leaving rerunTask as it exists until we get to the point in the queue work that we need to figure out what to do. I'll make sure to involve @escapewindow and anyone else in Releng in the discussions about these endpoints.
Does that sound reasonable?
> (assigning to self so I take a look and respond.. apologies for missing this!)
No worries.
> What I'd like to propose is that we commit to leaving rerunTask as it exists until we get to the point in the queue work that we need to figure out what to do. I'll make sure to involve @escapewindow and anyone else in Releng in the discussions about these endpoints.
> Does that sound reasonable?
Sounds good to me! Agreed, there isn't as much urgency in adding these endpoints as long as the `rerunTask` endpoint is available.
Semi-related to this proposal: we were discussing the confusion between `rerun` and `retrigger`; I'm leaning towards renaming the `retrigger` action to `clone task(s)` to be more explicit about what it does.
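For anyone skimming the thread, here is a rough toy model of the distinction as described in this issue: a rerun keeps the same task and adds a run, while a retrigger (the proposed "clone task(s)") produces a copy under new IDs. This is only a sketch, not the Taskcluster API: the `Task` class, `new_id`, `rerun`, and `retrigger` helpers are hypothetical names, and the run states used are assumptions for illustration.

```python
"""Toy model contrasting a rerun with a retrigger (clone). Not the Taskcluster API."""
from dataclasses import dataclass, field
from itertools import count
from typing import List

_ids = count(1)


def new_id(prefix: str) -> str:
    """Hypothetical ID generator, used only to produce distinct IDs in this sketch."""
    return f"{prefix}-{next(_ids)}"


@dataclass
class Task:
    label: str
    task_id: str
    task_group_id: str
    # One entry per run; the list index plays the role of the runId (assumption).
    runs: List[str] = field(default_factory=lambda: ["failed"])


def rerun(task: Task) -> Task:
    """Same task, same IDs: a new run is appended, so the latest runId goes up by one."""
    task.runs.append("pending")
    return task


def retrigger(task: Task) -> Task:
    """A copy of the task under a fresh taskId and a fresh taskGroupId."""
    return Task(task.label, new_id("task"), new_id("group"), ["pending"])


original = Task("build-linux64", new_id("task"), new_id("group"))
clone = retrigger(original)  # new identity, original still shows as failed
rerun(original)              # same identity, second run added
```

After the clone, the original task's status is unchanged and the same label is attached to two different task IDs; after the rerun, the original task itself carries the new run.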
We've largely dealt with the issues behind this issue, by:
- `force: true`, and
- `taskcluster task rerun --force` option in taskcluster cli.
(discussion moved from https://github.com/taskcluster/taskcluster-queue/pull/292, re: `queue.rerunTask` deprecation.)
`retrigger`s are copies of the task under a new `taskId` and a new `taskGroupId`. They can still be CoT-verifiable, if they've been created through a CoT-verifiable retrigger action. Retriggers are preferred for tests because we sometimes create multiple runs of the same task to weed out intermittent failures.
For Firefox release graphs, we prefer `rerun`s, which increment the `runId` of the current task. There are some dangers here, largely around rerunning a successfully completed task, or the fact that reruns of a task will start the task, even if its dependencies haven't completed yet. (The latter is a bug that is also a feature; we use this feature when we intentionally want to start a task before its dependencies complete.)
In releases, we need the original `taskId` status to be definitive. We monitor releases by monitoring the `taskGroupId`. Retriggers would create a separate `taskGroupId` and complicate this. Even if they created the new task inside the same `taskGroupId`, we'd have two separate `taskId`s for the same task label, and seeing 10 failed tasks in the `taskGroupId` wouldn't necessarily mean we had 10 actionable tasks; we'd also have to check that they hadn't been superseded by a separate, newer `taskId`. Also, the way we trigger the next `phase` of the release (build -> promote -> push -> ship) is to point at the previous action task's `label-to-taskid.json` to find the previous tasks to depend on. These `taskId`s are the original `taskId`s, so any downstream graph would be depending on failed tasks even if we had retriggered them to success previously.
Because reruns are so important to our current Firefox release process, I would argue that we either shouldn't deprecate `queue.rerunTask`, or we should replace it with methods that cover its functionality but fix its issues. For example, `requeueBrokenTask` (or something) could mark a task as `unscheduled`, but only if its status is `exception` or `failed`. Marking it as `unscheduled` would (I think) make it still block on unresolved dependencies. `forceRerunTask` could mark any task as `pending`, regardless of task state or dependency state. A split like the above could allow for more granular permissioning, and help avoid footguns.
This RFC is to track adding the above two api methods. Once these are added and supported in the tooling, we can switch everything over and eventually remove `queue.rerunTask`.
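To make the proposed split a bit more concrete, here is a minimal in-memory sketch of how the two methods might behave. Nothing here exists in the Queue: `requeue_broken_task` and `force_rerun_task` are Python stand-ins for the proposed `requeueBrokenTask` and `forceRerunTask`, and the `Task` class, `TaskStateError`, and the `schedule_if_ready` helper (which models "still block on unresolved dependencies") are assumptions made purely for illustration.

```python
"""Sketch of the proposed requeueBrokenTask / forceRerunTask split. Hypothetical, not a real API."""
from dataclasses import dataclass, field
from typing import List


class TaskStateError(Exception):
    """Raised when a task is not in a state the operation accepts (hypothetical)."""


@dataclass
class Task:
    task_id: str
    state: str
    dependencies: List["Task"] = field(default_factory=list)


def requeue_broken_task(task: Task) -> None:
    """Mark a task as unscheduled, but only if it is in exception or failed."""
    if task.state not in ("exception", "failed"):
        raise TaskStateError(f"{task.task_id} is {task.state}, not exception/failed")
    task.state = "unscheduled"


def force_rerun_task(task: Task) -> None:
    """Mark any task as pending, regardless of task state or dependency state."""
    task.state = "pending"


def schedule_if_ready(task: Task) -> str:
    """Assumed scheduler step: unscheduled becomes pending once all dependencies completed."""
    if task.state == "unscheduled" and all(
        dep.state == "completed" for dep in task.dependencies
    ):
        task.state = "pending"
    return task.state


upstream = Task("upstream", "running")
broken = Task("broken", "failed", [upstream])

requeue_broken_task(broken)                 # allowed: the task had failed
state_while_blocked = schedule_if_ready(broken)  # upstream not done, so it stays unscheduled

force_rerun_task(broken)                    # skips the dependency check entirely
state_after_force = broken.state
```

Because the two operations are separate functions, the safe one could be granted broadly while the forcing one sits behind a narrower scope, which is the permissioning benefit argued for above.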