add `queue.requeueBrokenTask` and `queue.forceRerunTask`

escapewindow commented 6 years ago

(discussion moved from https://github.com/taskcluster/taskcluster-queue/pull/292, re: queue.rerunTask deprecation.)

Retriggers are copies of the task under a new taskId and a new taskGroupId. They can still be CoT-verifiable if they've been created through a CoT-verifiable retrigger action. Retriggers are preferred for tests because we sometimes create multiple runs of the same task to weed out intermittent failures.

For Firefox release graphs, we prefer reruns, which increment the runId of the current task. There are some dangers here, largely around rerunning a successfully completed task, or the fact that reruns of a task will start the task, even if its dependencies haven't completed yet. (The latter is a bug that is also a feature; we use this feature when we intentionally want to start a task before its dependencies complete.)
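To make the rerun/retrigger distinction concrete, here is a minimal in-memory sketch. The `Task` class, `new_id`, and both functions are hypothetical illustrations, not the Queue's actual data model or API; the only behavior taken from the description above is that a retrigger gets a new taskId and taskGroupId, while a rerun keeps both and increments the runId.

```python
import uuid
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    # Hypothetical model: one state string per run, so list index == runId.
    label: str
    task_id: str
    task_group_id: str
    runs: List[str] = field(default_factory=list)


def new_id() -> str:
    # Stand-in for a real Taskcluster slugid (assumption: any unique string will do here).
    return uuid.uuid4().hex


def rerun(task: Task) -> Task:
    """Rerun: same taskId and taskGroupId; a new run is appended, so runId increments."""
    # The new run goes straight to pending, mirroring the "starts even if
    # dependencies haven't completed" behavior described above.
    task.runs.append("pending")
    return task


def retrigger(task: Task) -> Task:
    """Retrigger: a copy of the task under a new taskId and a new taskGroupId."""
    return Task(label=task.label, task_id=new_id(), task_group_id=new_id(), runs=["pending"])


original = Task("build-linux64", new_id(), new_id(), runs=["failed"])
copy = retrigger(original)          # different taskId, different taskGroupId
rerun(original)                     # same taskId, second run
latest_run_id = len(original.runs) - 1
```

In this model, anything watching `original.task_group_id` sees the rerun but never sees `copy`.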

In releases, we need the status of the original taskId to be definitive. We monitor releases by monitoring the taskGroupId. Retriggers would create a separate taskGroupId and complicate this. Even if they created the new task inside the same taskGroupId, we'd have two separate taskIds for the same task label, and seeing 10 failed tasks in the taskGroupId wouldn't necessarily mean we had 10 actionable tasks; we'd also have to check whether each one had been superseded by a separate, newer taskId. Also, the way we trigger the next phase of the release (build -> promote -> push -> ship) is to point at the previous action task's label-to-taskid.json to find the previous tasks to depend on. These taskIds are the original taskIds, so any downstream graph would be depending on failed tasks even if we had retriggered them to success previously.
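The downstream-dependency problem can be sketched with plain dictionaries. Everything here is hypothetical illustration: the task IDs, labels, and both helper functions are made up, and the mapping only mimics the general shape of a label-to-taskid.json file.

```python
from typing import Dict, List


def actionable_failures(statuses: Dict[str, str]) -> List[str]:
    """List taskIds whose status is failed or exception."""
    return sorted(tid for tid, state in statuses.items() if state in ("failed", "exception"))


def downstream_dependencies(label_to_taskid: Dict[str, str], labels: List[str]) -> List[str]:
    """The next phase depends on the taskIds recorded by the previous action task."""
    return [label_to_taskid[label] for label in labels]


# Hypothetical data from a previous phase.
label_to_taskid = {"build-linux64": "taskA", "build-win64": "taskB"}
statuses = {"taskA": "failed", "taskB": "completed"}

# Retrigger: the build succeeds under a new taskId that label-to-taskid.json never records.
statuses_after_retrigger = dict(statuses, taskC="completed")
deps = downstream_dependencies(label_to_taskid, ["build-linux64"])
still_failed = [t for t in deps if statuses_after_retrigger[t] != "completed"]

# Rerun: the original taskId itself is resolved, so the mapping stays valid.
statuses_after_rerun = dict(statuses, taskA="completed")
```

After the retrigger, `still_failed` still contains `taskA`, so the downstream graph would depend on a failed task; after the rerun, `actionable_failures(statuses_after_rerun)` is empty.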

Because reruns are so important to our current Firefox release process, I would argue that we either shouldn't deprecate queue.rerunTask, or we should replace it with methods that preserve its functionality but fix its issues. For example, requeueBrokenTask (or something) could mark a task as unscheduled, but only if its status is exception or failed. Marking it as unscheduled would (I think) make it still block on unresolved dependencies. forceRerunTask could mark any task as pending, regardless of task state or dependency state. A split like the above could allow for more granular permissioning and help avoid footguns.
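A rough sketch of the proposed split, written as state transitions on an in-memory status object. Neither method exists in the Queue; `TaskStatus`, `InvalidTransition`, and the function names are hypothetical, and incrementing the run count on a forced rerun is an assumption based on how reruns behave today.

```python
from dataclasses import dataclass

# Only tasks in these states may be requeued (per the proposal above).
BROKEN_STATES = {"failed", "exception"}


class InvalidTransition(Exception):
    """Raised when a task is not in a state that allows the requested transition."""


@dataclass
class TaskStatus:
    # Hypothetical stand-in for a task's status record.
    task_id: str
    state: str  # e.g. "unscheduled", "pending", "running", "completed", "failed", "exception"
    run_count: int = 1


def requeue_broken_task(status: TaskStatus) -> TaskStatus:
    """Proposed requeueBrokenTask: failed/exception -> unscheduled, nothing else."""
    if status.state not in BROKEN_STATES:
        raise InvalidTransition(f"{status.task_id} is {status.state}, not failed or exception")
    # Unscheduled (rather than pending) so the task would still wait on its dependencies.
    status.state = "unscheduled"
    return status


def force_rerun_task(status: TaskStatus) -> TaskStatus:
    """Proposed forceRerunTask: any state -> pending, ignoring dependencies."""
    status.state = "pending"
    status.run_count += 1  # assumption: a forced rerun starts a new run immediately
    return status
```

Keeping the two transitions in separate functions is what would let each be guarded by its own permission: the safe requeue could be granted broadly, the forced rerun narrowly.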

This RFC is to track adding the above two api methods. Once these are added and supported in the tooling, we can switch everything over and eventually remove queue.rerunTask.

djmitche commented 6 years ago

(assigning to self so I take a look and respond.. apologies for missing this!)

jhford commented 6 years ago

We definitely want to figure this out. We're also working to replace a lot of the core functionality of the Queue, which will have a profound impact on what is possible and practical. From what I remember of the original PR, what works now fulfills the needs of CoT, but the problem was about the level of support for the rerunTask endpoint going forward.

What I'd like to propose is that we commit to leaving rerunTask as it exists until we get to the point in the queue work that we need to figure out what to do. I'll make sure to involve @escapewindow and anyone else in Releng in the discussions about these endpoints.

Does that sound reasonable?

escapewindow commented 6 years ago

> (assigning to self so I take a look and respond.. apologies for missing this!)

No worries.

> What I'd like to propose is that we commit to leaving rerunTask as it exists until we get to the point in the queue work that we need to figure out what to do. I'll make sure to involve @escapewindow and anyone else in Releng in the discussions about these endpoints.
>
> Does that sound reasonable?

Sounds good to me! Agreed, there isn't as much urgency in adding these endpoints as long as the rerunTask endpoint is available.

Semi-related to this proposal: we were discussing the confusion between rerun and retrigger; I'm leaning towards renaming the retrigger action to "clone task(s)" to be more explicit about what it does.

escapewindow commented 4 years ago

We've largely addressed the problems behind this issue by:

- adding logic to the in-tree rerun and retrigger tasks to prevent the wrong behavior unless we set `force: true`, and
- adding a `taskcluster task rerun --force` option in the taskcluster CLI.

taskcluster / taskcluster-rfcs

add `queue.requeueBrokenTask` and `queue.forceRerunTask` #129