taskcluster / taskcluster-rfcs

Taskcluster team planning
Mozilla Public License 2.0

github repo specific schedulerId #143

Open petemoore opened 5 years ago

petemoore commented 5 years ago

This issue has been extracted (and slightly rewritten) from this comment of issue #16.

Tasks created by taskcluster-github have "schedulerId": "taskcluster-github".

Cancelling a task created by taskcluster-github requires the scope queue:cancel-task:taskcluster-github/<taskGroupId>/<taskId>. Since taskGroupId and taskId do not follow a repo-specific naming pattern, queue:cancel-task:taskcluster-github/* is the only scope grant that lets a client cancel any taskcluster-github task for a given repo, and it cannot be restricted to an individual github repo.

Using a unique schedulerId per github repo would lift this limitation. If tasks created for repo github.com/foo/bar had (e.g.) "schedulerId": "github-foo-bar", then cancelling a task would require queue:cancel-task:github-foo-bar/<taskGroupId>/<taskId> rather than queue:cancel-task:taskcluster-github/<taskGroupId>/<taskId>. It would then be relatively straightforward to grant queue:cancel-task:github-foo-bar/* to roles/clients that should be able to cancel any task for this repo only. They would not be able to cancel tasks for other github repos, as they can today.
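
To make the difference concrete, here is a minimal Go sketch of how a trailing `*` grant behaves. It assumes the usual Taskcluster rule that a scope ending in `*` satisfies any scope sharing the prefix before the `*`. The helpers `satisfies` and `cancelScope`, the schedulerId `github-other-repo`, and the placeholder IDs are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// satisfies reports whether the granted scope covers the required scope.
// Assumption: a granted scope ending in "*" matches any scope that starts
// with the text before the "*"; otherwise the two must be identical.
func satisfies(granted, required string) bool {
	if strings.HasSuffix(granted, "*") {
		return strings.HasPrefix(required, strings.TrimSuffix(granted, "*"))
	}
	return granted == required
}

// cancelScope builds the scope needed to cancel a single task.
func cancelScope(schedulerID, taskGroupID, taskID string) string {
	return "queue:cancel-task:" + schedulerID + "/" + taskGroupID + "/" + taskID
}

func main() {
	// Placeholder IDs; real taskGroupId/taskId values are opaque slugs.
	fooBar := cancelScope("github-foo-bar", "someGroupId", "someTaskId")
	other := cancelScope("github-other-repo", "otherGroupId", "otherTaskId")

	grant := "queue:cancel-task:github-foo-bar/*"
	fmt.Println(satisfies(grant, fooBar)) // true: the grant covers tasks for foo/bar
	fmt.Println(satisfies(grant, other))  // false: other repos are not covered
}
```

With a shared schedulerId of taskcluster-github, both required scopes would start with the same prefix, so no single `*` grant could tell the two repos apart.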

Note, one complication is that schedulerIds are currently limited to ^([a-zA-Z0-9-_]*)$ with a maximum length of 38 chars, so the github org/user + repository name cannot simply be embedded in the schedulerId, since the result will not necessarily comply with the required schedulerId pattern. We should therefore define the schedulerId as a function of the org/user and repo name that satisfies the following properties:

  1. (Required) It always returns a schedulerId that conforms to the required regexp for schedulerId.
  2. (Required) It returns a schedulerId that is unique per repo.
  3. (Preferred) The github org/user and repo name are reasonably easy to determine from the schedulerId (i.e. the function is reverse-engineerable), or if not, it is a simple and well-defined lexical function that users could implement themselves to predict the schedulerId in any tooling they may wish to create.
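
Property 1 can be checked mechanically. The following is a minimal Go sketch, assuming the pattern and the 38-char limit quoted above are the complete constraint; `validSchedulerID`, `maxSchedulerIDLen` and `schedulerIDPattern` are hypothetical names, not part of any Taskcluster library.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxSchedulerIDLen is the length limit quoted in this issue (an assumption
// about the deployed schema; adjust it if the limit differs).
const maxSchedulerIDLen = 38

// schedulerIDPattern is equivalent to the pattern quoted above, with the
// literal "-" moved to the end of the character class for clarity.
var schedulerIDPattern = regexp.MustCompile(`^[a-zA-Z0-9_-]*$`)

// validSchedulerID reports whether id satisfies both the pattern and the
// length limit.
func validSchedulerID(id string) bool {
	return len(id) <= maxSchedulerIDLen && schedulerIDPattern.MatchString(id)
}

func main() {
	fmt.Println(validSchedulerID("taskcluster-github")) // true
	fmt.Println(validSchedulerID("github-foo-bar"))     // true
	fmt.Println(validSchedulerID("foo/bar.js"))         // false: "/" and "." are not allowed
}
```

A check like this could run in a unit test against whatever mapping function is eventually chosen.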

One example of such a function (written in Go for illustration) could be the schedulerId function below:

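import (
    "crypto/sha256"
    "fmt"
)

// schedulerId derives a schedulerId from a github org/user and repo name.
// The result is at most 38 chars: "gh-" plus up to 35 more.
func schedulerId(userOrOrg, repoName string) string {
    qualifiedRepo := stripASCII(userOrOrg) + "-" + stripASCII(repoName)
    if len(qualifiedRepo) <= 35 {
        return "gh-" + qualifiedRepo
    }
    // Too long: keep the first 30 chars and append 5 hex digits of the hash.
    return "gh-" + qualifiedRepo[0:30] + hash(qualifiedRepo)[0:5]
}

// hash returns the hex-encoded SHA-256 digest of orig.
func hash(orig string) (hashed string) {
    return fmt.Sprintf("%x", sha256.Sum256([]byte(orig)))
}

// stripASCII drops every character that is not an ASCII letter or digit.
func stripASCII(orig string) (stripped string) {
    for _, char := range orig {
        if (char >= '0' && char <= '9') || (char >= 'a' && char <= 'z') || (char >= 'A' && char <= 'Z') {
            stripped += string(char)
        }
    }
    return
}

djmitche commented 5 years ago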

The character limit is 38 now.

Note that there is another threat to uniqueness: we want to prevent, for example, someone creating an org named taskcluster-generic and a repo named worker and getting the same schedulerId as taskcluster/generic-worker.
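
This concern can be checked against the function as posted above. The sketch below copies that function unchanged and runs it on the pair named in this comment, plus a second pair (`foo-bar`/`foobar`, hypothetical names chosen for illustration) that differs only in a character the function strips.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// schedulerId and its helpers are copied from the proposal above.
func schedulerId(userOrOrg, repoName string) string {
	qualifiedRepo := stripASCII(userOrOrg) + "-" + stripASCII(repoName)
	if len(qualifiedRepo) <= 35 {
		return "gh-" + qualifiedRepo
	}
	return "gh-" + qualifiedRepo[0:30] + hash(qualifiedRepo)[0:5]
}

func hash(orig string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(orig)))
}

func stripASCII(orig string) (stripped string) {
	for _, char := range orig {
		if (char >= '0' && char <= '9') || (char >= 'a' && char <= 'z') || (char >= 'A' && char <= 'Z') {
			stripped += string(char)
		}
	}
	return
}

func main() {
	// The pair from the comment above: the inserted "-" lands in a
	// different place, so these two stay distinct.
	fmt.Println(schedulerId("taskcluster-generic", "worker")) // gh-taskclustergeneric-worker
	fmt.Println(schedulerId("taskcluster", "generic-worker")) // gh-taskcluster-genericworker

	// Hypothetical pair differing only in a stripped character: both
	// collapse to the same schedulerId.
	fmt.Println(schedulerId("foo-bar", "baz")) // gh-foobar-baz
	fmt.Println(schedulerId("foobar", "baz"))  // gh-foobar-baz
}
```

So the posted function keeps this particular pair apart, but any two names that differ only in stripped characters still collide. Because the hash is computed over the stripped name, it does not help here either; hashing the unstripped org/user and repo name would be one possible way to address that.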

djmitche commented 5 years ago

Oops, I thought I marked for @owlishDeveloper's review but it's not a PR. Anyway, please take a look!