oxidecomputer / buildomat

a software build labour-saving device
Mozilla Public License 2.0

GitHub access tokens only last one hour, and this is not always long enough #19

Closed: jclulow closed this 1 year ago

jclulow commented 1 year ago

Today, Wollongong creates buildomat jobs in response to GitHub webhook signalling about Check Suites and Check Runs. Wollongong generates an ephemeral and read-only GitHub authentication token to include in the parameters for each job that is created, with the minimal access required to perform each CI job; i.e., to clone (but not modify) any private repositories to which the job should have access. These tokens come from the GitHub API and last for around one hour, before GitHub expires them. The lifetime cannot be extended or altered.

In the beginning, there was no support for complex pipelines with inter-job dependencies. Jobs would go straight from scheduling to being in the Queued state, until capacity emerged to execute them.

Now, we have pipelines of increasing depth, and some jobs in those pipelines are in contention for limited resources. Sometimes a job that needs to use its private token will not begin running until more than an hour after it was initially scheduled. This means the token that was included in the job may well have expired by the time we go to use it. If that job needs to clone private repositories or use authenticated access to the GitHub API, it will fail. Retrying just that job (by clicking Retry in the GitHub Check Run interface) will generally succeed, because we create a new token and include it in the newly scheduled job, which can often start right away.

As is a general theme in this area, we must paper over the deficiency in the GitHub API: it is not possible to request a token that expires after, say, 8 hours or 24 hours, as we would really like to do. Instead, we'll need to provide a new endpoint for the buildomat agent to use to request the ephemeral token once the job is executing. This request for a GitHub token would be authenticated by the ephemeral credentials that the agent in a given worker negotiates with the core buildomat server, and which are recycled after the job completes.
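As a rough illustration of that endpoint idea, here is a minimal sketch of the server-side check, using only the Rust standard library. Every name in it (`TokenMinter`, `WorkerAuth`, `FakeMinter`, `handle_token_request`) is hypothetical and invented for this example; none of it is buildomat's actual API. It assumes the server keeps a table of live worker credentials and that something behind the `TokenMinter` trait can produce a fresh GitHub token on demand.

```rust
use std::collections::HashMap;

/// Hypothetical source of fresh GitHub tokens. In the issue, only
/// Wollongong holds the credentials needed to mint one.
pub trait TokenMinter {
    fn mint(&self, job_id: &str) -> Result<String, String>;
}

/// Stand-in minter for illustration; a real one would call the GitHub API.
pub struct FakeMinter;

impl TokenMinter for FakeMinter {
    fn mint(&self, job_id: &str) -> Result<String, String> {
        Ok(format!("token-for-{}", job_id))
    }
}

/// Hypothetical table mapping each live worker credential to its job.
pub struct WorkerAuth {
    live: HashMap<String, String>,
}

impl WorkerAuth {
    pub fn new() -> Self {
        WorkerAuth { live: HashMap::new() }
    }

    /// Record the ephemeral credential negotiated for a worker's job.
    pub fn register(&mut self, bearer: &str, job_id: &str) {
        self.live.insert(bearer.to_string(), job_id.to_string());
    }

    /// Forget the credential once the job completes.
    pub fn recycle(&mut self, bearer: &str) {
        self.live.remove(bearer);
    }

    pub fn job_for(&self, bearer: &str) -> Option<&str> {
        self.live.get(bearer).map(|s| s.as_str())
    }
}

/// Mint a token only for a caller that presents a live worker credential.
pub fn handle_token_request(
    auth: &WorkerAuth,
    minter: &dyn TokenMinter,
    bearer: &str,
) -> Result<String, String> {
    let job_id = auth
        .job_for(bearer)
        .ok_or_else(|| "unknown or recycled worker credential".to_string())?;
    minter.mint(job_id)
}
```

Because the token is minted only when the running job asks for it, its one-hour lifetime starts at execution time rather than at scheduling time.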

Unfortunately this is complicated by the existing architecture: the agent and the core API are details of a layer below Wollongong in the stack, but Wollongong is the only place with the credentials and the capability to generate the token. There are presently no calls from the core API up to Wollongong, and I would like to keep it that way if possible.

One possible design is a new temporary key-value dictionary attached to each job that we could allow the job owner to update over time. Wollongong could provide a token in that dictionary initially on job scheduling, and then continue to renew and replace it as it expires. When the job eventually runs, that key-value dictionary could be made generically available to the agent for parameter expansion in the job scripts, or to be injected into the environment, etc.
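To make the key-value idea concrete, here is a minimal sketch of such a per-job dictionary, again using only the Rust standard library. The names (`JobStore`, `StoreValue`, `put`, `get`, `needs_renewal`) and the 45-minute renewal threshold in the usage note are assumptions made for illustration, not the design that was actually implemented. The sketch shows the two properties described above: only the job owner may write, and the owner can tell when a value is old enough to need replacing.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// One entry in the hypothetical per-job dictionary.
#[derive(Debug, Clone)]
pub struct StoreValue {
    pub value: String,
    /// Marks values (like tokens) that should not appear in logs.
    pub secret: bool,
    pub updated: Instant,
}

/// Hypothetical key-value dictionary attached to a single job.
#[derive(Debug)]
pub struct JobStore {
    owner: String,
    values: HashMap<String, StoreValue>,
}

impl JobStore {
    pub fn new(owner: &str) -> Self {
        JobStore { owner: owner.to_string(), values: HashMap::new() }
    }

    /// Insert or replace a value; only the job owner may write.
    pub fn put(
        &mut self,
        caller: &str,
        key: &str,
        value: &str,
        secret: bool,
        now: Instant,
    ) -> Result<(), String> {
        if caller != self.owner {
            return Err(format!("{} does not own this job", caller));
        }
        self.values.insert(
            key.to_string(),
            StoreValue { value: value.to_string(), secret, updated: now },
        );
        Ok(())
    }

    /// Read path for the agent, e.g. for parameter expansion or the
    /// job environment.
    pub fn get(&self, key: &str) -> Option<&str> {
        self.values.get(key).map(|v| v.value.as_str())
    }

    /// True if the key is missing or was last written at least
    /// `max_age` ago, meaning the owner should replace it.
    pub fn needs_renewal(&self, key: &str, max_age: Duration, now: Instant) -> bool {
        match self.values.get(key) {
            Some(v) => now.saturating_duration_since(v.updated) >= max_age,
            None => true,
        }
    }
}
```

In this sketch the owner would poll `needs_renewal` with a threshold comfortably below the one-hour token lifetime (say, 45 minutes) and call `put` with a fresh token whenever it returns true, so the value the agent eventually reads has not expired. All calls flow from the owner down to the store, which preserves the property that the core API never calls up to Wollongong.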

jclulow commented 1 year ago

I believe I have solved this now, through: