Move worker native environment setup into a privileged task that runs on a base image with worker installed

petemoore commented 6 years ago

So I woke up with jetlag with this crazy idea in my head.

Currently our AMI generation process for generic-worker relies on mechanics provided by EC2 to bootstrap our instances. We have some magic to get logs from this process into taskcluster logs, and it is tricky for the process which snapshots the instance to produce the AMI to know whether the installation steps were successful or not. Also the code to set up the environment needs to install the worker itself, which means it is possible for this to go wrong and to produce AMIs for workers that don't actually have the worker installed and functioning on them.

This idea I had was pretty vague, but I'm dumping it here initially so that I can iterate on it, and others can join in with the conversation if they wish.

Imagine that instead we would bootstrap base images of a particular OS with generic-worker. When Bug 1439588 - Add feature to support running Windows tasks in an elevated process (Administrator) lands, we could introduce a mechanism whereby a user with the necessary scopes submits a task that installs packages and performs environment setup directly as Administrator, after which a cloud-specific mechanism snapshots the instance and produces an image for the worker type. People could also use this mechanism to customise worker types.

The workflow could look something like this:

1) Base image worker types are created, running generic-worker, with nothing else installed on them. The worker type name may be something like "win2012r2-base" if it is, say, running Windows Server 2012 R2.
2) A task is submitted to win2012r2-base with a scope-protected feature enabled in the task payload ("features": [ "createWorkerType" ]), which requires the task to carry a matching scope ("scopes": ["generic-worker:create-worker-type:<provisioner>/<workerType>"]).
3) This task installs tools etc, and either resolves successfully if all went well, or fails if it did not.
4) The task could publish some artifacts with metadata about the changes it applied.
5) Instead of the worker rebooting as normal after task completion, it simply shuts down.
6) Something external waits for the shutdown, takes an image, processes the task artifacts with the metadata, and creates (or updates) a worker type definition to use the newly generated image(s).
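
To make the task submission and scope check described above a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the createWorkerType feature and its scope are only proposed in this issue, the function names are hypothetical, and "example-provisioner" and "win2012r2-custom" are made-up values. The only behaviour borrowed from Taskcluster is that a granted scope ending in `*` acts as a prefix wildcard.

```python
# Hypothetical sketch: the "createWorkerType" feature and its scope are only
# proposed in this issue; they are not an existing generic-worker API.
REQUIRED_SCOPE = "generic-worker:create-worker-type:{provisioner}/{worker_type}"


def build_image_task(provisioner, base_worker_type, target_worker_type, command):
    """Build a task definition that would run on the base image worker type."""
    return {
        "provisionerId": provisioner,
        "workerType": base_worker_type,
        "scopes": [
            REQUIRED_SCOPE.format(
                provisioner=provisioner, worker_type=target_worker_type
            )
        ],
        "payload": {
            "features": ["createWorkerType"],
            "command": command,
        },
    }


def scope_satisfies(granted, required):
    """Taskcluster-style match: a trailing '*' in the granted scope is a prefix wildcard."""
    if granted.endswith("*"):
        return required.startswith(granted[:-1])
    return granted == required


def may_create_worker_type(task, provisioner, target_worker_type):
    """Return True if the task enables the feature and holds the scope for it."""
    if "createWorkerType" not in task["payload"].get("features", []):
        return False
    required = REQUIRED_SCOPE.format(
        provisioner=provisioner, worker_type=target_worker_type
    )
    return any(scope_satisfies(s, required) for s in task.get("scopes", []))


# Illustrative values only.
task = build_image_task(
    "example-provisioner", "win2012r2-base", "win2012r2-custom", ["install.bat"]
)
print(may_create_worker_type(task, "example-provisioner", "win2012r2-custom"))  # True
```

In this sketch the worker would refuse to snapshot unless `may_create_worker_type` returns True for the worker type being created, which keeps authorization inside the existing scope model.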

I'm not 100% sure about all this at the moment; this is very much a brainstorming exercise around the idea. The objectives I was trying to achieve were:

1) To have some neutral mechanism that is cloud-independent, for executing install steps on a worker environment. Currently we rely on passing userdata to the instance and on the ec2 config service reading and executing it. Using this proposal, we'd have a cloud-agnostic way of applying installation steps.
2) It never sat well with me that defining a worker environment involved installing the generic-worker and configuring it too. This separates the two responsibilities rather well, so installation of toolchains etc is defined/implemented in a different context to the setup and management/configuration of the worker.
3) It makes it simple to see whether an environment bootstrapping process was successful or not, since the task can simply resolve with success or failure depending on any checks it makes. Previously we had no easy way to communicate from the installation process whether the bootstrapping process was successful or not, and so AMIs could get produced even if the installation steps had failed. This was because the process that created the AMIs ran on a separate machine and had no easy way to get context back from the process running in the ec2 config service.
4) It hopefully introduces a nice separation that will make it easier in future for us to move the environment bootstrapping in-tree into gecko.
5) It uses existing frameworks for authorizing who/what can create AMIs (taskcluster auth), tracking logs from the process (task artifacts), and meeting demands of running at scale by running taskcluster tasks.
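
Objective 3 relies on the task resolving with success or failure based on its own checks. A minimal sketch of such a check in Python, assuming the check is simply "are the expected executables on PATH"; the function names and the idea of checking PATH are illustrative assumptions, not part of the proposal:

```python
import shutil


def missing_tools(required, which=shutil.which):
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in required if which(tool) is None]


def bootstrap_exit_code(required, which=shutil.which):
    """Return 0 if every required tool is present, 1 otherwise."""
    missing = missing_tools(required, which)
    if missing:
        print("bootstrap check failed, missing: " + ", ".join(missing))
        return 1
    print("bootstrap check passed")
    return 0
```

The last command of the install task could call `sys.exit(bootstrap_exit_code([...]))` with whatever tools the environment is supposed to provide. A non-zero exit code fails the task, so whatever takes the snapshot only needs to look at how the task resolved.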

djmitche commented 6 years ago

How would this work in a redeployability scenario? That is, how would example.com use this to build their builders?

Related, Gecko is moving toward defining the workerTypes (and all sorts of roles, hooks, and so on) using information in the ci-admin and ci-configuration repositories, so rather than this process editing workerType definitions in the provisioner directly, it would be better for it to output information to ci-configuration, where it can then be applied using the usual ci-admin process.
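
One way to read this suggestion: the external process would emit a reviewable record rather than calling the provisioner. The Python sketch below is purely illustrative; the field names and the JSON layout are hypothetical and do not reflect the actual ci-configuration file format.

```python
import json


def proposed_worker_type_change(worker_type, image_ids, task_id):
    """Render a reviewable record instead of editing the provisioner directly."""
    record = {
        "workerType": worker_type,
        "images": image_ids,      # hypothetical: mapping of region to image id
        "builtByTask": task_id,   # hypothetical: task that produced the image
    }
    # Sorted keys keep the output stable, so repeated runs produce clean diffs.
    return json.dumps(record, indent=2, sort_keys=True) + "\n"
```

The resulting text could be committed to a configuration repository and applied through the usual review process, which keeps the image-building task itself free of any permission to change worker type definitions.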

bhearsum commented 3 months ago

Some discussion about this happened in https://github.com/mozilla-releng/releng-rfcs/issues/47, where an alternate (or perhaps precursor) idea is being talked about. I think there are a couple of points that are worth bringing over here explicitly, since they relate to this idea.

A comment from me, replying to another comment, expressing concern about adding image building to the critical path of builds/shipping:

Having full control of image building from within each project sounds wonderful! Is this something the TC team is planning to work towards? It might still be worth tackling some of this in the meantime as we're feeling the pain and need some kind of improvement here soon :) I'm not sure why this approach wouldn't scale however, to me it seems like it could almost be entirely automated other than needing to merge PRs, but perhaps I'm missing something. I agree that having full in-project control is the ideal, I guess I'm just struggling to see a concrete path that gets us there.

While not necessarily a blocker, a potential downside to doing this is that it would put image building in the critical path of running builds/tests. Obviously this is already the case for docker images - but depending on how much time it adds to the critical path it may have some significant downside. This is one of the big upsides about putting a reference to an already existing image in the tree - you gain the in-tree control over what things are built on without putting anything new in the critical path.

(I also agree that the ideal state is everything in the tree however.)

And a note from @moz-hwine about security boundaries changing:

/me notes that "decentralizing" some of this does change security boundaries, at least for the Fx CI case. I.e. RRA at some point, please.

bhearsum commented 3 months ago

(This is not meant to be stop energy, I just want to make sure that crucial points aren't missed as we continue talking about and planning image building improvements.)

taskcluster / taskcluster-rfcs

Move worker native environment setup into a privileged task that runs on a base image with worker installed #122