Worker pool image automation

ahal commented 6 months ago

Following a monopacker cross-training session with aerickson, Ben and I had a chat around potential avenues for automation. I wanted to jot down some of the ideas while they are fresh in my mind. We can figure out how to put them into action via proper RFCs later.

I'll try to order them from easiest to wildest.

Automate monopacker builds

One of the pain points aerickson mentioned is that it's difficult to tell whether you've broken another build while working on the current one (the builds often share the same scripts). A simple first step could be to have tasks that build each image definition (without publishing), so it's clear when something breaks and what broke it.

Automate image dependency upgrades

The next step could be to have cron tasks that look for new versions of certain dependencies, such as Taskcluster, generic-worker, and worker-runner. These tasks could build and publish new images, and we could update pools to use them at our leisure.
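As a rough illustration of the check such a cron task might perform, here is a minimal sketch that compares pinned versions against the latest known releases. The function names, the dotted version format, and the sample data are all assumptions made for illustration; nothing here reflects how monopacker actually stores its pins.

```python
from typing import Dict, List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "v44.2.0" or "44.2.0" into (44, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def outdated_dependencies(pinned: Dict[str, str], latest: Dict[str, str]) -> List[str]:
    """Return the names of dependencies whose latest release is newer than the pin."""
    return sorted(
        name
        for name, current in pinned.items()
        if name in latest and parse_version(latest[name]) > parse_version(current)
    )


# Hypothetical data: a real cron task would read the pins from the image
# definitions and look up the latest releases upstream.
pinned = {"generic-worker": "44.2.0", "worker-runner": "44.2.0"}
latest = {"generic-worker": "44.10.1", "worker-runner": "44.2.0"}
print(outdated_dependencies(pinned, latest))
```

Comparing parsed tuples rather than raw strings matters here, since "44.10.1" sorts before "44.2.0" as plain text.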

aerickson mentioned image storage as a potential concern. We may need a strategy for cleaning up old, unused images.
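One possible cleanup rule, sketched under assumptions: delete an image only if no pool references it and it is older than some retention window. The 90-day default, the function name, and the data shapes are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Set


def images_to_delete(
    created: Dict[str, date],
    in_use: Set[str],
    today: date,
    keep_days: int = 90,  # hypothetical retention window
) -> List[str]:
    """Return images that are both unreferenced and older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(
        name
        for name, created_on in created.items()
        if name not in in_use and created_on < cutoff
    )


# Hypothetical inventory of published images and the ones pools still use.
created = {
    "image-a": date(2023, 1, 10),
    "image-b": date(2023, 1, 20),
    "image-c": date(2024, 3, 1),
}
in_use = {"image-b"}
print(images_to_delete(created, in_use, today=date(2024, 3, 15)))
```

Requiring both conditions keeps old images that are still in use, as well as recent images that no pool has adopted yet.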

Automate worker pool upgrades

The next step would be to automate using these images in the various pools. We could have a cron task that runs out of ci-config, looks for new images, and then creates a pull request to update them (bear in mind ci-config is moving to GitHub).

I don't think we would want these changes to go live automatically, but automated PRs or phab revisions would be very welcome!

In-repo image upgrades

This is the pinnacle of automation. The same cron task from the previous section (in ci-config) would run and look for newer versions of the images. The difference is that, while the images themselves would still be defined in ci-config, the image each pool uses would be defined in-repo.

The cron task would iterate through all projects, look at what images the repo is using, and if there is a newer one, create a PR to update it. Maintainers for each repo could decide to merge or close the PR.
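The core of that loop could look something like the sketch below, which works out which repos are behind and what they should move to. It assumes a hypothetical naming scheme of `<family>-<YYYYMMDD>` for images, and it leaves out the actual PR (or phab revision) creation entirely.

```python
from typing import Dict, List, Tuple


def split_image(image: str) -> Tuple[str, str]:
    """Split "family-YYYYMMDD" into (family, stamp); the naming scheme is assumed."""
    family, stamp = image.rsplit("-", 1)
    return family, stamp


def newest_per_family(images: List[str]) -> Dict[str, str]:
    """Pick the most recent image for each family, comparing date stamps."""
    newest: Dict[str, str] = {}
    for image in images:
        family, stamp = split_image(image)
        if family not in newest or stamp > split_image(newest[family])[1]:
            newest[family] = image
    return newest


def pending_updates(repo_images: Dict[str, str], available: List[str]) -> Dict[str, str]:
    """Map each repo that is behind to the newest image in the same family."""
    newest = newest_per_family(available)
    updates: Dict[str, str] = {}
    for repo, image in repo_images.items():
        family, stamp = split_image(image)
        candidate = newest.get(family)
        if candidate is not None and split_image(candidate)[1] > stamp:
            updates[repo] = candidate
    return updates


# Hypothetical data; a real task would read this from ci-config and each repo.
available = ["linux-20240101", "linux-20240301", "win-20240215"]
repo_images = {
    "repo-a": "linux-20240101",
    "repo-b": "linux-20240301",
    "repo-c": "win-20240215",
}
print(pending_updates(repo_images, available))
```

Each entry in the result would correspond to one PR for the repo's maintainers to merge or close.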

This has many benefits:

- Image updates can be backed out if they cause problems
- Image updates can be tested directly in pull requests for the various repos
- Different repos can more easily use different versions of the image

It's worth noting that Gecko won't be using pull requests, so we'd need to submit a phab revision in that case.

ahal commented 6 months ago

Also, we can probably apply similar ideas to our Azure images.

ahal commented 6 months ago

To expand a bit on how "in-repo image upgrades" would work, I think ci-config would need to generate pools for every available image, as well as a latest pool that always points at the most recent image. Then repos would simply set the worker-type to either latest or one with a date in the name. At that point, you update images by updating pools.
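A minimal sketch of that pool generation, assuming images are keyed by an ISO date string and pools are named `<prefix>-<date>`; both conventions, and the function name, are hypothetical rather than anything ci-config does today.

```python
from typing import Dict


def generate_pools(prefix: str, images: Dict[str, str]) -> Dict[str, str]:
    """Build one pool per image, plus a "latest" pool aliasing the newest image.

    `images` maps an ISO date string (e.g. "2024-03-01") to an image identifier.
    """
    pools = {f"{prefix}-{stamp}": image for stamp, image in images.items()}
    if images:
        # ISO dates sort chronologically as plain strings, so max() finds the newest.
        pools[f"{prefix}-latest"] = images[max(images)]
    return pools


# Hypothetical image identifiers.
images = {"2024-01-15": "img-aaa", "2024-03-01": "img-bbb"}
for name, image in sorted(generate_pools("b-linux", images).items()):
    print(name, "->", image)
```

A repo that pins `b-linux-2024-01-15` stays put until someone changes it, while one that uses `b-linux-latest` follows every new image.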

To avoid too many unused pools lying around, we could have a check that inspects what pools each configured project is using, and warn when there are unused pools / images.
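Such a check could be as simple as a set difference between the pools that are defined and the pools that projects reference. The sketch below assumes that the per-project usage has already been collected; the names and data are hypothetical.

```python
from typing import Dict, List, Set


def unused_pools(defined: Set[str], used_by_project: Dict[str, Set[str]]) -> List[str]:
    """Return defined pools that no configured project references."""
    used: Set[str] = set().union(*used_by_project.values())
    return sorted(defined - used)


# Hypothetical pools and per-project usage.
defined = {"b-linux-2024-01-15", "b-linux-2024-03-01", "b-linux-latest"}
used_by_project = {
    "project-a": {"b-linux-latest"},
    "project-b": {"b-linux-2024-03-01"},
}
for pool in unused_pools(defined, used_by_project):
    print(f"warning: pool {pool} is not used by any project")
```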

petemoore commented 6 months ago

I don't think this approach scales, and it gives too much control to ci-config, which is outside of project authority. I would prefer an approach where projects have full autonomy, something like how in-tree Dockerfiles are used for building Docker images under the full autonomy of the development team. I think we need to provide APIs and services that allow teams to build their own images, rather than owning the image configurations ourselves and allowing people to use what we created.

petemoore commented 6 months ago

Something along the lines of https://github.com/taskcluster/taskcluster-rfcs/issues/122

ahal commented 6 months ago

Having full control of image building from within each project sounds wonderful! Is this something the TC team is planning to work towards? It might still be worth tackling some of this in the meantime as we're feeling the pain and need some kind of improvement here soon :)

I'm not sure why this approach wouldn't scale however, to me it seems like it could almost be entirely automated other than needing to merge PRs, but perhaps I'm missing something.

I agree that having full in-project control is the ideal, I guess I'm just struggling to see a concrete path that gets us there.

markcor commented 6 months ago

> Also, we can probably apply similar ideas to our Azure images.

With the Azure images, much of this is in line with what we are doing and planning on doing.
https://github.com/mozilla-platform-ops/worker-images

bhearsum commented 6 months ago

> Having full control of image building from within each project sounds wonderful! Is this something the TC team is planning to work towards? It might still be worth tackling some of this in the meantime as we're feeling the pain and need some kind of improvement here soon :)

> I'm not sure why this approach wouldn't scale however, to me it seems like it could almost be entirely automated other than needing to merge PRs, but perhaps I'm missing something.

> I agree that having full in-project control is the ideal, I guess I'm just struggling to see a concrete path that gets us there.

While not necessarily a blocker, a potential downside to doing this is that it would put image building in the critical path of running builds/tests. Obviously this is already the case for Docker images, but depending on how much time it adds to the critical path, it may be a significant downside. This is one of the big upsides of putting a reference to an already existing image in the tree: you gain in-tree control over what things are built on without putting anything new in the critical path.

(I also agree, however, that the ideal state is everything in the tree.)

petemoore commented 6 months ago

> I'm not sure why this approach wouldn't scale however, to me it seems like it could almost be entirely automated other than needing to merge PRs, but perhaps I'm missing something.

Apologies, it is certainly a more automated approach than we currently have and a definite improvement. By not scaling, I really mean that any time a human from a different team needs to intervene to approve something (such as merging a PR), we potentially block each other. We don't have 24/7 coverage in teams, so people will invariably need to wait. The more images that are created and managed, the more human resources you need to handle the requests. You are always constrained by the number of people that can respond to requests. Agreed, it is a lot better than the current approach, but I think it would be good to aim for one that doesn't require any central approval, so teams can have full autonomy.

ahal commented 6 months ago

I think a key here is that it would be project maintainers merging the PRs, not releng or relops (well we would merge the PRs for new images, but not for new worker pools).

Tbh, I really don't feel comfortable about this stuff just automatically going live into production without any human intervention.

Edit: Re-reading your comment, I don't think that's what you're suggesting, and I think you misunderstood my proposal. I'm proposing we move away from centralized gatekeepers here. See this line from the initial comment:

> Maintainers for each repo could decide to merge or close the PR.

hwine commented 6 months ago

/me notes that "decentralizing" some of this does change security boundaries, at least for the Fx CI case. I.e. RRA at some point, please.

mozilla-releng / releng-rfcs