[Umbrella Issue] Create a Image Promotion process

dims commented 5 years ago

Split from https://github.com/kubernetes/k8s.io/issues/153 (see that for some context)

cc @javier-b-perez @mkumatag @listx

dims commented 5 years ago

Notes from Nov 28th meeting

AI: @dims to follow up with Javier (current google promoter KEP)
2 issues here
storage/serving for arbitrary artifacts (including cloud-local mirrors)
Container registry for official images (including mirrors?)
AI: @bburns to contemplate registry mirroring, attestation, etc
AI: @justinsb to collab with @bburns
Undecided: do sub-projects push to one-true staging, and trust not to step on each other or push to per-sub areas?

dims commented 5 years ago

listx commented 5 years ago

I'm planning to OSS a proof of concept tool that understands how to reconcile a registry manifest (basically a list of images with digest/tag mappings) from a source registry to a destination registry. This should be available in January 2019. Once that tool is open sourced, I can wire up a basic promotion process for a number of images into a test registry to demonstrate how it will work.

For now the tool prototype deals with 2 registries (source vs dest, aka "staging" vs "prod"), but it is trivial to extend it to deal with more than 1 destination (so that we can have mirrors for example). After it is open sourced we can have more discussions about it in its own repository (or continue in this issue, I guess?).

I'll be offline the rest of this month so I'll see you guys in January!

spiffxp commented 5 years ago

/assign @dims Seen as the "larger" umbrella issue that could maybe subsume https://github.com/kubernetes/k8s.io/issues/158 (which assumes we need GCR repos per project, maybe we find a way to promote images that doesn't require this)

listx commented 5 years ago

Happy New Year all!

The tool I've worked on was submitted for internal review yesterday (as part of Google's open sourcing process). After it gets approved, I will create a public demo of it in action and update this issue accordingly.

dims commented 5 years ago

very cool @listx thanks for the update and a very happy new year to you as well.

listx commented 5 years ago

Update: The project has been approved and the Container Image Promoter now lives in https://github.com/GoogleCloudPlatform/k8s-container-image-promoter. Work now begins in creating a public demo of it in action (I plan to devote cycles on this to get the demo working by the end of Q1 this year).

Once the demo is complete, I think it's just a matter of using it as a template for migrating some of the official images from gcr.io/google-containers to another (probably CNCF-owned) GCR. I just imagine a future where the K8s container image release process happens more transparently for the community. Hopefully the image promotion process is a solid step in that direction.

listx commented 5 years ago

The design doc for the demo around this can be found here: https://docs.google.com/document/d/1WGFt5ck_XGf71PO4c87UMPVU_4Q7AV-7tRV4Z6wmZL4/edit?usp=sharing

listx commented 5 years ago

Another update: I have a demo Prow cluster (http://35.186.240.68/) that's listening to all changes to the manifest in https://github.com/cip-bot/cip-manifest. That repo houses a manifest that is obviously only for demo purposes, but if you have a look at this PR: https://github.com/cip-bot/cip-manifest/pull/2 you can see how a proposed changed to the manifest will trigger Prow jobs that perform a dry run of the promoter; merging that PR resulted in the promoter running for real (no dry run) and modifying the destination registry.

I would like to have https://github.com/GoogleCloudPlatform/k8s-container-image-promoter/issues/7 fixed before we think about really using this for existing (large-ish?) GCRs. It's not a big show-stopper though.

So basically like 90% of the pieces are there --- we just need to migrate the Prow job configs to either kubernetes/test-infra or somewhere else (the Prow jobs need to run on someone's cluster) and set up the right service-account permissions. Not sure where I should upload these Prow jobs --- maybe kubernetes/test-infra? @BenTheElder wdyt?

dims commented 5 years ago

@listx Nice!

+1 to add jobs to kubernetes/test-infra

Also, can we run the garbage collector in dry run mode to check what if any will get wiped out in production registry before turning it on?

dims commented 5 years ago

/assign @thockin @BenTheElder

listx commented 5 years ago

@dims After we make garbage collection aware of manifest lists, sure (otherwise it will print a bunch of false positives about needing to delete tagless images that are referenced by manifest lists). The more I think about it, the more I want to just separate GC entirely from promotion. Less complexity per execution of the promoter is a good thing.

And also, we could make GC much smarter and safer, by "promoting" to a "graveyard" GCR, in case anyone deletes a tag from the manifest by accident. Just an idea.

Anyway, we could also just disable garbage collection for the time being as it's not a critical feature as far as promotion is concerned.

dims commented 5 years ago

@listx makes sense "disable garbage collection for the time being as it's not a critical feature as far as promotion" +1

I like the graveyard GCR too :)

BenTheElder commented 5 years ago

kubernetes/test-infra SGTM, I would poke @fejta about our strategy for "trusted" jobs, as this should be one.

+1 to dry-run first, not sure I understand the graveyard GCR 🙃

dims commented 5 years ago

update on dockerhub integration https://github.com/GoogleCloudPlatform/k8s-container-image-promoter/issues/9

listx commented 5 years ago

+1 to dry-run first, not sure I understand the graveyard GCR

I was thinking that the graveyard GCR could host images that were deemed OK to delete (permanently) from a prod GCR. Thinking about this some more, though, maybe it's cleaner if we just implement soft-deletion (make the promoter "delete" images) by moving images to a different path within the same GCR.

Anyway the idea for keeping things around in the "graveyard" was to make sure we can undo image deletions --- just in case we accidentally delete an image for whatever reason.

hh commented 5 years ago

Action Items from February 20th Meeting:

[ ] @thockin / @listx Promoter + Prow Job
[ ] @thockin to provide a list of staging repos / groups who own them
[ ] @thockin Set up GCR Repo (scripted last week for staging)
[ ] @dims yaml file pushed to k8s.io repo
[ ] @listx Document procedures for eventually moving the image promoter repo to k8s-test-infra - Currently repo sits in Google GitHub org repo

I'm willing to help / coordinate with any of the above.

javier-b-perez commented 5 years ago

I have some security concerns about running this in prow. @thockin @listx will the prow job promote the container images? this mean that Prow will require write access in the GCR. Do we trust that no one else can "promote" images using prow?

fejta commented 5 years ago

IMO these image promotion jobs should run in their own security domain:

Separate from CI jobs which run arbitrary presubmit code after an /ok-to-test
Separate from trusted prow binaries which may write access to many kubernetes repos.

AKA we trust them more than standard jobs (only run on merged code) and less than prow itself (approvals are not restricted to prow oncall).

A good way to solve these issues would be for the wg-k8s-infra team to:

create a cluster dedicated to these running these jobs
give the jobs the necessary credentials to promote these images
configure prow to schedule these jobs in this new cluster (we already support and use this functionality).

Another idea might be to follow the pattern we use to have prow update itself:

A postsubmit job which builds and pushes new images
A daily job which creates a PR to promote prow to the latest images
- Nothing changes until prow oncall approves this PR
- PR description a diff of what changed between the two versions (example).
A postsubmit job which deploys the new images after merging the PR

That way the system is fully automated, but gated on someone trusted approving the PRs before they are used in production.

javier-b-perez commented 5 years ago

I was thinking in something simple (no infra/tools to maintain, runs on demand), just GCB + triggers: https://cloud.google.com/cloud-build/docs/running-builds/automate-builds

thockin commented 5 years ago

I am ambivalent about mechanism. Simple is good, but it needs to be debuggable and transparent.

On Wed, Feb 20, 2019 at 1:42 PM Javier B Perez notifications@github.com wrote:

I was thinking in something simple (no infra/tools to maintain, runs on demand), just GCB + triggers: https://cloud.google.com/cloud-build/docs/running-builds/automate-builds

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/kubernetes/k8s.io/issues/157#issuecomment-465766250, or mute the thread https://github.com/notifications/unsubscribe-auth/AFVgVGWy0CPBlpv91WyXw0_6lFki_S4iks5vPcFcgaJpZM4ZCsjM .

fejta commented 5 years ago

I would consider prow a better community choice than GCB:

Integrates with the rest of our tooling -- aka logs/metrics/failures will show up in gubernator, spyglass, testgrid and bigquery
Community is already familiar with prow (consistent rules for how to rerun, debug, etc)
ProwJobs can use the same knative-build spec GCB uses

The advantage of GCB is that it runs on a VM with auto-loading service account credentials. You can get the same effect on GKE by adding a node pool to the cluster with the necessary scopes gcloud container node-pools create --service-account=foo --scopes=bar and ensuring that these jobs always schedule on that pool and no one other jobs schedule in that pool.

listx commented 5 years ago

As a reminder, there is a design doc for this --- the relevant section is "CI Runtime" under "1. CI for CIP Manifest (demo)"). The demo is live at http://35.186.240.68/ --- for solving this issue itself, the current momentum is to just transfer the Prow jobs to github.com/kubernetes/test-infra.

Let's be clear that there is no technical winner of Prow vs GCB. Both are more than adequate for promoting images. I think the main reason to go with Prow is just the sheer number of features it has (such as the ones @fejta mentioned).

Another benefit of Prow is that you can have OWNERS approve certain PR merges --- this means that we can have more transparency/granularity about who merges PRs (e.g., have different OWNERS for different promotion manifest YAMLs, thus requiring different owners depending on the manifest YAML). Prow has already solved the problem about having more granular permissions around PR merges, so it makes sense to ride on the shoulders of that work.

listx commented 5 years ago

Update: Creating a new Prow cluster will make things simpler from a security standpoint. So https://github.com/kubernetes/test-infra/pull/11414 is actually unnecessary. Creating a new cluster makes things a lot simpler for me because I can just re-do what I did to set up the demo, just under a different GCP project.

The next task is to try to create a new Prow cluster in a GCP project owned by the CNCF. @thockin which one do you suggest? And can I get access to that GCP project so that I can set it up?

javier-b-perez commented 5 years ago

I have another idea. Prow runs the pre-submit with the tool in dry_run mode. Prow won't need write access to production registries. If someone (intentionally or accidentally) modify the prow job to run in 'real' mode, it will fail to push. This mean we can run in the current prow pool. This mean we still need to run the tool in 'real' mode as post-submit. In this case we can use GCB + Triggers.

Workflow:

Developers send changes for review
Prow will run pre-submit job and handle the labels/approvals of the change
Once change is merge into 'master' is up to GCB to run the real push.

listx commented 5 years ago

But having 2 different systems would be confusing, wouldn't it?

Some other observations: (A) I'd like to have a testgrid entry for all promoter runs, with and without the -dryrun flag. Using Prow for both type of runs would make adding testgrid entries trivial --- not sure what the equivalent would be for GCB runs.

(B) Eventually we want the promoter to only run on deltas of the manifest. GCB currently doesn't give you a full git clone (it's an extracted tarball of your repo without the .git folder), so you can't do deltas without doing extra git-clone logic yourself. Running only on deltas is important for the real push, because we want to do the minimal amount of work when touching the prod registry.

(C) GCB triggers work by running a cloudbuild.yaml file in the repo that is being watched. So in this case you would be triggering against the promoter manifest at commit version X, but also the cloudbuild.yaml at this same version X. In the Prow-only solution, the Prow job config (Prow equivalent of cloudbuild.yaml) would live in the same repo as the promoter manifests (github.com/kubernetes/k8s.io), but the push logic wouldn't necessarily be in lockstep at the same commit version X as the manifest. I think this flexibility would be useful in the future.

listx commented 5 years ago

Another update: the Prow configs have been merged. We have to now test image promotion... for reals! I'm going to file a PR against github.com/kubernetes/k8s.io to add a test image and try to see what happens, unless someone beats me to the punch.

The testgrid entries for the production runs ({ci,post}-k8sio-cip) should be getting populated soon. The ci one runs daily so if nothing shows up by tomorrow, we have more things to fix.

Edit: added links

listx commented 5 years ago

Now that https://github.com/kubernetes/k8s.io/pull/220 is merged the promoter manifests in https://github.com/kubernetes/k8s.io/tree/master/k8s.gcr.io are now live. This means that the testgrid entries I mentioned earlier should start to be green.

There are some remaining features to implement, but they are not blockers to the promotion process itself.

Meanwhile maybe we can try to actually use the promoter? I.e., push images to the staging registries in https://github.com/kubernetes/k8s.io/tree/master/k8s.gcr.io and then modify the promoter manifest as PRs and see the jobs do their thing. It would help identify any bugs not seen during testing.

listx commented 5 years ago

@dims I think we can close this issue. I've done some additional testing and the promoter is working as expected. If there are any remaining bugs then it would be around the infra logic around using the promoter (the prow job definitions), not the promoter itself.

The next step would be to give interested parties write access to the staging repositories under https://github.com/kubernetes/k8s.io/tree/master/k8s.gcr.io. I've created a PR to get the ball rolling https://github.com/kubernetes/k8s.io/pull/230.

Please cc me for the first PR against one of the promoter manifests --- I'd like to assist in case of any hiccups along the way.

dims commented 5 years ago

/unassign

spiffxp commented 5 years ago

There was discussion of demoing something at the next wg-k8s-infra meeting, making a PR against one of the manifests. Can we do this live?

spiffxp commented 5 years ago

Doesn't have what we would call a lot of testing (integration, e2e, etc)

Are we comfortable going live without that level of testing...

the results of it going haywire not great
but the internal version is sort of live

We can do demos of it, but we're still not ready to drive k8s.gcr.io from it? Use a staging repo and compare the results?

spiffxp commented 5 years ago

Looking to use this to migrate anything that lives in k8s.gcr.io first, before we think about using this fo subprojects

listx commented 5 years ago

Sorry for the long response, but I wanted to summarize my thoughts on this as best I can. Hopefully it sheds light for those people who have not been in the slack #wg-k8s-infra meetings.

There was discussion of demoing something at the next wg-k8s-infra meeting, making a PR against one of the manifests. Can we do this live?

We can indeed do this live today as described in https://github.com/kubernetes/kubernetes/pull/75115#issuecomment-486809383.

Doesn't have what we would call a lot of testing (integration, e2e, etc)

Are we comfortable going live without that level of testing...

the results of it going haywire not great

but the internal version is sort of live

If we want to make sure there is an e2e framework first, we should definitely hold off on closing this (I get the feeling that that is the consensus). The reason I suggested closing this issue was because the title is a bit misleading (seems to imply that there is no process, when as of today, a process does exist, though in limited scope).

We can do demos of it, but we're still not ready to drive k8s.gcr.io from it? Use a staging repo and compare the results?

Correct, we're not ready to drive k8s.gcr.io just yet. For driving k8s.gcr.io (the big one!), some additional things need to be done, such as grandfathering in all the existing images to the new gcr.io/k8s-gcr-prod registry. The additional things that need to be done are pretty much the same as described in https://github.com/kubernetes/kubernetes/pull/75115#issuecomment-486809383.

Looking to use this to migrate anything that lives in k8s.gcr.io first, before we think about using this fo subprojects

Well we already support the subprojects in k/k8s.io. If a PR lands in the k/k8s.io repo, when we merge it we will promote to gcr.io/k8s-gcr-prod. We have this mechanism set up because those subproject staging repos are empty and it makes it a nice environment for jumpstarting the promoter.

I'm just trying to see ways of doing things incrementally before launching it full throttle at scale. IIRC the reason we chose to support the promotion process those subprojects only in k/k8s.io (for now) was for incremental live usage.

claudiubelu commented 5 years ago

/cc @bclau

spiffxp commented 5 years ago

Gating on:

e2e testing of container image promoter
DR strategy

spiffxp commented 5 years ago

Migrating existing images:

copy all existing images to a "legacy" staging repo
will need to generate yaml for this
update promoter to generate this yaml file
prevent people from pushing old images / images to old locations (everything under k8s.gcr.io)
- 475 named repos, lots of images per repo
- promoter could do this by reverting old repos to known state
declare a moving
flip k8s.gcr.io vanity domain to point to new production repo

spiffxp commented 5 years ago

Possibly talk through DR strategy at contrib summit

spiffxp commented 5 years ago

currently converting cloudbuild jobs to prow jobs
but what level of e2e testing is necessary?
justin will be co-maintainer of container-image-promoter
how do we know if when the promoter code changes, that it doesn't accidentally wipe out the entire prod repo? or promote garbage? if it did, how would we recover?
can we reduce the scope of what the promoter does (ie: should we make deletion a thing that is done by hand only... historically we've never deleted anything)
could we focus on a specific sig or subproject vs. doing the whole thing at once? eg: images.k8s.io cname to k8s.gcr.io (no, because of TLS certificate)

spiffxp commented 5 years ago

It sounds like the container image promoter is really becoming more of a "container image and binary" promoter, or "artifact promoter"

spiffxp commented 5 years ago

https://docs.google.com/document/d/1PgLI7OCEd09qLLz9yGmY2KK0uxhROc9GGC-PkwgnKN4/edit - e2e testing plan https://github.com/kubernetes-sigs/k8s-container-image-promoter/pull/62 - PR for e2e testing

spiffxp commented 5 years ago

https://groups.google.com/forum/#!topic/kubernetes-dev/ZqdjjDIAISY - sent out e-mail describing how to add staging repos

@amy and @listx working through use of service account to get to point where they can start implementing e2e test plan as documented above

want to have promoter reconcile more than one manifest at a time

need to disable promoter from deleting images (move the code out so it's not possible)

restoration/recovery of images is still something we haven't specifically planned out - our current plan is documented at /dev/null

let's explore what our options are out there and decide if having a DR plan is a blocker

choose one subproject (someone not working on cip) to go through this whole flow

won't be able to use the vanity name until everything is flipped over

legacy images: bulk import into a legacy staging repo

spiffxp commented 5 years ago

Courtesy of @amy on #wg-k8s-infra:

Okay. Looks like these are the relevant remaining issues that block image promotion

https://github.com/kubernetes-sigs/k8s-container-image-promoter/issues/70

https://github.com/kubernetes-sigs/k8s-container-image-promoter/issues/69

https://github.com/kubernetes-sigs/k8s-container-image-promoter/issues/65

https://github.com/kubernetes-sigs/k8s-container-image-promoter/issues/64

https://github.com/kubernetes-sigs/k8s-container-image-promoter/issues/63

EDIT: Have someone from cluster-api to use the promoter: https://github.com/kubernetes/k8s.io/issues/300

spiffxp commented 5 years ago

e2e test cases still outstanding, and need to be turned on

thockin commented 5 years ago

Update: promoter works. Backup is in-progress. Auditing/alerting is in-progress. I am still hopeful we can go live before EOY'19

claudiubelu commented 5 years ago

FWIW, I've sent some PRs for having the Image Promoter also work for the kubernetes/kubernetes E2E test images:

https://github.com/kubernetes/kubernetes/pull/84058 https://github.com/kubernetes/k8s.io/pull/400 https://github.com/kubernetes/test-infra/pull/14833

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

listx commented 4 years ago

/remove-lifecycle stale

spiffxp commented 4 years ago

/unassign @BenTheElder @thockin I believe we can call this done once the vanity domain flip happens, and we're comfortable with the outcome. AFAIK @listx is the only person actively working on this

Expect the next flip to happen "soon", but whenever it happens we plan to schedule it on a Monday. It takes ~4 days for the flip to rollout, and we'd like to be able to catch any errors during working hours.

Currently blocked on: we tried doing the flip once and ran into issues that halted rollout. Currently looking to discover and untangle any (google internal) lingering hardcoded dependencies on google-containers. Once that's done we'll schedule/announce the next flip.

spiffxp commented 4 years ago

few days or weekish worth of work, but it's difficult to get priority for this amongst disparate teams, current estimate 25th

kubernetes / k8s.io

[Umbrella Issue] Create a Image Promotion process #157