kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

migrate away from `test-infra-trusted` build cluster #32432

Closed: ameukam closed this issue 3 months ago

ameukam commented 7 months ago

There are a few jobs running on the test-infra-trusted cluster that we should either migrate to k8s-infra-prow-build-trusted or remove:

ameukam commented 7 months ago

/assign @michelle192837
/sig testing

BenTheElder commented 7 months ago

ci-test-infra-update-slack-oncall

no point migrating this, we'll just shut it down when prow is migrated, and people can post in #testing-ops in Slack instead.

we should actually probably proactively stop advertising @test-infra-oncall to the broader project.

post-test-infra-upload-testgrid-config

.... uhhhh this one I'm not sure, because we have to be able to publish to testgrid's config bucket .... migrating testgrid is another fun topic

The image publishing jobs we should be able to move over.

michelle192837 commented 7 months ago

re: ci-test-infra-update-slack-oncall: Ah, that's easier then.

re: post-test-infra-upload-testgrid-config: I think this should be doable. I haven't gone through the full details, but IMO, because config merger already merges TestGrid configs from multiple locations, we can stand up a new config upload job in community-owned infra, verify that the config uploaded to the new location is the same as the old one, and then swap the config location used by the TestGrid instance overall.

BenTheElder commented 5 months ago

On the K8s infra side we're going to need a bucket for this to start then, cc @upodroid @ameukam for thoughts.

post-test-infra-push-git, post-test-infra-push-git-custom-k8s-auth

Not sure how these didn't wind up getting migrated yet ... looks like this is part of k8s-testimages https://github.com/kubernetes/k8s.io/issues/1523

I don't see evidence that we're actually using these images in Kubernetes and we should probably just delete them.

Prow has built-in known-hosts handling in clonerefs these days, I don't think we need these anymore.

michelle192837 commented 5 months ago

Sorry for the delay, I'm looking into this and some of the other unmigrated jobs today.

BenTheElder commented 5 months ago

With #32808 the list should be clearer now. A lot of these are related to running prow, so that's fine, but some are pushing images, and that's concerning: we should either eliminate or migrate them.

BenTheElder commented 5 months ago

here's one https://github.com/kubernetes/test-infra/pull/32812

BenTheElder commented 5 months ago

| File path | Job |
|---|---|
| config/jobs/kubernetes/test-infra/test-infra-periodics.yaml | job-migration-todo-report |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-autobump-prow-for-auto-deploy |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-autobump-prow |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-update-slack-oncall |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-branchprotector |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-label-sync |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-gencred-refresh-kubeconfig |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-rotate-legacy-default-build-sa-json-key |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-alpine |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-gcloud-terraform |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-git |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-git-custom-k8s-auth |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-deploy-prow |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-reconcile-hmacs |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-misc-images |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-kettle |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-bazel |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-gcb-docker-gcloud |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-test-gubernator |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-gencred |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-gencred-refresh-kubeconfig |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-upload-oncall |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-upload-testgrid-config |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-upload-boskos-config |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-cip-prow |

SIG ContribEx:

| File path | Job |
|---|---|
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-community-tempelis-apply |

Not on the trusted cluster, but here are the other non-migrated jobs with test-infra in the name (there could be more):

| File path | Job |
|---|---|
| config/jobs/kubernetes/test-infra/janitors.yaml | maintenance-pull-janitor |
| config/jobs/kubernetes/test-infra/janitors.yaml | maintenance-ci-aws-janitor |
| config/jobs/kubernetes/test-infra/janitors.yaml | maintenance-ci-janitor |

BenTheElder commented 5 months ago

Janitor jobs: won't be migrated, will be turned down.

post-test-infra-upload-oncall, ci-test-infra-update-slack-oncall: no need, this will be obsolete.

job-migration-todo-report: will be obsolete, also this isn't working correctly and we're just manually checking in the tool output, I'll clean this one up.

ci-test-infra-rotate-legacy-default-build-sa-json-key: will be obsolete

post-test-infra-upload-boskos-config: will be obsolete, we have a different boskos config in github.com/kubernetes/k8s.io for community boskos resources

post-test-infra-cip-prow: I deleted this in #32812

post-test-infra-push.* are concerning. post-test-infra-upload-testgrid-config will need migrating.

I'm guessing reconcile-hmacs needs to be considered as part of the control plane migration, and branchprotector definitely does.

BenTheElder commented 5 months ago

https://github.com/kubernetes/test-infra/pull/32814 will remove the job-migration-todo-report report job.

ci-test-infra-label-sync should be able to migrate to k8s-infra-prow-build-trusted without waiting for the rest of prow, but we might not have the right secrets available yet.

michelle192837 commented 5 months ago

> On the K8s infra side we're going to need a bucket for this to start then, cc @upodroid @ameukam for thoughts.
>
> post-test-infra-push-git, post-test-infra-push-git-custom-k8s-auth
>
> Not sure how these didn't wind up getting migrated yet ... looks like this is part of k8s-testimages kubernetes/k8s.io#1523
>
> I don't see evidence that we're actually using these images in Kubernetes and we should probably just delete them.
>
> Prow has built-in known-hosts handling in clonerefs these days, I don't think we need these anymore.

These are used as the base images for building Prow images (https://cs.k8s.io/?q=gcr.io%2Fk8s-prow%2Fgit&i=nope&files=&excludeFiles=&repos=). I think we can replace the git image with alpine, but git-custom-k8s-auth might need to stay?

michelle192837 commented 5 months ago

| Job |
|---|
| post-test-infra-push-alpine |
| post-test-infra-push-gcloud-terraform |
| post-test-infra-push-git |
| post-test-infra-push-git-custom-k8s-auth |
| post-test-infra-push-misc-images |
| post-test-infra-push-kettle |
| post-test-infra-push-bazel |
| post-test-infra-push-gcb-docker-gcloud |
| post-test-infra-push-test-gubernator |
| post-test-infra-push-gencred |

Several of these push images that aren't used and should be turned down (post-test-infra-push-test-gubernator, post-test-infra-push-bazel, post-test-infra-push-gcloud-terraform, post-test-infra-push-gencred).

michelle192837 commented 5 months ago

Discussed offline: for post-test-infra-push-git and post-test-infra-push-git-custom-k8s-auth, since we'll need to migrate the latter anyways, we can migrate the former at the same time, then see if we can replace the git image base with alpine instead.

BenTheElder commented 5 months ago

> then see if we can replace the git image base with alpine instead.

We should probably use something else. We generally prefer e.g. debian/distroless for Kubernetes base images, both for licensing reasons (alpine/busybox) and for alignment on patching, etc.

BenTheElder commented 4 months ago

I'm working on tempelis https://kubernetes.slack.com/archives/C4M06S5HS/p1719431441099159 https://github.com/kubernetes/test-infra/pull/32928

ameukam commented 4 months ago

Sorry for the late response. I can confirm that git-custom-k8s-auth is used by prow to authenticate to non-GKE clusters (currently it's only EKS)

upodroid commented 4 months ago

https://github.com/kubernetes-sigs/prow/blob/main/.ko.yaml

+1 for building a unified base image for prow that has git and the kubectl auth plugins for our cloud vendors.

upodroid commented 4 months ago

We can migrate that job to the community cluster and update the .ko.yaml references

BenTheElder commented 4 months ago

We can do something similar to the distroless-iptables image in k/release.

BenTheElder commented 4 months ago

tempelis will be done after #32946

BenTheElder commented 4 months ago

https://github.com/kubernetes/test-infra/pull/32948 will do label sync

michelle192837 commented 4 months ago

| File path | Job |
|---|---|
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-alpine |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-gcb-docker-gcloud |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-git |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-git-custom-k8s-auth |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-kettle |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-misc-images |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-upload-testgrid-config |

With the linked PRs, we should have a canary job for all these jobs. Once these are submitted and we have new images for all of them, I'll switch the relevant uses to use the k8s-staging-test-infra images instead, and turn down the old image pushing jobs.

(The TestGrid config switch is a bit more involved, but not by much. I just need to swap which config is referenced in the mergelists after verifying that the new config is the same as the old one and that config merger has permission to read from the new bucket. I'll look into that now.)

BenTheElder commented 4 months ago

After today's SIG meeting I eliminated the oncall update jobs (slack, GCS) #33083 #33084

We should probably pre-emptively migrate ci-test-infra-branchprotector to the new trusted cluster.

BenTheElder commented 4 months ago

migrating branch protector looks straightforward, will send a PR in a little bit.

BenTheElder commented 4 months ago

https://github.com/kubernetes/test-infra/pull/33098 takes care of the branch protector.

That leaves:

So when we move prow we'll also have a small list of jobs to disable and we should probably prepare that.

These are the main remaining jobs aside from the following out of scope here:

So we should definitely focus on these while Azure folks work on migrating those.

I've also noticed that we'll have to be careful updating the prow deployment specs for the new cluster because, e.g., we gave the secrets clearer names and a different path for the GitHub token.

ameukam commented 4 months ago

IMHO, we can remove post-test-infra-upload-boskos-config. We no longer need to increase the boskos pool, and we may need to shut down the GCP projects that are part of it.

michelle192837 commented 4 months ago

Fixing TestGrid upload job today and cleaning up some of the image jobs/references.

BenTheElder commented 4 months ago

> IMHO, we can remove post-test-infra-upload-boskos-config. We no longer need to increase the boskos pool, and we may need to shut down the GCP projects that are part of it.

agreed, filed https://github.com/kubernetes/test-infra/pull/33121

michelle192837 commented 4 months ago

TestGrid upload progress:

    $ diff k8s-testgrid-config.textproto k8s-infra-testgrid-config.textproto

This produces no diffs.

(And these do have contents):

    $ wc -l k8s-testgrid-config.textproto
    519759 k8s-testgrid-config.textproto
    $ wc -l k8s-infra-testgrid-config.textproto
    519759 k8s-infra-testgrid-config.textproto
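
The manual check above (`diff` for content, `wc -l` to rule out two empty files that would also "produce no diffs") could be scripted so the comparison is repeatable before swapping the mergelist entry. A minimal sketch, assuming both textproto files have already been downloaded locally; `count_lines` and `configs_match` are hypothetical helpers, not part of test-infra:

```python
import filecmp
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count newline characters, which is what `wc -l` reports."""
    total = 0
    with path.open("rb") as f:
        # Read in 1 MiB chunks so a ~500k-line config never has to sit in one string.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            total += chunk.count(b"\n")
    return total


def configs_match(old: Path, new: Path) -> bool:
    """Hypothetical helper: True only if both files are non-empty and byte-identical.

    Mirrors the manual `diff` + `wc -l` check: two empty uploads would also
    compare equal, so emptiness is rejected explicitly first.
    """
    if old.stat().st_size == 0 or new.stat().st_size == 0:
        return False
    # shallow=False compares file contents, not just os.stat() signatures.
    return filecmp.cmp(old, new, shallow=False)
```

For the two files compared above, `configs_match(Path("k8s-testgrid-config.textproto"), Path("k8s-infra-testgrid-config.textproto"))` should return `True`, and `count_lines` should agree with the `wc -l` output.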



Now following the config merger instructions at https://github.com/kubernetes/test-infra/blob/master/testgrid/merging.md#config-merger. I'll have a few PRs out for those.

michelle192837 commented 4 months ago

Remaining from my list above:

| File path | Job |
|---|---|
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-alpine |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-git |
| config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-push-misc-images |

• post-test-infra-push-alpine just needs minor cleanup, then it can be deleted.
• post-test-infra-push-git can probably be deleted; its remaining use is as the base for certain Prow images. I can't switch them over immediately (integration tests fail when switching from the January image to a recent July image), but I believe switching to an image from the old location will have the same problem.
• post-test-infra-push-misc-images needs a fix (I think the most recent PR will fix it, but it needs a retrigger to verify that's the case), then the images need to be switched to the new location before the old job is turned down.

(And last bit of cleanup, move all the new image push jobs to the image-pushes dashboard and remove '-canary' from the job name).

michelle192837 commented 3 months ago

post-test-infra-push-misc-images technically passes, but it doesn't seem to be uploading new images? (I think the same is happening for the new prow images push, which does something similar.)

post-test-infra-push-alpine and post-test-infra-push-git I think we can delete for the reasoning above. The minor cleanup isn't blocking removing the old jobs.

michelle192837 commented 3 months ago

lol I lied, the misc-image canary is working fine. I'll switch those uses over today.

I'm still not seeing new Prow images uploaded to the new location though. (https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8s-infra-prow-images/1818232059856949248, https://storage.googleapis.com/kubernetes-jenkins/logs/post-k8s-infra-prow-images/1818232059856949248/artifacts/build.log for the build log). Since it's doing something similar to the misc-images push job, I might update it to be similar and see if that fixes it.

michelle192837 commented 3 months ago

Sorry about the confusion, the Prow images job has been working the whole time and I was just confused. (More detail in https://github.com/kubernetes-sigs/prow/pull/217#issuecomment-2266208571).

Anyways, remaining updates are:

I'll leave submission of those to Monday, but those should handle the last test-infra jobs that I think we're actually handling?

BenTheElder commented 3 months ago

| Secrets | File path | Job |
|---|---|---|
| (none) | config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-gencred-refresh-kubeconfig |
| (none) | config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-rotate-legacy-default-build-sa-json-key |
| (none) | config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-deploy-prow |
| (none) | config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-gencred-refresh-kubeconfig |
| kubeconfig-prow-services, oauth-token | config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | post-test-infra-reconcile-hmacs |
| oauth-token, k8s-ci-robot-ssh-keys | config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-autobump-prow |
| oauth-token, k8s-ci-robot-ssh-keys | config/jobs/kubernetes/test-infra/test-infra-trusted.yaml | ci-test-infra-autobump-prow-for-auto-deploy |

Of those, I think we might need reconcile-hmacs to move along with the new prow deployment?

Otherwise, I think the rest should be spun down just ahead of migrating prow, and stay in place in the meantime to keep the legacy instance humming.

https://github.com/kubernetes/test-infra/issues/33129 covers the janitor jobs.

BenTheElder commented 3 months ago

We only have these six left now:

cjwagner commented 3 months ago

> post-test-infra-reconcile-hmacs
>
> • Decision: keep until we're ready to migrate prow control plane, job will not migrate. (cc @cjwagner to confirm)

Yes, that does not need to migrate, assuming that the K8s-Infra Prow is using a GitHub App to manage webhooks (rather than manually configuring them per org or repo). IIRC someone confirmed this in the last SIG-Testing meeting.

The other decisions SGTM as well.
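
For context on why the GitHub App point matters: as we understand it, reconcile-hmacs keeps per-org/per-repo webhook HMAC secrets in sync, and those secrets are what GitHub uses to sign each webhook delivery so the receiver can reject forged payloads. With a GitHub App, the webhook secret is configured once on the App instead. A minimal sketch of the receiving-side check, following GitHub's documented `X-Hub-Signature-256` scheme (HMAC-SHA256 of the raw body, hex-encoded, prefixed with `sha256=`); this is an illustration, not Prow's implementation, and `sign`/`verify` are hypothetical names:

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hypothetical helper: compute an X-Hub-Signature-256 style value for a payload."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def verify(secret: bytes, body: bytes, header: str) -> bool:
    """Hypothetical helper: accept a delivery only if its header matches our own signature.

    compare_digest runs in constant time; encoding both sides to bytes avoids a
    TypeError if the incoming header contains non-ASCII characters.
    """
    expected = sign(secret, body).encode("utf-8")
    return hmac.compare_digest(expected, header.encode("utf-8"))
```

The design consequence for the migration: sender and receiver must share the same secret, and an App-level webhook leaves no per-repo webhook state for a reconcile job to maintain.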

michelle192837 commented 3 months ago

Now done thanks to Ben: https://github.com/kubernetes/test-infra/pull/33352