kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

[Umbrella Issue] Migrate prow jobs to community clusters #29722

Open rjsadow opened 1 year ago

rjsadow commented 1 year ago

In an ongoing effort to migrate to community-owned resources, SIG K8S Infra and SIG Testing are working to complete the migration of jobs from the Google-owned internal GKE cluster to community-owned clusters.

All jobs in the Prow Default Cluster that do not depend on external cloud assets should attempt to migrate to cluster: eks-prow-build-cluster.

What needs to be done?

To get started please see eks-jobs-migration for details.

Fork and check out the kubernetes/test-infra repository, then follow the steps below:

  1. Pick a job from the "Prow Results" link below, say pull-jobset-test-integration-main, to check whether it has a cluster specified.
  2. Open the file that defines pull-jobset-test-integration-main (use the "Search Results" link) and look for a cluster: key in the job definition. If there isn't one, the job runs in the default cluster, so add cluster: eks-prow-build-cluster. NOTE: if any entry under labels mentions gce, skip this job and go to the next one, as it is not ready to be moved yet.
  3. Save the file, commit the change, create a branch and file a PR
  4. Having trouble? Leave a note here in this issue and/or come to #sig-k8s-infra or #sig-testing slack channel to ask for help
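
As a rough sketch of the check in step 2, the snippet below models a job definition as a plain Python dict (the real definitions are YAML files in kubernetes/test-infra). The helper name `plan_migration` is hypothetical, and treating "gce entries" as label keys that contain the substring `gce` is an assumption, not something taken from the repository's tooling.

```python
TARGET_CLUSTER = "eks-prow-build-cluster"


def plan_migration(job):
    """Return (action, job) for one job definition modelled as a dict."""
    # A job that already names a cluster is not running in the default cluster.
    if "cluster" in job:
        return "already-set", job
    # Assumption: gce entries show up as label keys containing "gce".
    labels = job.get("labels", {})
    if any("gce" in key for key in labels):
        return "skip-gce", job
    # Otherwise add the cluster key; copy so the input dict is left untouched.
    migrated = dict(job)
    migrated["cluster"] = TARGET_CLUSTER
    return "migrate", migrated


job = {"name": "pull-jobset-test-integration-main", "labels": {}}
action, updated = plan_migration(job)
print(action, updated.get("cluster"))
```

The actual edit is still a one-line change to the YAML file; the sketch only spells out the decision being made.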

NOTE: The Google-owned clusters did not require any resource definitions whereas the community-owned clusters do. If your merge is failing the pull-test-infra-unit-test job, please add CPU/Memory requests/limits. Work with the appropriate sig owners to determine the necessary capacity for each job.

Below is a list of repos that currently have jobs in the default cluster.

Repos with default cluster jobs found

rjsadow commented 1 year ago

/sig k8s-infra /kind cleanup

ameukam commented 1 year ago

For kops, we have a separate issue: https://github.com/kubernetes/k8s.io/issues/5127 I think for the moment it's better to focus on the presubmits.

rayandas commented 1 year ago

@dims Can I pick up the cluster-api jobs? I will migrate as much as I can.

ameukam commented 1 year ago

@dims Can I pick up the cluster-api jobs? I will migrate as much as I can.

@rayandas @rjsadow There is an ongoing effort for cluster-api. Please coordinate with the cluster-api maintainers.

ArkaSaha30 commented 1 year ago

Hello @dims, I am familiar with kubernetes/sig-release jobs so is it okay to work on it? cc @ameukam

rjsadow commented 1 year ago

@dims @ameukam I think the pull-test-infra-unit-test job requiring resource limits/requests for the community clusters is going to be a stumbling block for a lot of these migrations. Is there any existing documentation or other resource that contributors can reference for setting initial values and then iterating? Or should we expect the SIG maintainers and leads to help provide input on a job-by-job basis?

Example in https://github.com/kubernetes/test-infra/pull/29724 where I WAG'd .5CPU and 2GB and asked the cluster lifecycle leads to review.

ShivamTyagi12345 commented 1 year ago

I would like to learn more about this, so I will pick one: kubernetes-sigs/kustomize | Search Results |

furkatgofurov7 commented 1 year ago

What needs to be done?

Fork and check out the kubernetes/test-infra repository, then follow the steps below:

  1. Pick a job from the "Search Results" link above, say pull-jobset-test-integration-main, to check whether it has a cluster specified.
  2. Find the yaml file in kubernetes/test-infra using our search:

  3. Edit the file from step 2 and look for a cluster: key in the job definition. If there isn't one, the job runs in the default cluster, so add cluster: eks-prow-build-cluster. NOTE: if any entry under labels mentions gce, skip this job and go to the next one, as it is not ready to be moved yet.
  4. Save the file, commit the change, create a branch and file a PR
  5. Having trouble? Leave a note here in this issue and/or come to #sig-k8s-infra or #sig-testing slack channel to ask for help

@dims hi, thanks for sharing the steps. I also noticed that some PRs move jobs to the new clusters by just specifying the new cluster name, while others also specify resource quotas (CPU/memory) while migrating. Is the latter a requirement, or is it only recommended and specifying just the name is enough?

rjsadow commented 1 year ago

Is the latter a requirement, or is it only recommended and specifying just the name is enough?

The Google-owned clusters did not require any resource definitions whereas the community-owned clusters do. If a job already has resource quotas (both requests and limits) then just the name is enough. If the job is missing any resource quotas then those will need to be added or else the pull-test-infra-unit-test check will fail.
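
To make "both requests and limits" concrete, here is a minimal sketch that lists which entries a job is missing. It models the job as a Python dict shaped like a pod spec (`spec.containers[].resources`). The function name `missing_resources` is hypothetical, and the rule that both cpu and memory must appear in both sections is an assumption based on the comment above; the real pull-test-infra-unit-test check may differ in detail.

```python
REQUIRED = ("cpu", "memory")


def missing_resources(job):
    """List the resource entries each container of a job still lacks."""
    problems = []
    containers = job.get("spec", {}).get("containers", [])
    for index, container in enumerate(containers):
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            values = resources.get(section, {})
            for name in REQUIRED:
                if name not in values:
                    problems.append(
                        f"containers[{index}].resources.{section}.{name}"
                    )
    return problems


job = {
    "name": "example-job",  # hypothetical job name
    "cluster": "eks-prow-build-cluster",
    "spec": {"containers": [{"resources": {"requests": {"cpu": "2"}}}]},
}
for problem in missing_resources(job):
    print("missing:", problem)
```

An empty result means that, under these assumptions, adding the cluster name alone should be enough.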

Vyom-Yadav commented 1 year ago

I will be working on kubernetes/node-problem-detector | Prow Results | Search Results

ameukam commented 1 year ago

@rjsadow All jobs running against github.com/kubernetes-sigs/cluster-api-provider-azure using prow presets preset-azure-cred-* are not eligible to migrate AFAIK. They are using specific creds currently stored in a Google-owned GCP project. We can try to sync the creds to the community infra before we do the migration.
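
A quick way to spot such jobs, sketched under the assumption that presets are selected through keys in the job's labels map: the helper below returns any label keys starting with `preset-azure-cred-`. The function name `blocking_presets` and the sample label are hypothetical.

```python
# Label prefixes that mark a job as not yet eligible, per the comment above.
BLOCKING_PREFIXES = ("preset-azure-cred-",)


def blocking_presets(job):
    """Return the sorted label keys that keep a job from being migrated."""
    labels = job.get("labels", {})
    # str.startswith accepts a tuple, so more prefixes can be added later.
    return sorted(key for key in labels if key.startswith(BLOCKING_PREFIXES))


job = {
    "name": "example-capz-job",  # hypothetical job name
    "labels": {"preset-azure-cred-example": "true"},  # hypothetical label
}
print(blocking_presets(job))
```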

pohly commented 1 year ago

Is there any existing documentation or other resource that contributors can reference for setting initial values and then iterating? Or should we expect the SIG maintainers and leads to help provide input on a job-by-job basis?

SIG maintainers and leads also won't know how many resources are needed. As far as I can tell, this is pure guesswork or, in some cases (building Kubernetes from source), copied and pasted from other, older jobs where @BenTheElder at some point in the distant past somehow determined suitable values (at that time!).

If we want to properly utilize the cluster, we have to enable monitoring, check actual resource usage against the declared values, and adjust those eventually.

rjsadow commented 1 year ago

PSA For anyone seeking recommendations on capacity sizing for resource quota requirements.

We currently lack a baseline understanding of resource usage for many prowjobs. If the PR author, SIG maintainers, or leads don't provide recommended values, we suggest starting with 2 CPUs and 4 GB RAM.
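
A minimal sketch of applying that starting point to a container that has no values yet. The container is modelled as a Python dict; the helper name `with_default_resources` is hypothetical. Writing "4 GB" as the quantity `4Gi` and setting limits equal to requests are both assumptions, so adjust them to whatever the job owners prefer.

```python
import copy

# Suggested starting point from the comment above; "4Gi" stands in for 4 GB.
DEFAULTS = {"cpu": "2", "memory": "4Gi"}


def with_default_resources(container):
    """Return a copy of the container with missing resource values filled in."""
    updated = copy.deepcopy(container)
    resources = updated.setdefault("resources", {})
    for section in ("requests", "limits"):
        values = resources.setdefault(section, {})
        for name, quantity in DEFAULTS.items():
            # setdefault keeps any value the job already declares.
            values.setdefault(name, quantity)
    return updated


container = {"resources": {"requests": {"cpu": "500m"}}}
print(with_default_resources(container)["resources"])
```

Existing values are kept as they are, so only the gaps are filled.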

Once the jobs transition to the EKS cluster, resource usage monitoring can be done through Grafana at https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?from=now-24h&to=now. The repositories and jobs will automatically be added to this monitoring system. Additional tuning for resource values may be necessary after the initial merge is complete. Please continue to monitor the jobs and ensure their resource usage is appropriately sized. Don't hesitate to reach out to SIG-K8s-Infra or SIG-Testing with any concerns.

ShivamTyagi12345 commented 1 year ago

I will take up kubernetes/kubernetes | Prow Results | Search Results | as my next file :+1:

cpanato commented 1 year ago

@ameukam @rjsadow for the following repos I already moved the jobs that we can move to the AWS cluster

kubernetes-sigs/cluster-api-provider-digitalocean

kubernetes-sigs/cluster-api-provider-gcp

ameukam commented 1 year ago

@cpanato can you tag this issue in the PRs related to the jobs migration? Just want to link issues and PRs easily.

cpanato commented 1 year ago

@cpanato can you tag this issue in the PRs related to the jobs migration? Just want to link issues and PRs easily.

@ameukam added the PR above

pohly commented 1 year ago

Regarding resource requirements see also https://github.com/kubernetes/test-infra/issues/28800

AverageMarcus commented 1 year ago

Just an FYI:

We've had to revert one of the jobs used by image-builder that relies on a secret that isn't migrated to the new eks-prow-build-cluster.

See PR here, related issue here and Slack discussion here.

Edit:

The secret in question being clusterapi-provider-vsphere-ci-prow which is part of the cluster-api-provider-vsphere-e2e-config preset.

It seems that these credentials are for a vSphere environment not currently under community control, and as such the credentials haven't been moved across.

Jont828 commented 12 months ago

@iftachk It seems like the presubmit jobs for Cluster API Add-on Provider Helm aren't working. I'm getting an error that the pods failed to start on the PR jobs.

pohly commented 11 months ago

How large are the nodes in the EKS cluster, i.e. what is the maximum resource requirement that a job can request?

rjsadow commented 11 months ago

How large are the nodes in the EKS cluster, i.e. what is the maximum resource requirement that a job can request?

16 vCPUs, 128 GB RAM, 300 GB NVMe SSD per https://github.com/kubernetes/test-infra/blob/master/docs/eks-jobs-migration.md?plain=1#L21
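
For a rough sanity check of a job's requests against those node sizes, something like the sketch below could be used. The function names are hypothetical, only a subset of Kubernetes quantity suffixes is handled, and "128 GB" is read as 128 GiB, which is an assumption. The amount a pod can actually request will be lower than the raw node capacity, since part of each node is reserved for the system.

```python
NODE_CPU = 16.0             # vCPUs per node, from the figures above
NODE_MEMORY = 128 * 2**30   # assumption: "128 GB" read as 128 GiB

# Subset of Kubernetes quantity suffixes, enough for typical job definitions.
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "k": 10**3, "M": 10**6, "G": 10**9,
}


def parse_cpu(quantity):
    """Convert a CPU quantity such as '500m' or '2' to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def parse_memory(quantity):
    """Convert a memory quantity such as '4Gi' or '512Mi' to bytes."""
    # Check two-letter suffixes first so that 'Gi' is not mistaken for 'G'.
    for suffix in sorted(_MEMORY_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * _MEMORY_UNITS[suffix]
    return float(quantity)  # no suffix: plain bytes


def fits_on_node(requests):
    """True if the cpu and memory requests do not exceed raw node capacity."""
    return (
        parse_cpu(requests["cpu"]) <= NODE_CPU
        and parse_memory(requests["memory"]) <= NODE_MEMORY
    )


print(fits_on_node({"cpu": "7", "memory": "32Gi"}))
print(fits_on_node({"cpu": "17", "memory": "1Gi"}))
```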

rjsadow commented 11 months ago

Here's a list of outstanding PRs that are waiting to be reviewed:

rjsadow commented 10 months ago

For folks still interested in helping with this migration, I've added a new tool to k/test-infra to help identify repository status and which jobs are eligible to be transitioned to the community clusters from the GKE cluster. If interested, please take a look at these few gists to see how the tool works. We are getting towards the end of the jobs that are currently eligible for transition, so we'll need to discuss how to handle follow-on efforts (like https://github.com/kubernetes/test-infra/issues/30277) soon.

https://gist.github.com/rjsadow/c5d1695c73842e53449a772f00b00116 https://gist.github.com/rjsadow/d6cfe557f6ef3fa1cd2838053e3bb1fd https://gist.github.com/rjsadow/83c50e252fe0534e03d5aa2eb72503eb

Relevant PRs for the tool and additional info: https://github.com/kubernetes/test-infra/pull/30442 https://github.com/kubernetes/test-infra/pull/30467 https://kubernetes.slack.com/archives/CCK68P2Q2/p1692149062215619

BenTheElder commented 3 months ago

/lifecycle frozen

BenTheElder commented 2 months ago

xref: https://github.com/kubernetes/test-infra/issues/32432

BenTheElder commented 4 days ago

We're also keeping this list up to date: https://github.com/kubernetes/test-infra/blob/master/docs/job-migration-todo.md

BenTheElder commented 4 days ago

filed a few warnings: https://github.com/kubernetes/kops/issues/16637 https://github.com/kubernetes-sigs/node-feature-discovery/issues/1747 https://github.com/kubernetes/node-problem-detector/issues/920

+ messages in slack, in addition to the kubernetes-dev emails