rjsadow opened this issue 1 year ago
/sig k8s-infra /kind cleanup
For kops, we have a separate issue: https://github.com/kubernetes/k8s.io/issues/5127 I think for the moment it's better to focus on the presubmits.
@dims Can I pick up the cluster-api jobs? I will migrate as much as I can.
@rayandas @rjsadow There is an ongoing effort around cluster-api. Please coordinate with the cluster-api maintainers.
Hello @dims, I am familiar with the kubernetes/sig-release jobs, so is it okay if I work on them?
cc @ameukam
@dims @ameukam I think the pull-test-infra-unit-test job requiring resource limits/requests for the community clusters is going to be a stumbling block for a lot of these migrations. Is there any existing documentation or resources that we can empower contributors to reference for setting initial values and then iterating? Or should we expect the SIG maintainers and leads to help provide input on a job-by-job basis?
Example in https://github.com/kubernetes/test-infra/pull/29724, where I WAG'd 0.5 CPU and 2 GB and asked the cluster lifecycle leads to review.
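For reference, that guess expressed as a resources stanza would look roughly like this (only the 0.5 CPU / 2 GB values come from the comment above; mirroring limits and requests is an assumption, not something stated in the thread):

```yaml
# Sketch of a prowjob container resources stanza using the WAG'd values;
# treat these as a starting point and tune after reviewing actual usage.
resources:
  requests:
    cpu: "500m"   # 0.5 CPU
    memory: "2Gi"
  limits:
    cpu: "500m"   # mirroring requests here is an assumption
    memory: "2Gi"
```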
I would like to learn more about this, so I will pick one: kubernetes-sigs/kustomize.
What needs to be done?
Fork and check out the kubernetes/test-infra repository, then follow the steps below:
- Find the name of the job you wish to check for a specified cluster, say `pull-jobset-test-integration-main`, from the "Search Results" link above.
- Find the yaml file in `kubernetes/test-infra` using our search.
- Edit the file from step 2: in the job definition, look for a `cluster:` key. If there isn't one, the job runs in the `default` cluster, so add `cluster: eks-prow-build-cluster` (a minimal sketch follows these steps). NOTE: if you see any entries under `label` that say `gce`, skip this job and go to the next one, as it is not ready to be moved yet.
- Save the file, commit the change, create a branch, and file a PR.
- Having trouble? Leave a note here in this issue and/or come to the #sig-k8s-infra or #sig-testing Slack channel to ask for help.
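As a concrete illustration of the edit in step 3, here is a minimal sketch (the job name comes from the steps above; the repo path, image, and command are hypothetical placeholders, not taken from the real config):

```yaml
presubmits:
  kubernetes-sigs/jobset:   # repo path assumed for illustration
    - name: pull-jobset-test-integration-main
      # The added line; if `cluster:` is absent, the job runs in the default cluster.
      cluster: eks-prow-build-cluster
      decorate: true
      spec:
        containers:
          - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest  # placeholder image
            command:
              - make
            args:
              - test-integration
```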
@dims hi, thanks for sharing the steps. I also noticed some PRs moving jobs to the new clusters by just specifying the new cluster name, while others specify resource quotas (CPU/memory) while migrating. Is the latter a requirement, or just generally recommended, while specifying the name alone is also enough?
Is the latter a requirement, or just generally recommended, while specifying the name alone is also enough?
The Google-owned clusters did not require any resource definitions, whereas the community-owned clusters do. If a job already has resource quotas (both requests and limits), then just the name is enough. If the job is missing any resource quotas, those will need to be added or the pull-test-infra-unit-test check will fail.
I will be working on kubernetes/node-problem-detector.
@rjsadow All jobs running against github.com/kubernetes-sigs/cluster-api-provider-azure using the prow presets preset-azure-cred-* are not eligible to migrate, AFAIK. They are using specific creds currently stored in a Google-owned GCP project.
We can try to sync the creds to the community infra before we do the migration.
Is there any existing documentation or resources that we can empower contributors to reference for setting initial values and then iterating? Or should we expect the SIG maintainers and leads to help provide input on a job-by-job basis?
SIG maintainers and leads also won't know how much resources are needed. As far as I can tell, this is pure guesswork, or in some cases (building Kubernetes from source) copy-and-pasted from other, older jobs where @BenTheElder at some point in the distant past somehow determined suitable values (at that time!).
If we want to properly utilize the cluster, we have to enable monitoring, check actual resource usage against the declared values, and adjust those eventually.
PSA for anyone seeking recommendations on capacity sizing for resource quota requirements: we currently lack a baseline understanding of resource usage for many prowjobs. If the PR author, SIG maintainers, or leads don't provide recommended values, we suggest starting with 2 CPUs and 4 GB RAM (see the sketch below).
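As a sketch, those suggested starting values translate to the following stanza (setting limits equal to requests is a common starting point, not a stated requirement):

```yaml
# Starting values when no better numbers are available: 2 CPUs, 4 GB RAM.
# Revisit these against observed usage once the job runs on the EKS cluster.
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
```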
Once the jobs transition to the EKS cluster, resource usage monitoring can be done through Grafana at https://monitoring-eks.prow.k8s.io/d/96Q8oOOZk/builds?from=now-24h&to=now. The repositories and jobs will automatically be added to this monitoring system. Additional tuning for resource values may be necessary after the initial merge is complete. Please continue to monitor the jobs and ensure their resource usage is appropriately sized. Don't hesitate to reach out to SIG-K8s-Infra or SIG-Testing with any concerns.
I will take up kubernetes/kubernetes as my next file :+1:
@ameukam @rjsadow For the following repos, I have already moved the jobs that we can move to the AWS cluster
@cpanato Can you tag this issue in the PRs related to the job migrations? I just want to link issues and PRs easily.
@ameukam added the PR above
Regarding resource requirements see also https://github.com/kubernetes/test-infra/issues/28800
Just an FYI:
We've had to revert one of the jobs used by image-builder that relies on a secret that isn't migrated to the new eks-prow-build-cluster.
See the PR here, the related issue here, and the Slack discussion here.
Edit:
The secret in question is clusterapi-provider-vsphere-ci-prow, which is part of the cluster-api-provider-vsphere-e2e-config preset.
It seems that these credentials are for a vSphere environment not currently under community control, and as such the credentials haven't been moved across.
@iftachk It seems like the presubmit jobs for Cluster API Add-on Provider Helm aren't working. I'm getting an error that the pods failed to start on the PR jobs.
How large are the nodes in the EKS cluster, i.e. what is the maximum resource requirement that a job can request?
How large are the nodes in the EKS cluster, i.e. what is the maximum resource requirement that a job can request?
16 vCPUs, 128 GB RAM, 300 GB NVMe SSD per https://github.com/kubernetes/test-infra/blob/master/docs/eks-jobs-migration.md?plain=1#L21
Here's a list of outstanding PRs that are waiting to be reviewed:
For folks still interested in helping with this migration, I've added a new tool to k/test-infra to help identify repository status and which jobs are eligible to be transitioned from the GKE cluster to the community clusters. If you're interested, please take a look at these few gists to see how the tool works. We are getting towards the end of the jobs that are currently eligible for transition, so we'll need to discuss how to handle follow-on efforts (like https://github.com/kubernetes/test-infra/issues/30277) soon.
https://gist.github.com/rjsadow/c5d1695c73842e53449a772f00b00116 https://gist.github.com/rjsadow/d6cfe557f6ef3fa1cd2838053e3bb1fd https://gist.github.com/rjsadow/83c50e252fe0534e03d5aa2eb72503eb
Relevant PRs for the tool and additional info: https://github.com/kubernetes/test-infra/pull/30442 https://github.com/kubernetes/test-infra/pull/30467 https://kubernetes.slack.com/archives/CCK68P2Q2/p1692149062215619
/lifecycle frozen
We're also keeping this list up to date: https://github.com/kubernetes/test-infra/blob/master/docs/job-migration-todo.md
filed a few warnings: https://github.com/kubernetes/kops/issues/16637 https://github.com/kubernetes-sigs/node-feature-discovery/issues/1747 https://github.com/kubernetes/node-problem-detector/issues/920
Plus messages in Slack, in addition to the kubernetes-dev emails.
In an ongoing effort to migrate to community-owned resources, SIG K8S Infra and SIG Testing are working to complete the migration of jobs from the Google-owned internal GKE cluster to community-owned clusters.
All jobs in the Prow Default Cluster that do not depend on external cloud assets should attempt to migrate to `cluster: eks-prow-build-cluster`.

What needs to be done?

To get started, please see eks-jobs-migration for details.
Fork and check out the kubernetes/test-infra repository, then follow the steps below:

- Find the name of the job you wish to check, say `pull-jobset-test-integration-main`, from the "Prow Results" link below.
- Find the yaml file where `pull-jobset-test-integration-main` is defined from the "Search Results" link.
- In the job definition, look for a `cluster:` key. If there isn't one, the job runs in the default cluster, so add `cluster: eks-prow-build-cluster`. NOTE: if you see any entries under `label` that say `gce`, skip this job and go to the next one, as it is not ready to be moved yet.

NOTE: The Google-owned clusters did not require any resource definitions, whereas the community-owned clusters do. If your merge is failing the `pull-test-infra-unit-test` job, please add CPU/memory requests/limits. Work with the appropriate SIG owners to determine the necessary capacity for each job.

Below is a list of repos that currently have jobs in the default cluster.
Repos with default cluster jobs found
- kubernetes-sigs/azuredisk-csi-driver | Prow Results | Search Results | Jobs require azure credentials
- kubernetes-sigs/azurefile-csi-driver | Prow Results | Search Results | Jobs require azure credentials
- kubernetes-sigs/cloud-provider-azure | Prow Results | Search Results | Jobs require azure credentials
- kubernetes-sigs/gcp-compute-persistent-disk-csi-driver | Prow Results | Search Results | Jobs require GCP
- kubernetes-sigs/gcp-filestore-csi-driver | Prow Results | Search Results | Jobs require GCP
- kubernetes-sigs/kube-storage-version-migrator | Prow Results | Search Results