kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

Migrate merge-blocking jobs to dedicated cluster: pull-kubernetes-conformance-kind-ga-only-parallel #18850

Closed · spiffxp closed this issue 4 years ago

spiffxp commented 4 years ago

What should be cleaned up or changed:

This is part of #18550

To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you're not a member.
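The change is just adding your email under the group's `members` list, roughly like this (a sketch from memory; follow the exact field names of the existing entry in the file):

```yaml
# groups/groups.yaml (sketch only; copy the style of the existing entry)
- email-id: k8s-infra-prow-viewers@kubernetes.io
  name: k8s-infra-prow-viewers
  settings:
    ReconcileMembers: "true"
  members:
    - your-email@example.com  # add yourself here
```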

Migrate `pull-kubernetes-conformance-kind-ga-only-parallel` to k8s-infra-prow-build by adding a `cluster: k8s-infra-prow-build` field to the job, e.g.:
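A sketch of the one-line addition (all other job fields stay untouched):

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-conformance-kind-ga-only-parallel
    cluster: k8s-infra-prow-build  # new: schedule this job on the dedicated build cluster
    # ...all other job fields (decorate, branches, spec, etc.) unchanged
```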

Once the PR has merged, note the date/time it merged. This will allow you to compare before/after behavior.

Things to watch for the job:

- does the job start failing more often?
- does the job duration look worse than before? spikier than before?
- do more failures show up than before?
- is the job wildly underutilizing its memory limit? if so, perhaps tune down (if uncertain, post evidence in this issue and ask)

Things to watch for the build cluster

Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.

/wg k8s-infra
/sig testing
/area jobs
/help

neolit123 commented 4 years ago

/assign
/remove-help

neolit123 commented 4 years ago

PR: https://github.com/kubernetes/test-infra/pull/18872

neolit123 commented 4 years ago

observations:

> does the job start failing more often?

the failure rate does not seem to have decreased. given that this job tests GA-only features and we are in code freeze, i'd attribute the failures to flakes rather than to feature changes breaking the job.

> does the job duration look worse than before? spikier than before?

the overall duration seems to have decreased to ~20 minutes on average, from ~26-30 minutes before.

> do more failures show up than before?

doesn't seem like it; possibly more runs on the new cluster are needed to say for sure.

> is the job wildly underutilizing its memory limit? if so, perhaps tune down (if uncertain, post evidence in this issue and ask)

i'd assume this kind cluster has 1 control-plane + 2 worker nodes. per https://github.com/kubernetes/test-infra/pull/18872/files#diff-e4b92f7fa3467cd10631b29b58d683daR290-R298 the job requests 4 CPUs and 9Gi of memory. my estimate is that this is not badly underutilized, but the resource requests could possibly be reduced. a potential experiment is to cut the requests in half while keeping the limits; see the sketch below.
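for reference, that experiment would look roughly like this in the job's container resources (a sketch; values taken from the numbers above, and i'm assuming the limits currently equal the requests):

```yaml
resources:
  requests:
    cpu: "2"         # half of the current 4
    memory: "4608Mi" # half of the current 9Gi
  limits:
    cpu: "4"         # limits left unchanged
    memory: "9Gi"
```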

neolit123 commented 4 years ago

looks like i need to PR myself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 to see the rest of the links.

EDIT: https://github.com/kubernetes/k8s.io/pull/1165

neolit123 commented 4 years ago

metrics explorer

i hope i'm reading the data correctly. it seems the memory and CPU "limit utilization" peaks around ~0.5-0.6, without spikes above ~0.8 (1.0 == the limit). this seems fine - minor adjustments are possible, but the resources are not badly underutilized.

(screenshots: CPU and memory limit-utilization graphs from metrics explorer)

spiffxp commented 4 years ago

/close

Agreed, this looks good. Thanks for your help!

k8s-ci-robot commented 4 years ago

@spiffxp: Closing this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/18850#issuecomment-683195379):

> /close
> Agreed, this looks good. Thanks for your help!

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.