kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

Migrate merge-blocking jobs to dedicated cluster: pull-kubernetes-conformance-kind-ga-only-parallel #18850

Closed · spiffxp closed this issue 4 years ago

spiffxp commented 4 years ago

What should be cleaned up or changed:

This is part of #18550

To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you're not a member.
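The change is just adding your email under the group's `members` list, roughly like this (a sketch from memory; follow the exact field names of the existing entry in the file):

```yaml
# groups/groups.yaml (sketch only; copy the style of the existing entry)
- email-id: k8s-infra-prow-viewers@kubernetes.io
  name: k8s-infra-prow-viewers
  settings:
    ReconcileMembers: "true"
  members:
    - your-email@example.com  # add yourself here
```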

Migrate `pull-kubernetes-conformance-kind-ga-only-parallel` to k8s-infra-prow-build by adding a `cluster: k8s-infra-prow-build` field to the job, e.g.:
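A sketch of the one-line addition (all other job fields stay untouched):

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-conformance-kind-ga-only-parallel
    cluster: k8s-infra-prow-build  # new: schedule this job on the dedicated build cluster
    # ...all other job fields (decorate, branches, spec, etc.) unchanged
```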

Once the PR has merged, note the date/time it merged. This will allow you to compare before/after behavior.

Things to watch for the job:

- does the job start failing more often?
- does the job duration look worse than before? spikier than before?
- do more failures show up than before?
- is the job wildly underutilizing its memory limit? if so, perhaps tune down (if uncertain, post evidence in this issue and ask)

Things to watch for the build cluster

Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.

/wg k8s-infra
/sig testing
/area jobs
/help

neolit123 commented 4 years ago

/assign
/remove-help

neolit123 commented 4 years ago

PR: https://github.com/kubernetes/test-infra/pull/18872

neolit123 commented 4 years ago

observations:

> does the job start failing more often?

the failure rate does not seem to have decreased. given that this job tests GA-only features and we are in code freeze, i'd attribute the failures to flakes rather than to feature changes breaking the job.

> does the job duration look worse than before? spikier than before?

the overall duration seems to have decreased to ~20 minutes on average, from ~26-30 minutes before.

> do more failures show up than before?

doesn't seem like it; possibly more runs on the new cluster are needed to say for sure.

> is the job wildly underutilizing its memory limit? if so, perhaps tune down (if uncertain, post evidence in this issue and ask)

i'd assume this kind cluster has 1 control-plane + 2 worker nodes. per https://github.com/kubernetes/test-infra/pull/18872/files#diff-e4b92f7fa3467cd10631b29b58d683daR290-R298 the job requests 4 CPUs and 9Gi of memory. my estimate is that this is not badly underutilized, but the resource requests could possibly be reduced. a potential experiment is to cut the requests in half while keeping the limits; see the sketch below.
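for reference, that experiment would look roughly like this in the job's container resources (a sketch; values taken from the numbers above, and i'm assuming the limits currently equal the requests):

```yaml
resources:
  requests:
    cpu: "2"         # half of the current 4
    memory: "4608Mi" # half of the current 9Gi
  limits:
    cpu: "4"         # limits left unchanged
    memory: "9Gi"
```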

neolit123 commented 4 years ago

looks like i need to PR myself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 to see the rest of the links.

EDIT: https://github.com/kubernetes/k8s.io/pull/1165

neolit123 commented 4 years ago

metrics explorer

i hope i'm reading the data correctly. it seems the memory and CPU "limit utilization" peaks around ~0.5-0.6, without spikes above ~0.8 (1.0 == the limit). this seems fine - minor adjustments are possible, but the resources are not badly underutilized.

(screenshots: CPU and memory limit-utilization graphs from metrics explorer)

spiffxp commented 4 years ago

/close

Agreed, this looks good. Thanks for your help!

k8s-ci-robot commented 4 years ago

@spiffxp: Closing this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/18850#issuecomment-683195379):

> /close
> Agreed, this looks good. Thanks for your help!

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.