kubernetes / test-infra

Test infrastructure for the Kubernetes project.

Migrate merge-blocking jobs to dedicated cluster: pull-kubernetes-node-e2e #18851

Closed · spiffxp closed this issue 4 years ago

spiffxp commented 4 years ago

What should be cleaned up or changed:

This is part of #18550

To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you're not a member.
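As a rough illustration, adding yourself is a one-line change to the group's `members` list. The group fields and existing entries below are a sketch based on the file's format, not verbatim contents:

```yaml
groups:
  - email-id: k8s-infra-prow-viewers@kubernetes.io
    name: k8s-infra-prow-viewers
    members:
      - existing-member@example.com    # existing entries, illustrative only
      - your-email@example.com         # <- the line you add for yourself
```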

Migrate pull-kubernetes-node-e2e to k8s-infra-prow-build by adding a `cluster: k8s-infra-prow-build` field to the job, e.g.:
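A hedged sketch of the change, assuming the job is defined as a presubmit under `config/jobs/` in this repo; every field other than `cluster` is illustrative and the existing job definition stays as-is:

```yaml
presubmits:
  kubernetes/kubernetes:
    - name: pull-kubernetes-node-e2e
      cluster: k8s-infra-prow-build  # the migration: schedule this job on the dedicated build cluster
      decorate: true
      # ... remaining job definition unchanged
```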

NOTE: migrating this job is not as straightforward as some of the other #18550 issues, because we also need to:

Once the PR has merged, note the date/time it merged. This will allow you to compare before/after behavior.

Things to watch for the job

Things to watch for the build cluster

Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.

/wg k8s-infra
/sig testing
/area jobs
/help

spiffxp commented 4 years ago

/remove-help
/assign

spiffxp commented 4 years ago

Opened https://github.com/kubernetes/test-infra/pull/18915

RobertKielty commented 4 years ago

@spiffxp I marked this as In Progress based on #18915 having merged.

When can we call this complete?

spiffxp commented 4 years ago

The PR merged on 2020-08-19, which is too long ago to cleanly show before/after data using testgrid or prow.k8s.io.

From a local Grafana instance I have that runs queries against k8s-gubernator:build, it looks like the job runs more reliably, with a comparable failure rate, under load.

[Screenshot: Grafana graph of pull-kubernetes-node-e2e runs, taken 2020-09-09]

https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-08-30&pr=1&job=pull-kubernetes-node-e2e

A screenshot of triage from 2020-08-30 is early enough to pick up the before/after performance, and things look no worse as far as I can see. I'm guessing the spike of failures immediately after is unrelated, or has been corrected since then.

[Screenshot: triage results for pull-kubernetes-node-e2e, taken 2020-09-09]

CPU limit usage

CPU limit looks reasonable. As with other jobs, we need most of the CPU up front for building; in this case all the testing CPU usage happens on nodes spun up elsewhere. If we had a shared build we could take the CPU requirements way down.

[Screenshot: CPU limit usage graph, taken 2020-09-09]

Memory limit usage

Same story with memory limit usage.

[Screenshot: memory limit usage graph, taken 2020-09-09]
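For context, the CPU and memory limits discussed above refer to the `resources` block on the job's container spec; jobs on the dedicated cluster are expected to declare them explicitly. A rough sketch of where they live, with an illustrative image name and values rather than this job's actual settings:

```yaml
spec:
  containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master  # image name illustrative
      resources:
        requests:
          cpu: "4"        # sized for the build phase, per the observation above
          memory: "8Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
```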

spiffxp commented 4 years ago

/close

I think this is good enough.

Apologies for falling behind on this one; it should have been in Monitoring, and I just didn't have time to sit still and check in on it until now.

k8s-ci-robot commented 4 years ago

@spiffxp: Closing this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/18851#issuecomment-689864856):

> /close
>
> I think this is good enough
>
> Apologies for falling behind on this one, it should have been in Monitoring, and I just didn't have time to sit still and check in on it until now.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.