kubernetes / test-infra

Test infrastructure for the Kubernetes project.

Migrate merge-blocking jobs to dedicated cluster: pull-kubernetes-node-e2e #18851

Closed · spiffxp closed this issue 4 years ago

spiffxp commented 4 years ago

What should be cleaned up or changed:

This is part of #18550

To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you're not a member.
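As a rough illustration, adding yourself is a one-line change to the group's `members` list. The group fields and existing entries below are a sketch based on the file's format, not verbatim contents:

```yaml
groups:
  - email-id: k8s-infra-prow-viewers@kubernetes.io
    name: k8s-infra-prow-viewers
    members:
      - existing-member@example.com    # existing entries, illustrative only
      - your-email@example.com         # <- the line you add for yourself
```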

Migrate pull-kubernetes-node-e2e to k8s-infra-prow-build by adding a `cluster: k8s-infra-prow-build` field to the job, e.g.:
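A hedged sketch of the change, assuming the job is defined as a presubmit under `config/jobs/` in this repo; every field other than `cluster` is illustrative and the existing job definition stays as-is:

```yaml
presubmits:
  kubernetes/kubernetes:
    - name: pull-kubernetes-node-e2e
      cluster: k8s-infra-prow-build  # the migration: schedule this job on the dedicated build cluster
      decorate: true
      # ... remaining job definition unchanged
```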

NOTE: migrating this job is not as straightforward as some of the other #18550 issues, because we also need to:

Once the PR has merged, note the date/time it merged. This will allow you to compare before/after behavior.

Things to watch for the job

Things to watch for the build cluster

Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.

/wg k8s-infra
/sig testing
/area jobs
/help

spiffxp commented 4 years ago

/remove-help
/assign

spiffxp commented 4 years ago

Opened https://github.com/kubernetes/test-infra/pull/18915

RobertKielty commented 4 years ago

@spiffxp I marked this as In Progress based on #18915 having merged.

When can we call this complete?

spiffxp commented 4 years ago

The PR merged on 2020-08-19, which is too long ago to cleanly show before/after data using testgrid or prow.k8s.io.

From a local Grafana instance I have that runs queries against k8s-gubernator:build, it looks like the job runs more reliably, with a comparable failure rate, under load.

[Screenshot: Grafana graph of pull-kubernetes-node-e2e runs, taken 2020-09-09]

https://storage.googleapis.com/k8s-gubernator/triage/index.html?date=2020-08-30&pr=1&job=pull-kubernetes-node-e2e

A screenshot of triage from 2020-08-30 is early enough to pick up the before/after performance, and things look no worse as far as I can see. I'm guessing the spike of failures immediately after is unrelated, or has been corrected since then.

[Screenshot: triage results for pull-kubernetes-node-e2e, taken 2020-09-09]

CPU limit usage

CPU limit looks reasonable. As with other jobs, we need most of the CPU up front for building; in this case all the testing CPU usage happens on nodes spun up elsewhere. If we had a shared build we could take the CPU requirements way down.

[Screenshot: CPU limit usage graph, taken 2020-09-09]

Memory limit usage

Same story with memory limit usage.

[Screenshot: memory limit usage graph, taken 2020-09-09]
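For context, the CPU and memory limits discussed above refer to the `resources` block on the job's container spec; jobs on the dedicated cluster are expected to declare them explicitly. A rough sketch of where they live, with an illustrative image name and values rather than this job's actual settings:

```yaml
spec:
  containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-master  # image name illustrative
      resources:
        requests:
          cpu: "4"        # sized for the build phase, per the observation above
          memory: "8Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
```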

spiffxp commented 4 years ago

/close

I think this is good enough.

Apologies for falling behind on this one; it should have been in Monitoring, and I just didn't have time to sit still and check in on it until now.

k8s-ci-robot commented 4 years ago

@spiffxp: Closing this issue.

In response to [this](https://github.com/kubernetes/test-infra/issues/18851#issuecomment-689864856):

> /close
>
> I think this is good enough
>
> Apologies for falling behind on this one, it should have been in Monitoring, and I just didn't have time to sit still and check in on it until now.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.