kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

Migrate merge-blocking jobs to dedicated cluster: pull-kubernetes-e2e-gce #18852

Closed: spiffxp closed this issue 4 years ago

spiffxp commented 4 years ago

What should be cleaned up or changed:

This is part of #18550

To properly monitor the outcome of this, you should be a member of k8s-infra-prow-viewers@kubernetes.io. PR yourself into https://github.com/kubernetes/k8s.io/blob/master/groups/groups.yaml#L603-L628 if you're not a member.

Migrate pull-kubernetes-e2e-gce to k8s-infra-prow-build by adding a cluster: k8s-infra-prow-build field to the job:
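As a rough illustration of that change, the sketch below models the presubmit as a plain Python dict and sets the field. The dict is a hypothetical, heavily trimmed stand-in for the real job config, and `migrate_to_cluster` is a made-up helper name; only the job name, the `cluster` key, and the cluster name come from this issue.

```python
import copy

# Hypothetical, heavily trimmed stand-in for the presubmit's Prow config.
# Only the job name is taken from this issue; the other values are illustrative.
job = {
    "name": "pull-kubernetes-e2e-gce",
    "always_run": True,
    "branches": ["release-1.18"],
}


def migrate_to_cluster(job: dict, cluster: str) -> dict:
    """Return a copy of the job config with its build cluster set."""
    migrated = copy.deepcopy(job)
    migrated["cluster"] = cluster
    return migrated


migrated = migrate_to_cluster(job, "k8s-infra-prow-build")
print(migrated["cluster"])
```

In the real config this is a one-line addition to the job's YAML; the copy here only keeps the original dict untouched so the before and after can be compared.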

NOTE: migrating this job is not as straightforward as some of the other #18550 issues, because:

Once the PR has merged, note the date/time it merged. This will allow you to compare before/after behavior.
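One way to make that before/after comparison concrete is to split job runs at the merge timestamp and summarize each side. This is a minimal sketch under stated assumptions: the run records are invented sample data (not real job history), all timestamps are assumed to be in the same timezone, and `split_by_merge` and `summarize` are hypothetical helper names.

```python
from datetime import datetime
from statistics import mean, stdev

# Merge time you noted down (example value; naive datetime, one timezone assumed).
MERGED_AT = datetime(2020, 8, 19, 16, 40)

# Hypothetical (start time, duration in minutes) samples; not real job data.
runs = [
    (datetime(2020, 8, 18, 9, 0), 52.0),
    (datetime(2020, 8, 18, 15, 30), 71.0),
    (datetime(2020, 8, 19, 11, 15), 48.0),
    (datetime(2020, 8, 20, 10, 0), 50.0),
    (datetime(2020, 8, 20, 14, 45), 54.0),
    (datetime(2020, 8, 21, 9, 30), 49.0),
]


def split_by_merge(runs, merged_at):
    """Partition run durations into those started before and after the merge."""
    before = [duration for started, duration in runs if started < merged_at]
    after = [duration for started, duration in runs if started >= merged_at]
    return before, after


def summarize(durations):
    """Sample count, mean, and sample standard deviation (needs >= 2 samples)."""
    return {"n": len(durations), "mean": mean(durations), "stdev": stdev(durations)}


before, after = split_by_merge(runs, MERGED_AT)
print("before:", summarize(before))
print("after: ", summarize(after))
```

The same split works for failure rates or any other per-run metric, as long as each run carries a start time.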

Things to watch for the job

Things to watch for the build cluster

Keep this open for at least 24h of weekday PR traffic. If everything continues to look good, then this can be closed.

/wg k8s-infra /sig testing /area jobs /help

spiffxp commented 4 years ago

/remove-help /assign

spiffxp commented 4 years ago

Tangentially related, it would be nice to know if we even need to use --stage=gs://kubernetes-release-pull (ref https://github.com/kubernetes/test-infra/issues/18789). I already migrated over pull-kubernetes-e2e-gce-ubuntu-containerd, which uses it, so I'll do the same here. But I would then like to remove it if it's not needed, or migrate to the kubernetes.io-owned gs://k8s-release-pull if it is.

spiffxp commented 4 years ago

Opened https://github.com/kubernetes/test-infra/pull/18916

The main branch and 1.19 variants aren't merge-blocking anymore, but earlier branches are. Moving them all over.

spiffxp commented 4 years ago

https://github.com/kubernetes/test-infra/pull/18916 merged 2020-08-19 16:40 PT

https://prow.k8s.io/?job=pull-kubernetes-e2e-gce - shows a reasonable amount of traffic, since there is now a push to get PRs landed in time for the final cut of kubernetes v1.16. The only failures appear to be flakes.

https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-gce&graph-metrics=test-duration-minutes - overall the job duration is less spiky and has maybe gone slightly down over time

https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce%24 - no real change in errors

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce - Seeing https://github.com/kubernetes/test-infra/issues/19034, would like to understand whether this job caused that issue or something else

cpu utilization - big spikes in the beginning for build, then nothing (screenshot, 2020-08-31 5:56 PM)

memory utilization - looks like that's about right (screenshot, 2020-08-31 5:58 PM)

So if it turns out https://github.com/kubernetes/test-infra/issues/19034 is unrelated to this change, we're good. But I need to dig into that a little more first.

RobertKielty commented 4 years ago

@spiffxp I moved this to In Progress. Will have a look at #19034 ...

snowmanstark commented 4 years ago

@RobertKielty @spiffxp I would like to work on this issue

snowmanstark commented 4 years ago

@spiffxp can you help me understand what the following means and how it affects the changes to be made for this issue?

it's being demoted from merge-blocking on release-1.19 and the main branch (as of #18832)

spiffxp commented 4 years ago

@snowmanstark

So, the changes have already been made via https://github.com/kubernetes/test-infra/pull/18916 (see https://github.com/kubernetes/test-infra/issues/18852#issuecomment-676792471)

This is still open because https://github.com/kubernetes/test-infra/issues/19034 is unexplained, and may have started around the same time https://github.com/kubernetes/test-infra/pull/18916 merged. If we can either prove that https://github.com/kubernetes/test-infra/pull/18916 didn't cause it (see https://github.com/kubernetes/test-infra/issues/19034#issuecomment-684130355), or fix https://github.com/kubernetes/test-infra/issues/19034, then this issue can be closed.


To answer your question

https://github.com/kubernetes/test-infra/blob/6e1d254ade45de4d6101578c9498daa045be6f69/config/tests/jobs/jobs_test.go#L983-L989

https://github.com/kubernetes/test-infra/pull/18832 set always_run to false for the main branch (while v1.19 was under development) and for the release-1.19 branch. The job has no run_if_changed either, so it's not considered merge-blocking for those branches.

It is still merge-blocking for older branches (release-1.18, release-1.17), as we generally don't backport policy or test changes to already-released versions of kubernetes except under special circumstances.
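The rule described above can be sketched as a small predicate. This is an illustration, not the check in jobs_test.go: it assumes a job counts as merge-blocking when it triggers automatically (`always_run` is true or `run_if_changed` is set) and is not marked `optional`. The `optional` condition and the per-branch dicts are assumptions added for the example.

```python
def is_merge_blocking(job: dict) -> bool:
    """Assumed rule: the job triggers automatically and is not optional."""
    runs_automatically = job.get("always_run", False) or bool(job.get("run_if_changed"))
    return runs_automatically and not job.get("optional", False)


# Hypothetical per-branch variants of the job, trimmed to the relevant fields.
variants = {
    "main": {"always_run": False},
    "release-1.19": {"always_run": False},
    "release-1.18": {"always_run": True},
    "release-1.17": {"always_run": True},
}

blocking = [branch for branch, job in variants.items() if is_merge_blocking(job)]
print(blocking)  # ['release-1.18', 'release-1.17']
```

Under that assumed rule, only the release-1.18 and release-1.17 variants come out as merge-blocking, which matches the description in this comment.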

The reason this complicates things is the job wouldn't see as much traffic as jobs that always run for all branches, so it's tougher to avoid variance due to a smaller sample-set size, and thus tougher to make a judgement call on "does everything still look OK."
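To put a rough number on the sample-size concern, the standard error of the mean shrinks as the number of runs grows. The sketch below uses invented durations with the same spread at two sample sizes; the values and the `standard_error` helper are hypothetical, and real job durations are unlikely to be this well behaved.

```python
from math import sqrt
from statistics import stdev


def standard_error(durations):
    """Standard error of the mean: sample stdev divided by sqrt(n)."""
    return stdev(durations) / sqrt(len(durations))


# Hypothetical durations in minutes; the larger sample repeats the same spread.
few_runs = [48.0, 52.0, 56.0, 60.0]
many_runs = few_runs * 10

print(round(standard_error(few_runs), 2))
print(round(standard_error(many_runs), 2))
```

With only a handful of runs, a shift of a few minutes in the mean is hard to tell apart from noise; with ten times the runs, the same shift stands out much more clearly.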

However, I saw enough traffic in https://github.com/kubernetes/test-infra/issues/18852#issuecomment-684128619 when cherry picks were being swept through in advance of upcoming patch releases. So aside from the question of https://github.com/kubernetes/test-infra/issues/19034, I think this looks good.

snowmanstark commented 4 years ago

Thanks @spiffxp for that explanation. It makes total sense to me now. I'll look into #19034 too to get this closed.

snowmanstark commented 4 years ago

@spiffxp I looked into #19034 and nothing seems to be off there.

RobertKielty commented 4 years ago

Hi @spiffxp can this issue be closed now?

I've updated #19034

We need to review #19034 for sure, but I'm confused as to how these two issues are related.

spiffxp commented 4 years ago

Per https://github.com/kubernetes/test-infra/issues/18852#issuecomment-692196542 the reason I held this open is because I'm still not certain that migration of this job did not cause https://github.com/kubernetes/test-infra/issues/19034. But we've lived with it unresolved for about 90d now, so I guess we can live with it unexplained for longer.

/close

k8s-ci-robot commented 4 years ago

@spiffxp: Closing this issue.
