kubernetes / sig-release

Repo for SIG release
Apache License 2.0

Scalability tests for beta releases #908

Open alejandrox1 opened 4 years ago

alejandrox1 commented 4 years ago

Current state of affairs: We have the following jobs to gauge the quality of the current release

These run against the latest on the master branch of k/k. These jobs provide critical signal during the release cycle. However, after code freeze, when we reopen the master branch for the next release, we may occasionally cherry-pick multiple commits from master to the release-x.y branch. During this period, between code thaw and the official release-x.y, we occasionally see failures in our master-informing scalability jobs and are unsure whether the changes that caused the failure have been cherry-picked into the release-x.y branch.

The thing I want to bring up for discussion in this issue is the possibility of creating scalability jobs for the beta release (the version of the Kubernetes code from code thaw until the official release). An additional caveat is that besides testing a certain portion of the Kubernetes source code (the contents of the release-X.Y branch from code thaw to release), we may also have to set up the tests to run with the equivalent version of https://github.com/kubernetes/perf-tests (to make sure changes to that repo don't obscure signal from k/k). In short, what do you all think?
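To make this a bit more concrete, here is a very rough sketch of what such a job could look like as a Prow periodic. Everything below is illustrative only: the job name, branch, dashboard, image tag, and entrypoint are placeholders, not a concrete proposal.

```yaml
periodics:
# Hypothetical beta-release scalability job; all names and branches are placeholders.
- name: ci-kubernetes-e2e-gce-scale-performance-beta
  interval: 24h
  decorate: true
  extra_refs:
  # Pin k/k to the release branch under test (the "beta" code).
  - org: kubernetes
    repo: kubernetes
    base_ref: release-1.18
  # Pin perf-tests to the matching branch so changes there don't obscure signal from k/k.
  - org: kubernetes
    repo: perf-tests
    base_ref: release-1.18
  annotations:
    testgrid-dashboards: sig-release-1.18-informing  # placeholder dashboard
  spec:
    containers:
    - image: gcr.io/k8s-testimages/kubekins-e2e:latest-1.18  # placeholder tag
      command:
      - runner.sh
      args:
      - ./run-scale-tests.sh  # placeholder entrypoint, not the real test invocation
```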

Additional resources:

/cc @kubernetes/sig-scalability-feature-requests
/cc @kubernetes/release-team @kubernetes/release-engineering
/sig release
/sig scalability
/priority important-longterm
/milestone v1.18

justaugustus commented 4 years ago

xref 1.15 Retro AIs: https://github.com/kubernetes/sig-release/issues/806

alejandrox1 commented 4 years ago

/cc @wojtek-t @mm4tt would love to hear your thoughts on this proposal

wojtek-t commented 4 years ago

While in general I support it, currently we don't have the resources to run this. We're working on speeding up our tests (among other things, we're trying to merge the existing two tests into a single one: https://github.com/kubernetes/perf-tests/pull/1008). We believe we will be able to speed up our 5k-node tests to take, hopefully, less than 8 hours. Once this is done, we should be able to keep the 1-day frequency but run the job for both master and the k8s-beta release.

But that still requires a non-trivial amount of work on our side to speed up our tests.

alejandrox1 commented 4 years ago

Thank you @wojtek-t for your comment. The work you are doing to speed up the tests would be of great help to us.

One thing that has come up a couple of times (please correct me if I'm wrong) is that Google is running all the scalability jobs, correct? If possible, we would like to (at some point) move these jobs onto CNCF infrastructure.

In order to do this we would need some idea of what it takes to run these tests, for example billing information that could be shared with us and wg-k8s-infra. That way we can ask around and possibly start planning how we can make the move. What do you think?

wojtek-t commented 4 years ago

> One thing that has come up a couple of times (please correct me if I'm wrong) is that Google is running all the scalability jobs, correct? If possible, we would like to (at some point) move these jobs onto CNCF infrastructure.

I think the release-blocking ones have already been moved. But TBH I don't know how to confirm it. They are running in the kubernetes-scale project - do you know how to check whether it was already transferred to CNCF?

mm4tt commented 4 years ago

Sorry for not jumping in earlier, I was OOO. FTR, we run 100-node GCE and 500-node kubemark tests continuously on all active release branches - https://k8s-testgrid.appspot.com/sig-scalability-gce. Recently we've also started running the same tests as presubmits on those branches. They are not as sensitive as the 5k-node tests, but treating them (the 100-node GCE tests) as beta release-blocking tests might be a good intermediate solution (if you don't do that already).
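If it helps, promoting one of those existing branch jobs would mostly be a matter of its testgrid annotations; very roughly something like the fragment below (the dashboards, tab name, and alert address are illustrative placeholders, and the rest of the job definition is omitted):

```yaml
# Illustrative fragment only: the job already exists, so the main change is
# adding the release-blocking dashboard to its testgrid annotations.
annotations:
  testgrid-dashboards: sig-scalability-gce, sig-release-1.18-blocking  # placeholder dashboards
  testgrid-tab-name: gce-cos-1.18-scalability-100                      # placeholder tab name
  testgrid-alert-email: release-alerts@example.com                     # placeholder address
```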

alejandrox1 commented 4 years ago

Sorry for the super delayed response on this: between release team work and checking on the current state of the infra I lost track of time, but a couple of details...

> I think the release-blocking ones have already been moved. But TBH I don't know how to confirm it. They are running in the kubernetes-scale project - do you know how to check whether it was already transferred to CNCF?

Currently, CI runs borrow (at least for GCP-based jobs) projects/credentials from Boskos. I went around asking wg-k8s-infra and it seems that all credentials (projects) are currently owned by Google.

bentheelder: Prow itself is on Google infra. Most of the infra used by Prow is in google.com GCP projects. A little bit is not, including:

  • AWS accounts via CNCF (formerly Google, not for a year or so now)
  • GCB projects for some release automation, which are Google-funded but CNCF-owned

Jobs execute arbitrary code though, so some of them could be using some other infra. Like, I think we might actually still have some jobs running EKS tests :face_with_rolling_eyes: And some of the Windows testing from Azure involves images built out of band by some Azure-owned process we know nothing about :upside_down_face:

alejandrox1: The pool of GCP creds that Boskos uses - are all of those Google-owned accounts?

bentheelder: Yes. Also, fwiw, Boskos only hands out project names; the same credential owns all of them :witnessprotectionparrot:
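For context, my rough mental model of that pool is a static Boskos resource config along these lines (the resource type and project names below are placeholders, not the real entries):

```yaml
# Sketch only: Boskos hands out names from a pool like this, and the single
# service account used by the jobs has access to all of the listed projects.
resources:
- type: scalability-project   # placeholder resource type
  state: free
  names:
  - k8s-scale-project-01      # placeholder project names, currently Google-owned
  - k8s-scale-project-02
```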

So I guess the work involved in moving scalability tests onto CNCF resources would include some tweaking of Boskos, in which case it would fall to us (SIG Scalability and SIG Release) to work with wg-k8s-infra and figure this out. wdyt @justaugustus @mm4tt @wojtek-t? Should we proceed with this and try to figure out a way to use CNCF resources for the 5k scalability job?


> FTR, we run 100-node GCE and 500-node kubemark tests continuously on all active release branches - https://k8s-testgrid.appspot.com/sig-scalability-gce. Recently we've also started running the same tests as presubmits on those branches. They are not as sensitive as the 5k-node tests, but treating them (the 100-node GCE tests) as beta release-blocking tests might be a good intermediate solution (if you don't do that already).

Thank you, Matt, for mentioning these. We, the release team, do consider these jobs release-blocking as well.

mm4tt commented 4 years ago

> So I guess the work involved in moving scalability tests onto CNCF resources would include some tweaking of Boskos, in which case it would fall to us (SIG Scalability and SIG Release) to work with wg-k8s-infra and figure this out. wdyt @justaugustus @mm4tt @wojtek-t? Should we proceed with this and try to figure out a way to use CNCF resources for the 5k scalability job?

Yeah, we should do that. I always thought these tests had already been transferred to CNCF. Let me know what I can do to help with the transfer.

wojtek-t commented 4 years ago

@alejandrox1 - are you really sure we rely on Boskos for the 5k-node tests? We don't use a pool of projects - we explicitly set the project for those (there is exactly one predefined project): https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L29
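For reference, the relevant bit is roughly the following (condensed and illustrative, not the full job definition; kubetest's --gcp-project pins the project directly, whereas pool-based jobs pass --gcp-project-type and let kubetest lease a project from Boskos):

```yaml
args:
- --provider=gce
- --gcp-project=kubernetes-scale  # fixed, predefined project; no Boskos lease involved
```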

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

wojtek-t commented 4 years ago

/remove-lifecycle stale

I would like us to get there, but we won't in the 1.19 timeframe. Hopefully 1.20...

alejandrox1 commented 4 years ago

Coming back to this one (excuse the delay). A couple of things to mention: the 5k scalability job does not need Boskos, as @wojtek-t mentioned.

I think the way forward is to work with wg-k8s-infra. There is an open issue on identifying the infrastructure needed to run scalability jobs on CNCF resources: https://github.com/kubernetes/k8s.io/issues/851. So I guess we can collaborate on that, move the existing scalability job over to CNCF resources, and then work on this one.

/milestone v1.20

mm4tt commented 4 years ago

Sounds good, let me know if there is anything I can help with.

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

wojtek-t commented 3 years ago

/remove-lifecycle stale

LappleApple commented 3 years ago

Heya @alejandrox1, are you still working on this?

justaugustus commented 3 years ago

/assign @jeremyrickard
/unassign @alejandrox1
/milestone v1.22

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

wojtek-t commented 3 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

wojtek-t commented 2 years ago

/remove-lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

wojtek-t commented 2 years ago

/remove-lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

wojtek-t commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

wojtek-t commented 2 years ago

/remove-lifecycle stale
/kind bug
/triage accepted

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The issue has been marked as an important bug and triaged. Such issues are automatically marked as frozen when hitting the rotten state to avoid missing important bugs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle frozen

jeremyrickard commented 2 years ago

Picking this up for v1.26

/milestone v1.26

k8s-triage-robot commented 8 months ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

wojtek-t commented 8 months ago

/triage accepted