kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

[WIP] Add dashboard and perf test for Azure under SIG Scalability #32850

Closed Jont828 closed 3 months ago

Jont828 commented 3 months ago

/hold

k8s-ci-robot commented 3 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Jont828

Once this PR has been reviewed and has the lgtm label, please assign mpherman2 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubernetes/test-infra/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

k8s-ci-robot commented 3 months ago

@Jont828: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-test-infra-unit-test | c60a8043753dd2470e4cf46c5197846959e9a897 | link | true | `/test pull-test-infra-unit-test` |
| pull-test-infra-unit-test-race-detector-nonblocking | c60a8043753dd2470e4cf46c5197846959e9a897 | link | false | `/test pull-test-infra-unit-test-race-detector-nonblocking` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
jackfrancis commented 3 months ago

@BenTheElder thanks for jumping in

@mboersma and @Jont828 introduced this effort at a high level to SIG Scalability last week (ref: https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit#bookmark=kix.70akioyk9t2h). Agenda notes suggest that maybe the community infra aspect wasn't discussed. Sounds like we should keep SIG Testing in the loop on that and any ideas for future progress here? Happy to do so.

BenTheElder commented 3 months ago

> Sounds like we should keep SIG Testing in the loop on that and any ideas for future progress here? Happy to do so.

More like SIG K8s Infra, and you don't have to loop us in, but ... at the moment this just won't work as-written, since the build cluster selected doesn't have Azure access yet; @jsturtevant is working with @upodroid on that part. You could switch to the google.com "default" cluster but ... we're phasing that out.

The part that may need discussing with k8s infra / Azure: it's not clear to me if this will fit. I don't know if the new budget has been disclosed yet (AFAIK there will not be an announcement from Microsoft; I think privately we have some idea), but at some point we'll be tracking it under these updates: https://kubernetes.slack.com/archives/CCK68P2Q2/p1719280357681539, where we will be posting the usage and remaining resources on the community account.

... We'll want to make sure we can migrate all of the existing Azure jobs or risk dropping them (https://groups.google.com/a/kubernetes.io/g/dev/c/p6PAML90ZOU).

So I'd personally suggest holding off making a large new Azure job until the effort to migrate the existing usage to community infra is complete.

We don't plan to defer migrating Prow any longer: on August 1 we will stop running unmigrated jobs. We've been attempting to migrate Prow for years and have it ~almost there, and the engprod team running it will only be supporting it through EOY, so ... we kinda need to wrap this up, and we're aiming to not port forward any dependency on any other internal resources into the community clusters. In the past / present this has included AWS, GCP, DO, Azure, vSphere, ... but it's almost done, and Azure in particular is being actively worked on.

I don't want to block anyone from making progress, but we're in a kind of awkward spot here and I think it probably makes sense to hold off for a little bit and/or see if you can help resolve the migration issues first.

BenTheElder commented 3 months ago

... reached out to clarify the budget, though the project also currently has zero visibility into the spend on the current Microsoft internal account available in the original google.com "default" CI cluster, so even once we know the budget we will have no idea how much capacity remains until everything switches over.

jackfrancis commented 3 months ago

> So I'd personally suggest holding off making a large new Azure job until the effort to migrate the existing usage to community infra is complete.

This sounds reasonable to me!

We can make progress on our end w/ clusterloader2 as we ramp up towards having public testing on the new Azure infra.

Jont828 commented 3 months ago

@BenTheElder Thanks for the feedback. I brought it up at the SIG Scalability meeting and didn't realize there would be further considerations w.r.t. community infra. I'll go ahead and close the PR and get some more info about public testing on Azure infra.