Concurrently reconcile CloudStackMachine resources

chrisdoherty4 commented 1 year ago

AWS analyzed CAPC in high-node-count contexts and found that it takes considerable time to scale clusters. Part of the issue stems from CloudStackMachine resources being reconciled serially. This change enables concurrent reconciliation of CloudStackMachine resources, improving efficiency and preventing other parts of the system from reacting to slowness.

I have tested these changes by scaling a machine deployment up and down between 1 and 11 nodes. Scale-ups took comparable time (55s) to serial reconciliation, which is expected as most of the time is consumed by VM provisioning. Scale-down improved by about 77%, from 1m57s to 27s.

Related #274
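
As a rough illustration of the serial versus concurrent reconciliation described above, here is a minimal, standard-library-only sketch. It is not the code from this PR: `reconcileAll`, `reconcileFunc`, and the machine names are hypothetical stand-ins, and the sleep only simulates a slow CloudStack API call.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// reconcileFunc is a hypothetical stand-in for one CloudStackMachine
// reconcile call; the real reconciler talks to the CloudStack API.
type reconcileFunc func(name string)

// reconcileAll drains names using the requested number of workers and
// returns the highest number of reconciles observed in flight at once.
func reconcileAll(names []string, workers int, reconcile reconcileFunc) int32 {
	queue := make(chan string)
	var inFlight, peak int32
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range queue {
				n := atomic.AddInt32(&inFlight, 1)
				// Raise the recorded peak if this worker pushed it higher.
				for {
					p := atomic.LoadInt32(&peak)
					if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
						break
					}
				}
				reconcile(name)
				atomic.AddInt32(&inFlight, -1)
			}
		}()
	}

	for _, name := range names {
		queue <- name
	}
	close(queue)
	wg.Wait()
	return peak
}

func main() {
	names := make([]string, 10)
	for i := range names {
		names[i] = fmt.Sprintf("machine-%d", i)
	}
	// Simulated slow reconcile (assumption: 50ms per machine).
	slow := func(string) { time.Sleep(50 * time.Millisecond) }

	for _, workers := range []int{1, 5} {
		start := time.Now()
		peak := reconcileAll(names, workers, slow)
		fmt.Printf("workers=%d peak=%d elapsed=%s\n",
			workers, peak, time.Since(start).Round(10*time.Millisecond))
	}
}
```

With one worker the items are handled strictly one after another; with several workers the wall-clock time drops roughly in proportion when each reconcile is dominated by waiting. In controller-runtime the equivalent knob is the `MaxConcurrentReconciles` field of `controller.Options`; whether this PR sets exactly that field is not stated in the thread.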

k8s-ci-robot commented 1 year ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

k8s-ci-robot commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrisdoherty4

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubernetes-sigs/cluster-api-provider-cloudstack/blob/main/OWNERS)~~ [chrisdoherty4]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

netlify[bot] commented 1 year ago

Deploy Preview for kubernetes-sigs-cluster-api-cloudstack ready!

Name	Link
Latest commit	9f73daea8c148f6067e8144fe330bff83c835f43
Latest deploy log	https://app.netlify.com/sites/kubernetes-sigs-cluster-api-cloudstack/deploys/64b1b1fa22dde80007b9dc8b
Deploy Preview	https://deploy-preview-290--kubernetes-sigs-cluster-api-cloudstack.netlify.app

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage has no change and project coverage change: -0.05 :warning:

Comparison is base (4ccf853) 25.29% compared to head (9f73dae) 25.25%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #290      +/-   ##
==========================================
- Coverage   25.29%   25.25%   -0.05%
==========================================
  Files          59       59
  Lines        5585     5582       -3
==========================================
- Hits         1413     1410       -3
  Misses       4035     4035
  Partials      137      137
```

| [Impacted Files](https://app.codecov.io/gh/kubernetes-sigs/cluster-api-provider-cloudstack/pull/290?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=kubernetes-sigs) | Coverage Δ | |
|---|---|---|
| [controllers/cloudstackmachine\_controller.go](https://app.codecov.io/gh/kubernetes-sigs/cluster-api-provider-cloudstack/pull/290?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=kubernetes-sigs#diff-Y29udHJvbGxlcnMvY2xvdWRzdGFja21hY2hpbmVfY29udHJvbGxlci5nbw==) | `54.85% <ø> (ø)` | |
| [pkg/cloud/instance.go](https://app.codecov.io/gh/kubernetes-sigs/cluster-api-provider-cloudstack/pull/290?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=kubernetes-sigs#diff-cGtnL2Nsb3VkL2luc3RhbmNlLmdv) | `82.38% <ø> (-0.16%)` | :arrow_down: |


chrisdoherty4 commented 1 year ago

/run-e2e -c 4.18

g-gaston commented 1 year ago

/lgtm

g-gaston commented 1 year ago

/hold

chrisdoherty4 commented 1 year ago

/hold

chrisdoherty4 commented 1 year ago

The E2E don't seem to be getting kicked off?

/assign @vishesh92 @weizhouapache

k8s-ci-robot commented 1 year ago

@chrisdoherty4: GitHub didn't allow me to assign the following users: vishesh92.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/kubernetes-sigs/cluster-api-provider-cloudstack/pull/290#issuecomment-1636454513):

>The E2E don't seem to be getting kicked off?
>
>/assign @vishesh92 @weizhouapache

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

rohityadavcloud commented 1 year ago

@chrisdoherty4 it's possible the backend BO script is hitting GitHub rate limits (we recently found that the script/Jenkins jobs run but can't post results, because for some reason the GitHub rate limit voids the API request that posts the comments)

/run-e2e help

blueorangutan commented 1 year ago

@rohityadavcloud The command to run e2e test for CAPC.

Usage: /run-e2e [-k Kubernetes_Version] [-c CloudStack_Version] [-h Hypervisor] [-i Template/Image] [-f Kubernetes_Version_Upgrade_From] [-t Kubernetes_Version_Upgrade_To]

Supported Kubernetes versions are: ['1.27.2', '1.26.5', '1.25.10', '1.24.14', '1.23.3', '1.22.6']. The default value is '1.27.2'.
Supported CloudStack versions are: ['4.18', '4.17', '4.16']. If it is not set, an existing environment will be used.
Supported hypervisors are: ['kvm', 'vmware', 'xen']. The default value is 'kvm'.
Supported templates are: ['ubuntu-2004-kube', 'rockylinux-8-kube']. The default value is 'ubuntu-2004-kube'.
By default it tests Kubernetes upgrade from version '1.26.5' to '1.27.2'.

Examples:

/run-e2e
/run-e2e -k 1.27.2 -h kvm -i ubuntu-2004-kube
/run-e2e -k 1.27.2 -c 4.18 -h kvm -i ubuntu-2004-kube -f 1.26.5 -t 1.27.2
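
To make the usage text above concrete, here is a small sketch of how such a comment command could be parsed with Go's standard `flag` package. It is purely illustrative and is not the bot's actual implementation: `parseRunE2E` and `e2eParams` are hypothetical names, and the defaults are simply copied from the help message (an empty CloudStack version stands for "use an existing environment").

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strings"
)

// e2eParams mirrors the options listed in the usage text (hypothetical type).
type e2eParams struct {
	K8sVersion        string
	CloudStackVersion string // empty means "use an existing environment"
	Hypervisor        string
	Template          string
	UpgradeFrom       string
	UpgradeTo         string
}

// parseRunE2E parses a "/run-e2e ..." comment into e2eParams, applying the
// defaults documented in the help message.
func parseRunE2E(comment string) (e2eParams, error) {
	fields := strings.Fields(comment)
	if len(fields) == 0 || fields[0] != "/run-e2e" {
		return e2eParams{}, fmt.Errorf("not a /run-e2e command: %q", comment)
	}

	var p e2eParams
	fs := flag.NewFlagSet("run-e2e", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr
	fs.StringVar(&p.K8sVersion, "k", "1.27.2", "Kubernetes version")
	fs.StringVar(&p.CloudStackVersion, "c", "", "CloudStack version")
	fs.StringVar(&p.Hypervisor, "h", "kvm", "hypervisor")
	fs.StringVar(&p.Template, "i", "ubuntu-2004-kube", "template/image")
	fs.StringVar(&p.UpgradeFrom, "f", "1.26.5", "Kubernetes version to upgrade from")
	fs.StringVar(&p.UpgradeTo, "t", "1.27.2", "Kubernetes version to upgrade to")

	if err := fs.Parse(fields[1:]); err != nil {
		return e2eParams{}, err
	}
	return p, nil
}

func main() {
	p, err := parseRunE2E("/run-e2e -c 4.18")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", p)
}
```

Because `-h` is registered explicitly, the `flag` package treats it as the hypervisor option rather than as a help request. Validation against the supported-version lists is left out of this sketch.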

rohityadavcloud commented 1 year ago

/run-e2e -c 4.18

blueorangutan commented 1 year ago

@rohityadavcloud a Jenkins job has been kicked off to run tests with the following parameters:

kubernetes version: 1.27.2
CloudStack version: 4.18
hypervisor: kvm
template: ubuntu-2004-kube
Kubernetes upgrade from: 1.26.5 to 1.27.2

blueorangutan commented 1 year ago

Test Results : (tid-272)
Environment: kvm Rocky8(x3), Advanced Networking with Management Server Rocky8
Kubernetes Version: v1.27.2
Kubernetes Version upgrade from: v1.26.5
Kubernetes Version upgrade to: v1.27.2
CloudStack Version: 4.18
Template: ubuntu-2004-kube
E2E Test Run Logs: https://github.com/blueorangutan/capc-prs/releases/download/capc-pr-ci-cd/capc-e2e-artifacts-pr290-sl-272.zip

[PASS] When testing node drain timeout A node should be forcefully removed if it cannot be drained in time
[PASS] with two clusters should successfully add and remove a second cluster without breaking the first cluster
[PASS] When testing subdomain Should create a cluster in a subdomain
[PASS] When testing app deployment to the workload cluster with network interruption [ToxiProxy] Should be able to create a cluster despite a network interruption during that process
[PASS] When testing affinity group Should have host affinity group when affinity is anti
[PASS] When testing machine remediation Should replace a machine when it is destroyed
[PASS] When testing Kubernetes version upgrades Should successfully upgrade kubernetes versions when there is a change in relevant fields
[PASS] When testing K8S conformance [Conformance] Should create a workload cluster and run kubetest
[PASS] When testing MachineDeployment rolling upgrades Should successfully upgrade Machines upon changes in relevant MachineDeployment fields
[PASS] When testing with custom disk offering Should successfully create a cluster with a custom disk offering
[PASS] When testing multiple CPs in a shared network with kubevip Should successfully create a cluster with multiple CPs in a shared network
[PASS] When testing with disk offering Should successfully create a cluster with disk offering
[PASS] When the specified resource does not exist Should fail due to the specified account is not found [TC4a]
[PASS] When the specified resource does not exist Should fail due to the specified domain is not found [TC4b]
[PASS] When the specified resource does not exist Should fail due to the specified control plane offering is not found [TC7]
[PASS] When the specified resource does not exist Should fail due to the specified template is not found [TC6]
[PASS] When the specified resource does not exist Should fail due to the specified zone is not found [TC3]
[PASS] When the specified resource does not exist Should fail due to the specified disk offering is not found
[PASS] When the specified resource does not exist Should fail due to the compute resources are not sufficient for the specified offering [TC8]
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is not customized but the disk size is specified
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is customized but the disk size is not specified
[PASS] When the specified resource does not exist Should fail due to the public IP can not be found
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade worker machine due to insufficient compute resources
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade control plane machine due to insufficient compute resources
[PASS] When testing horizontal scale out/in [TC17][TC18][TC20][TC21] Should successfully scale machine replicas up and down horizontally
[PASS] When testing app deployment to the workload cluster with slow network [ToxiProxy] Should be able to download an HTML from the app deployed to the workload cluster

Summarizing 3 Failures:

[Fail] When testing affinity group [It] Should have host affinity group when affinity is pro 
/jenkins/workspace/capc-e2e-new/test/e2e/common.go:331

[Fail] When testing resource cleanup [AfterEach] Should create a new network when the specified network does not exist 
/jenkins/workspace/capc-e2e-new/test/e2e/resource_cleanup.go:101

[Fail] When testing app deployment to the workload cluster [TC1][PR-Blocking] [It] Should be able to download an HTML from the app deployed to the workload cluster 
/jenkins/workspace/capc-e2e-new/test/e2e/deploy_app.go:111

Ran 28 of 29 Specs in 9307.217 seconds
FAIL! -- 25 Passed | 3 Failed | 0 Pending | 1 Skipped
--- FAIL: TestE2E (9307.23s)
FAIL

chrisdoherty4 commented 1 year ago

/run-e2e -c 4.18

blueorangutan commented 1 year ago

@chrisdoherty4 a Jenkins job has been kicked off to run tests with the following parameters:

kubernetes version: 1.27.2
CloudStack version: 4.18
hypervisor: kvm
template: ubuntu-2004-kube
Kubernetes upgrade from: 1.26.5 to 1.27.2

chrisdoherty4 commented 1 year ago

/uncc @davidjumani

blueorangutan commented 1 year ago

Test Results : (tid-273)
Environment: kvm Rocky8(x3), Advanced Networking with Management Server Rocky8
Kubernetes Version: v1.27.2
Kubernetes Version upgrade from: v1.26.5
Kubernetes Version upgrade to: v1.27.2
CloudStack Version: 4.18
Template: ubuntu-2004-kube
E2E Test Run Logs: https://github.com/blueorangutan/capc-prs/releases/download/capc-pr-ci-cd/capc-e2e-artifacts-pr290-sl-273.zip

[PASS] When testing with disk offering Should successfully create a cluster with disk offering
[PASS] When testing app deployment to the workload cluster [TC1][PR-Blocking] Should be able to download an HTML from the app deployed to the workload cluster
[PASS] When testing with custom disk offering Should successfully create a cluster with a custom disk offering
[PASS] When testing horizontal scale out/in [TC17][TC18][TC20][TC21] Should successfully scale machine replicas up and down horizontally
[PASS] with two clusters should successfully add and remove a second cluster without breaking the first cluster
[PASS] When testing app deployment to the workload cluster with network interruption [ToxiProxy] Should be able to create a cluster despite a network interruption during that process
[PASS] When testing K8S conformance [Conformance] Should create a workload cluster and run kubetest
[PASS] When testing multiple CPs in a shared network with kubevip Should successfully create a cluster with multiple CPs in a shared network
[PASS] When testing machine remediation Should replace a machine when it is destroyed
[PASS] When testing subdomain Should create a cluster in a subdomain
[PASS] When the specified resource does not exist Should fail due to the specified account is not found [TC4a]
[PASS] When the specified resource does not exist Should fail due to the specified domain is not found [TC4b]
[PASS] When the specified resource does not exist Should fail due to the specified control plane offering is not found [TC7]
[PASS] When the specified resource does not exist Should fail due to the specified template is not found [TC6]
[PASS] When the specified resource does not exist Should fail due to the specified zone is not found [TC3]
[PASS] When the specified resource does not exist Should fail due to the specified disk offering is not found
[PASS] When the specified resource does not exist Should fail due to the compute resources are not sufficient for the specified offering [TC8]
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is not customized but the disk size is specified
[PASS] When the specified resource does not exist Should fail due to the specified disk offer is customized but the disk size is not specified
[PASS] When the specified resource does not exist Should fail due to the public IP can not be found
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade worker machine due to insufficient compute resources
[PASS] When the specified resource does not exist When starting with a healthy cluster Should fail to upgrade control plane machine due to insufficient compute resources
[PASS] When testing affinity group Should have host affinity group when affinity is anti
[PASS] When testing resource cleanup Should create a new network when the specified network does not exist
[PASS] When testing node drain timeout A node should be forcefully removed if it cannot be drained in time
[PASS] When testing Kubernetes version upgrades Should successfully upgrade kubernetes versions when there is a change in relevant fields
[PASS] When testing MachineDeployment rolling upgrades Should successfully upgrade Machines upon changes in relevant MachineDeployment fields
[PASS] When testing app deployment to the workload cluster with slow network [ToxiProxy] Should be able to download an HTML from the app deployed to the workload cluster

Summarizing 1 Failure:

[Fail] When testing affinity group [It] Should have host affinity group when affinity is pro 
/jenkins/workspace/capc-e2e-new/test/e2e/common.go:331

Ran 28 of 29 Specs in 8523.486 seconds
FAIL! -- 27 Passed | 1 Failed | 0 Pending | 1 Skipped
--- FAIL: TestE2E (8523.49s)
FAIL

chrisdoherty4 commented 1 year ago

The failing affinity E2E test also fails on main, so it is not a regression introduced by this change.

chrisdoherty4 commented 1 year ago

/unhold
