kubernetes-sigs / cloud-provider-azure

Cloud provider for Azure
https://cloud-provider-azure.sigs.k8s.io/
Apache License 2.0
256 stars 269 forks source link

[release-1.27 | release-1.28] fix: enable the `cloud-node-controller` #6498

Closed aneesh1 closed 3 days ago

aneesh1 commented 3 days ago

What type of PR is this?

/kind bug

What this PR does / why we need it:

Overview

In v1.26, the cloud-node-controller was running its reconciliation logic in the cloud-controller-manager, but in v1.27 and v1.28 the it wasn't. When running v1.26 of the cloud-controller-manager in our cluster, we observed that the spec.ProviderID field of nodes was populated, but not when running v1.27 OR v1.28. In addition, we observed NO Update calls by the cloud-controller-manager in v1.27 and v1.28.

We realized that in v1.26, the node-controller was configured to run by default on release-1.26: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/47d04bebceda815d88821badf318ba6bd061e9f5/vendor/k8s.io/cloud-provider/controllers/node/node_controller.go#L181

but in v1.27: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/95a67a1597c85b1565855320c3a8603ece2677ce/vendor/k8s.io/cloud-provider/controllers/node/node_controller.go#L184-L188

and v1.28: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/492d2fb0c241cd0877d9f70c4ef97462f274e252/vendor/k8s.io/cloud-provider/controllers/node/node_controller.go#L188-L192

The node-controller would only run if the workerCount variable was configured.

The workerCount variable gets configured in the NewCloudNodeController function in release-1.27:

https://github.com/kubernetes-sigs/cloud-provider-azure/blob/95a67a1597c85b1565855320c3a8603ece2677ce/vendor/k8s.io/cloud-provider/controllers/node/node_controller.go#L113C6-L118

which is supposed to get configured via the command line argument --concurrent-node-syncs CLI flag: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/be9e1821ee4fec885264eb256192a2bc419efbe4/cmd/cloud-controller-manager/app/core.go#L54

In release-1.27 and release-1.28, this CLI flag is not provided: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/release-1.27/cmd/cloud-controller-manager/app/options/options.go#L134-L155

Justification

The cloud-node-controller configuration was removed from v1.27 and v1.28 of the cloud-controller-manager, thereby preventing the controller from being run. It seems like node-controller should be run as part of the cloud-node-manager instead of cloud-controller-manager. Based on this issue, we understand that the reason behind this change is due to throttling by the cloud-provider, but in v1.26 and prior to the cloud-controller-manager, we did not encounter issues with cloud provider limits on Azure when the node-controller ran in the kube-controller-manager. In addition, if a user does not want to run the node-controller as part of the cloud-controller-manager they can simply exclude it from the list of controllers by doing -cloud-node-controller and can deploy the cloud-node-manager.

Which issue(s) this PR fixes:

Fixes https://github.com/kubernetes-sigs/cloud-provider-azure/issues/1700

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Runs the cloud-node-controller if it is included as part of the list of controllers in the cloud-controller-manager.
linux-foundation-easycla[bot] commented 3 days ago

CLA Signed


The committers listed above are authorized under a signed CLA.

k8s-ci-robot commented 3 days ago

Welcome @aneesh1!

It looks like this is your first PR to kubernetes-sigs/cloud-provider-azure 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cloud-provider-azure has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot commented 3 days ago

Hi @aneesh1. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
aneesh1 commented 3 days ago

/approve /lgtm

k8s-ci-robot commented 3 days ago

@aneesh1: you cannot LGTM your own PR.

In response to [this](https://github.com/kubernetes-sigs/cloud-provider-azure/pull/6498#issuecomment-2197419562): >/approve >/lgtm Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
k8s-ci-robot commented 3 days ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aneesh1 Once this PR has been reviewed and has the lgtm label, please assign lzhecheng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[OWNERS](https://github.com/kubernetes-sigs/cloud-provider-azure/blob/master/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
nilo19 commented 3 days ago

This is where the cloud-node-manager starts running, please check. https://github.com/kubernetes-sigs/cloud-provider-azure/blob/master/pkg/nodemanager/nodemanager.go#L190

By the way, the cloud-node-manager runs as a daemonset separated from cloud-controller-manager since very long ago, not from v1.26.