kubernetes-sigs / kernel-module-management

The kernel module management operator builds, signs and loads kernel modules in Kubernetes clusters.
https://kmm.sigs.k8s.io/
Apache License 2.0

Canary Deployment Support #116

Closed: uMartinXu closed this issue 1 year ago

uMartinXu commented 2 years ago

Issue Summary:

Canary deployment should be supported so that users can run tests to verify that an out-of-tree (OOT) driver works properly on the cluster, and to prevent the cluster from being impacted by severe issues introduced by a new OOT driver module.

Suggested Priority (P1-P3) & Urgency (Urgent, Medium, Low):

P2 & Medium

Issue Detail:

It is quite possible that the kernel on a worker node can be affected by issues introduced by an OOT driver module. If KMMO deploys the driver container image to all worker nodes in the cluster without careful testing in the user's cluster environment, it can cause serious problems for the cluster. KMMO can reduce this risk by supporting canary deployment: the user specifies a limited number of nodes (or some dedicated nodes) on which to deploy the driver container image, then runs canary tests (or real workloads) on those nodes to verify the stability and overall quality of the driver. Only once the user is fully confident is the driver container image deployed to the remaining nodes in the cluster.

Solution Proposal

The user can specify the number of nodes on which to deploy the driver and let KMMO pick the nodes, or the user can specify dedicated nodes for KMMO to deploy the driver container image to. The user can also enable or disable canary deployment.
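For illustration only, a rough sketch of what such knobs could look like on the Module CRD. None of these fields exist today: `canaryDeployment`, `enabled`, `maxNodes`, and the `example.com/kmm-canary` label are purely hypothetical names, and the rest of the spec is omitted.

```yaml
apiVersion: kmm.sigs.k8s.io/v1beta1     # current KMM API group/version; may differ
kind: Module
metadata:
  name: my-oot-driver                   # placeholder Module name
spec:
  canaryDeployment:                     # hypothetical: enable/disable canary mode
    enabled: true
    maxNodes: 3                         # hypothetical: let KMMO pick up to 3 nodes
    nodeSelector:                       # hypothetical: or pin to dedicated canary nodes
      example.com/kmm-canary: "true"
  # ... rest of the Module spec (moduleLoader, selector, etc.)
```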

yevgeny-shnaidman commented 2 years ago

It seems that canary mode can be implemented simply by labeling "canary" nodes and adding that label to the Selector field of the Module. Will that be a sufficient solution?
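For illustration, a minimal sketch of that approach, assuming the `kmm.sigs.k8s.io/v1beta1` Module API; the `example.com/kmm-canary` label, the module name, and the image are placeholders, and field names may differ between KMM releases.

```yaml
# Canary nodes would first be labeled, e.g.:
#   kubectl label node <node-name> example.com/kmm-canary=true
apiVersion: kmm.sigs.k8s.io/v1beta1
kind: Module
metadata:
  name: my-oot-driver
spec:
  moduleLoader:
    container:
      modprobe:
        moduleName: my_oot_driver
      kernelMappings:
        - regexp: '^.+$'
          containerImage: quay.io/example/my-oot-driver:${KERNEL_FULL_VERSION}
  selector:
    example.com/kmm-canary: "true"   # the driver is only deployed to labeled nodes
```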

GerrySeidman commented 2 years ago

The approach I am using with SRO is just labeling nodes and deploying multiple SpecialResources, one per corresponding group of selected nodes.

I was assuming that the same technique will work fine with KMMO.

uMartinXu commented 2 years ago

> It seems that canary mode can be implemented simply by labeling "canary" nodes and adding that label to the Selector field of the Module. Will that be a sufficient solution?

Yes, that is the basic logic. KMMO could label the proper nodes according to the user's configuration and the status of each node, for example selecting nodes that have no relevant workload running on the old version of the driver. Of course, the user can also label a group of nodes for KMMO to use for the canary deployment, as @GerrySeidman mentioned above. In that case, KMMO just needs to provide a primitive in the CRD along the lines of `canaryDeploymentNodes: DriverACanaryDeploymentLabel` (sketched below).
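For illustration, the primitive described above might look roughly like the fragment below; `canaryDeploymentNodes` is purely hypothetical and is not part of the Module CRD today.

```yaml
spec:
  # hypothetical field: a node label identifying the canary nodes for this driver
  canaryDeploymentNodes: DriverACanaryDeploymentLabel
```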

yevgeny-shnaidman commented 2 years ago

@uMartinXu what I meant is that the customer can do the node labeling and set the Selector field themselves, without any need for KMMO to do anything (so no need to modify the CRD or the KMMO code). Also, letting KMMO decide which nodes to use for the canary is not a good option: that decision should be made by the customer alone, because only they have all the data needed to make it.
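For illustration, a sketch of that customer-driven flow using only the existing selector field and placeholder labels; the promotion step is just an edit to the Module (or additional node labeling), with no KMMO change involved.

```yaml
# Phase 1 -- canary: scope the Module to a few explicitly labeled nodes.
spec:
  selector:
    example.com/kmm-canary: "true"
---
# Phase 2 -- promotion: once the driver is validated, widen the selector
# (placeholder label below) or simply apply the canary label to the remaining nodes.
spec:
  selector:
    node-role.kubernetes.io/worker: ""
```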

uMartinXu commented 1 year ago

@yevgeny-shnaidman using the Module CRD should be a good idea for canary deployment. And you are right, it is not easy for KMMO to decide which nodes to use for the canary deployment. Let's give it a try. BTW, will the selector also influence build node selection? Does the driver build process need to run on the same selected nodes as the deployment of the driver container?

yevgeny-shnaidman commented 1 year ago

@uMartinXu we don't really care where the build job is scheduled. It is containerized, so node preference is not important.

qbarrand commented 1 year ago

We use the Module's main node selector for build placement: https://github.com/kubernetes-sigs/kernel-module-management/blob/5a772aa3496468ba9eaf1de4f28772d6902cdded/internal/build/job/maker.go#L114

This was implemented to solve issues in multi-arch clusters, where we saw ARM nodes trying to build x86 kmods or vice versa. If that creates issues, I think we could have a dedicated node selector for builds. @uMartinXu if that's of interest to you, please open an issue 🙂
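For illustration, a sketch of how the main selector helps in that multi-arch scenario; `kubernetes.io/arch` is a standard well-known node label, and the rest of the spec is omitted. Because the build job inherits this selector, x86 kmods are not built on ARM nodes (and vice versa).

```yaml
spec:
  selector:
    kubernetes.io/arch: amd64   # build jobs and the module loader stay on x86_64 nodes
  # ... rest of the Module spec
```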

chaitanya1731 commented 1 year ago

@qbarrand we are quite interested in this feature and will file an issue soon.

chaitanya1731 commented 1 year ago

Please refer to Separate node selector for build process #140

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 1 year ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes-sigs/kernel-module-management/issues/116#issuecomment-1507352960):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.