Open a7i opened 1 month ago
Created a PR to get feedback: https://github.com/karmada-io/karmada/pull/4689
Related PR: https://github.com/karmada-io/karmada/pull/1859
User Story 1: As a cluster admin, we may get stuck in a reconciliation loop where karmada controller-manager will update a resource but due to some unknown reason (perhaps a controller on the member cluster) the resource is reverted. The operator can decide to suspend work while debugging the issue.
+1 on this user story. This feature would be helpful for debugging.
User Story 2: We're using Karmada to migrate workload from one blue cluster to a green cluster. Once we move a workload, we want to suspend any updates from blue to green until we cut over to green.
May I ask for more detailed info about how you do the migration? I don't understand how can the workload be synced from blue to green, I understand both the blue and green clusters are taking karmada-apiserver as the source.
Just a guess, the process of migrating a workload would be something like:
Step 1: Create a PropagationPolicy to take over an application from the blue cluster, like:
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: foo
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - blue
Step 2: Add the green cluster to the same PropagationPolicy, so that the application can be synced:
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: foo
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - blue
        - green
Step 3: Testing against the green cluster.
Step 4: Remove the application from the blue cluster by removing blue from the PropagationPolicy:
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: foo
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - green
- Expose suspend on Work CRD This is the simplest approach but requires the Cluster Admin to identify Work in the karmada-es-${cluster} namespace before patching them. Once the field is set, the controller will get out early. In the code here https://github.com/karmada-io/karmada/blob/master/pkg/controllers/status/work_status_controller.go#L93-L96, we would add the following:
I totally agree that Work deserves to have a field that indicates whether the propagation should be suspended.
Actually we already have an annotation (propagation.karmada.io/instruction:suppressed) for the same purpose:
https://github.com/karmada-io/karmada/blob/9ccc8be46c135005289367b8867f4b2c82ca44c0/pkg/util/constants.go#L41-L47
But this annotation is used internally by Karmada itself, in the scenario where a Work is not managed by a ResourceBinding.
But in your case, even after Work introduces .spec.suspend, your modification might be overridden by the ResourceBinding controller, see
https://github.com/karmada-io/karmada/blob/master/pkg/util/helper/work.go#L81.
But don't worry, there are also two approaches to eliminate the risk:
I tend to introduce the functionality to ResourceBinding or PropagationPolicy, looking forward to hearing your use case.
Hi @a7i, we are going to push this feature forward soon. Could you respond to the questions from @RainbowMango above?
May I ask for more detailed info about how you do the migration? I don't understand how can the workload be synced from blue to green, I understand both the blue and green clusters are taking karmada-apiserver as the source.
While we have long-term plans to use karmada apiserver for handling our multi-cluster / blue-green use cases, we currently reuse our kube-apiserver (on the blue cluster). We essentially bootstrap karmada controller-manager, webhook, scheduler, aggregated-api onto an existing cluster (which reuses the kube-apiserver). We then use policies to migrate workloads gradually to another cluster. I realize that the majority of karmada use cases do not follow this pattern.
Actually we already have an annotation(propagation.karmada.io/instruction:suppressed) for the same purpose:
Ideally we either deprecate this annotation in favor of using the same .spec.Suspend, or consider using this annotation for consistency.
Mutate each field when dealing with the conflict (see what we have done at the detector). Expose suspend on ResourceBinding CRD. I tend to introduce the functionality to ResourceBinding or PropagationPolicy, looking forward to hearing your use case.
I agree with your recommendation. Let me pick this back up and have a PR ready this week.
@XiShanYongYe-Chang I have updated the PR to reflect the requested changes. I look forward to your feedback
We then use policies to migrate workloads gradually to another cluster.
Can you elaborate on this step a little bit?
Can you elaborate on this step a little bit?
Given that we have two clusters:
Let's take a simple workload (Deployment with 3 replicas and Service) as an example.
MultiClusterService
placement.clusterAffinity.clusterNames[0]
)
-- observe that it created a Deployment in cluster-green with zero replicas
-- observe that the original Deployment in cluster-blue is still 3 replicas
Thanks for the detailed explanation.
Do you need to pause work at each interval between steps 3 and 9? If so, what condition is the paused work waiting for?
No, given that we still rely on karmada-controller-manager for handling propagation syncs. After step 9, however, it needs to be suspended.
I think this feature request is generic enough to not have to make our use-case the only use-case. I believe that "User Story 1" from the Issue description is a valid use-case and as a cluster-admin, I want to have this capability at my disposal.
Thanks @a7i.
I wonder if Karmada can do more to give users out-of-the-box capabilities, such as canary release, in addition to providing APIs for suspending work propagation (which can be considered an atomic capability).
I may be wrong, please correct me if so.
Hi @a7i I would like to know, do you use argo-cd for canary release? I was wondering if we could do some combination with argo-cd by suspending work.
What would you like to be added: Ability to suspend work to ensure that changes are not being reconciled.
Why is this needed:
User Story 1: As a cluster admin, we may get stuck in a reconciliation loop where karmada controller-manager will update a resource but due to some unknown reason (perhaps a controller on the member cluster) the resource is reverted. The operator can decide to suspend work while debugging the issue.
User Story 2: We're using Karmada to migrate workload from one blue cluster to a green cluster. Once we move a workload, we want to suspend any updates from blue to green until we cut over to green.
Persona
Cluster Admin who is oncall and has permission to modify karmada resources
Implementation Details
There are three ways that we can go about this:
1. Expose suspend on Work CRD
This is the simplest approach but requires the Cluster Admin to identify Work in the karmada-es-${cluster} namespace before patching them. Once the field is set, the controller will get out early. In the code here https://github.com/karmada-io/karmada/blob/master/pkg/controllers/status/work_status_controller.go#L93-L96, we would add the following:
Effort: Low
2. Expose suspend on ResourceBinding CRD
This requires the suspend field on ResourceBinding which will also get updated to Work. The Cluster Admin will have to identify the ResourceBinding in the workload namespace before patching them. The name of resource binding is more predictable than Work.
Effort: Medium
3. Expose suspend on PropagationPolicy and ClusterPropagationPolicy CRD
The Cluster Admin will have to know the PP or CPP with the highest priority and decide to suspend them. Changes to suspend will have to get updated to ResourceBinding and Work.
Effort: X-Large
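To make the three options easier to compare, here is a rough sketch of what the Cluster Admin might end up setting in each case. This is only an illustration: `suspend` is the field proposed in this issue and is not an existing Karmada API field, its exact name and position under `spec` are assumptions, and the object names and namespaces (`nginx-work`, `nginx-deployment`, `karmada-es-green`, `default`) are hypothetical. Only the fields relevant to the change are shown, so these are fragments rather than complete manifests.

```yaml
# Option 1: suspend a single Work.
# The admin first has to find the Work in the karmada-es-${cluster} namespace.
apiVersion: work.karmada.io/v1alpha1
kind: Work
metadata:
  name: nginx-work              # hypothetical; real Work names are generated
  namespace: karmada-es-green   # execution namespace of the target cluster
spec:
  suspend: true                 # proposed field, does not exist today
---
# Option 2: suspend on the ResourceBinding.
# The value would then have to be carried over to every Work it produces.
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: nginx-deployment        # assumed <name>-<kind> naming, easier to predict
  namespace: default            # the workload namespace
spec:
  suspend: true                 # proposed field, does not exist today
  resource:
    apiVersion: apps/v1
    kind: Deployment
    namespace: default
    name: nginx
---
# Option 3: suspend on the PropagationPolicy (or ClusterPropagationPolicy).
# The value would have to flow down to ResourceBinding and then to Work.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: foo
spec:
  suspend: true                 # proposed field, does not exist today
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - green
```

The further up the chain the field is set, the easier the object is for an admin to locate, but the more controllers need to copy the value downward, which is roughly what the Low / Medium / X-Large effort estimates reflect.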