karmada-io / karmada

How to handle dynamic parameters which need to be changed every time before generating Works for member clusters? #3436

Open mkloveyy opened 1 year ago

mkloveyy commented 1 year ago

Below is a basic scenario in our release process:

[release process diagram]

A fort pod is the first instance deployed to a cluster in a release. During the fort (bastion) deployment phase, no traffic is routed to it; in the canary deployment phase, traffic is routed to it.

We've implemented a deployment CRD called Rollout, based on OpenKruise, to handle the whole process. Part of the Rollout definition for the above case is as follows:

apiVersion:
kind: Rollout
metadata:
  name:
  namespace:
spec:
  fortPull: false
  replicas: 4
  selector:
    matchLabels:
      app: 
  strategy:
    canary:
      steps:
      - properties:  # fort
          delay: "-1"
          pull: "false"
        type: Fort
      - properties:  # canary
          delay: "-1"
          pull: "true"
        type: Fort
      - properties:  # first batch of rolling
          delay: "-1"
          maxUnavailable: "0"
          replicas: 50%
          timeout: "5"
        type: Batch
      - properties:  # second batch of rolling
          delay: "-1"
          maxUnavailable: "0"
          replicas: 100%
          timeout: "5"
        type: Batch
  subsetType: CloneSet
  template:
    ...

So in the multi-cluster case, every time a user creates a release, we need to:

  1. Generate canary.steps for all member clusters.
  2. Allocate a fort pod to one of the member clusters (the values in steps differ between the fort cluster and the non-fort clusters).

For example, the above Rollout needs to be updated in two clusters: cluster-x and cluster-z. The differences between the two clusters are as follows:

# cluster-x.yaml, fort pod in this cluster
strategy:
    canary:
      steps:
      - properties:  # fort
          delay: "-1"
          pull: "false"  # pull out fort
        type: Fort
      - properties:  # canary
          delay: "-1"
          pull: "true"  # pull out fort
        type: Fort
      ...

# cluster-z.yaml, no fort pod
strategy:
    canary:
      steps:
      - properties:  # fort
          delay: "-1"
          pull: nil  # skip
        type: Fort
      - properties:  # canary
          delay: "-1"
          pull: nil  # skip
        type: Fort
      ...

The fort pod is allocated by the following rules (see the sketch after this list):

  1. If the app has no fort pod yet, allocate it to the cluster with the higher weight based on the ResourceBinding. If the weights are equal, allocate it to the first cluster in the ResourceBinding's clusters by default.
  2. If the app already has a fort pod, keep it in the cluster where it currently resides.
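
A minimal Go sketch of these rules (allocateFortCluster and currentFortCluster are illustrative names, not Karmada APIs; the input is the list of target clusters from the scheduled ResourceBinding):

package release

import (
    workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// allocateFortCluster returns the name of the cluster that should host the fort pod.
// currentFortCluster is empty when the app has no fort pod yet.
func allocateFortCluster(targetClusters []workv1alpha2.TargetCluster, currentFortCluster string) string {
    if len(targetClusters) == 0 {
        return ""
    }
    // Rule 2: if a fort pod already exists and its cluster is still a target, keep it there.
    for _, tc := range targetClusters {
        if tc.Name == currentFortCluster {
            return currentFortCluster
        }
    }
    // Rule 1: otherwise pick the cluster with the most assigned replicas (the higher weight);
    // with equal weights this falls back to the first cluster listed in the ResourceBinding.
    fort := targetClusters[0]
    for _, tc := range targetClusters[1:] {
        if tc.Replicas > fort.Replicas {
            fort = tc
        }
    }
    return fort.Name
}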

Currently, we are trying to put this code in the reviseReplicas hook (#3375), but we encountered some issues during development, mostly related to the fort pod:

  1. If replicas is 0, the reviseReplicas hook will not be executed. We allow users to start a release when replicas is 0, but in that case we cannot update the workload because the reviseReplicas hook is never reached.
    // pkg/controllers/binding/common.go
    func needReviseReplicas(replicas int32, placement *policyv1alpha1.Placement) bool {
        return replicas > 0 && placement != nil && placement.ReplicaSchedulingType() == policyv1alpha1.ReplicaSchedulingTypeDivided
    }
  2. We cannot get cluster_name in the reviseReplicas hook, so we cannot judge whether we need to place the fort pod. Usually we put the fort pod in the cluster with the higher weight. If the weights are evenly distributed, we select the first cluster in the ResourceBinding by default. But we can't know the cluster in this hook, as it is not passed in.
    // pkg/controllers/binding/common.go
    for i := range targetClusters {
        ...
        if needReviseReplicas(replicas, placement) {
            if resourceInterpreter.HookEnabled(clonedWorkload.GroupVersionKind(), configv1alpha1.InterpreterOperationReviseReplica) {
                clonedWorkload, err = resourceInterpreter.ReviseReplica(clonedWorkload, int64(targetCluster.Replicas))
                if err != nil {
                    klog.Errorf("Failed to revise replica for %s/%s/%s in cluster %s, err is: %v",
                        workload.GetKind(), workload.GetNamespace(), workload.GetName(), targetCluster.Name, err)
                    return err
                }
            }
        }
        ...
    }

So, is there a suitable way to implement this kind of differentiated configuration? And where is the appropriate place to put the code? Perhaps extending reviseReplica by adding cluster_name and removing the replicas > 0 check, or introducing a new hook point?


XiShanYongYe-Chang commented 1 year ago

/cc @RainbowMango

chaunceyjiang commented 1 year ago

For example, the above Rollout needs to be updated in two clusters: cluster-x and cluster-z. The differences between the two clusters are as follows:

apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
spec:
  overrideRules:
    - targetCluster:
        clusterNames:
          - cluster-x
      overriders:
        plaintext:
          - path: /spec/strategy/canary/steps/0/pull
            operator: replace
            value:  "false"
          - path: /spec/strategy/canary/steps/1/pull
            operator: replace
            value:  "true"
    - targetCluster:
        clusterNames:
          - cluster-z
      overriders:
        plaintext:
          - path: /spec/strategy/canary/steps/0/pull
            operator: replace
            value:  nil
          - path: /spec/strategy/canary/steps/1/pull
            operator: replace
            value:  nil

Can this help you?

mkloveyy commented 1 year ago

For example, the above Rollout needs to be updated in two clusters: cluster-x and cluster-z. The differences between the two clusters are as follows:

apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
spec:
  overrideRules:
    - targetCluster:
        clusterNames:
          - cluster-x
      overriders:
        plaintext:
          - path: /spec/strategy/canary/steps/0/pull
            operator: replace
            value:  "false"
          - path: /spec/strategy/canary/steps/1/pull
            operator: replace
            value:  "true"
    - targetCluster:
        clusterNames:
          - cluster-z
      overriders:
        plaintext:
          - path: /spec/strategy/canary/steps/0/pull
            operator: replace
            value:  nil
          - path: /spec/strategy/canary/steps/1/pull
            operator: replace
            value:  nil

Can this help you?

Thanks for your reply!

Currently we do not put the canary.steps config in an OverridePolicy; instead, we implement the generation logic in the reviseReplicas hook, and the result is applied directly to the Works.

1. If replicas is 0, the reviseReplicas hook will not be executed. 2. We cannot get cluster_name in the reviseReplicas hook, so we cannot judge whether we need to place the fort pod.

And we still cannot decide which cluster to put the fort pod in, because of the issues above.

chaunceyjiang commented 1 year ago

Currently we do not put the canary.steps config in an OverridePolicy; instead, we implement the generation logic in the reviseReplicas hook, and the result is applied directly to the Works.

I'm sorry, I don't understand why you don't use OverridePolicy and instead want to use reviseReplicas?

mkloveyy commented 1 year ago

I'm sorry, I don't understand why you don't use OverridePolicy and instead want to use reviseReplicas?

Mainly for two reasons:

  1. We need to regenerate /spec/strategy/canary/steps every time a user creates a release. If we put this dynamic config in an OverridePolicy, we need to ensure the OverridePolicy is updated before the user updates the global resource. I don't think we have that option today.
  2. We consider OverridePolicy not very suitable for dynamic configs. Currently we put some static config in an OverridePolicy:

    apiVersion: policy.karmada.io/v1alpha1
    kind: OverridePolicy
    metadata:
      annotations:
        versionid: "47677"
      name:
      namespace:
    spec:
      resourceSelectors:
      - apiVersion:
        kind: Rollout
        name:
        namespace:
      overrideRules:
      - targetCluster:
          clusterNames:
          - cluster-x
        overriders:
          plaintext:
          - path: /metadata/annotations/cdos~1az
            operator: replace
            value: X
          - path: /spec/template/spec/containers/0/env/11
            operator: replace
            value:
              name: CDOS_K8S_CLUSTER
              value: cluster-x
      - targetCluster:
          clusterNames:
          - cluster-z
        overriders:
          plaintext:
          - path: /metadata/annotations/cdos~1az
            operator: replace
            value: Z
          - path: /spec/template/spec/containers/0/env/11
            operator: replace
            value:
              name: CDOS_K8S_CLUSTER
              value: cluster-z

Poor12 commented 1 year ago

ReviseReplicas is basically for dividing replicas when the replica scheduling policy is Divided. I don't think your scenario is suited to it, because you are not changing the replicas in the resource template but rather its strategy in different clusters.
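
For context, a minimal sketch of the kind of change ReviseReplica is intended for, assuming the Rollout keeps its replica count at .spec.replicas (reviseRolloutReplica is only an illustrative name):

package release

import (
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// reviseRolloutReplica writes the replica count assigned to one member cluster
// back into the workload's .spec.replicas field and returns the revised copy.
func reviseRolloutReplica(object *unstructured.Unstructured, replica int64) (*unstructured.Unstructured, error) {
    revised := object.DeepCopy()
    if err := unstructured.SetNestedField(revised.Object, replica, "spec", "replicas"); err != nil {
        return nil, err
    }
    return revised, nil
}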

mkloveyy commented 1 year ago

ReviseReplicas is basically for dividing replicas when the replica scheduling policy is Divided. I don't think your scenario is suited to it, because you are not changing the replicas in the resource template but rather its strategy in different clusters.

Thanks for your reply!

Where do you think would be the appropriate place to put this piece of logic? Or, to expand the scope a bit, how could we better implement such scenarios?

chaunceyjiang commented 1 year ago

We cannot get cluster_name in the reviseReplicas hook, so we cannot judge whether we need to place the fort pod.

Similar problem. https://github.com/karmada-io/karmada/issues/3159

I'm thinking about whether we should introduce cluster info in the ReviseReplica step.
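
As a rough sketch of that idea (the extra clusterName parameter does not exist in Karmada today), the hook signature could look like this:

package interpreter

import (
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// ResourceInterpreter is shown here only with the hook under discussion; the real
// interface has more operations, and today ReviseReplica takes no cluster name.
type ResourceInterpreter interface {
    // ReviseReplica revises the replicas of the object destined for the named member cluster.
    // With clusterName available, an interpreter could also make per-cluster decisions,
    // e.g. which cluster hosts the fort pod.
    ReviseReplica(object *unstructured.Unstructured, replica int64, clusterName string) (*unstructured.Unstructured, error)
}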

mkloveyy commented 1 year ago

We cannot get cluster_name in the reviseReplicas hook, so we cannot judge whether we need to place the fort pod.

Similar problem. #3159

I'm thinking about whether we should introduce cluster info in the ReviseReplica step.

Agreed. If we do so, we can implement custom logic that needs to be based on cluster info.

RainbowMango commented 1 year ago

If replicas is 0, the reviseReplicas hook will not be executed.

Do you mean you want to revise the Rollout via the reviseReplicas hook point? Have you implemented InterpretReplica?

We need to regenerate /spec/strategy/canary/steps every time a user creates a release.

Does creates a release mean creating a Rollout?

I talked to @snowplayfire yesterday about this issue. Technically, we can do this by extending the resource interpreter, but it seems to essentially belong to the override area. Not sure yet; maybe we can have a chat at the community meeting.

mkloveyy commented 1 year ago

Do you mean you want to revise the Rollout via the reviseReplicas hook point? Have you implemented InterpretReplica?

Yes, we have implemented InterpretReplica. However, since we need the newest ResourceBinding, we need to revise the Rollout between BuildResourceBinding and ensureWork.

We need to regenerate /spec/strategy/canary/steps every time a user creates a release.

Does creates a release mean creating a Rollout?

Roughly correct. A release is a deployment action that creates or updates a Rollout.

I talked to @snowplayfire yesterday about this issue. Technically, we can do this by extending the resource interpreter, but it seems to essentially belong to the override area.

My pleasure. Today we had a meeting with @snowplayfire and I heard about this from her. We've also talked with @XiShanYongYe-Chang and @chaunceyjiang. Currently I think both solutions (extending reviseReplica by adding cluster_name, and introducing a new hook point) have broader value and can solve our problems.

Not sure yet; maybe we can have a chat at the community meeting.

Agreed. If possible, we hope to have some concrete discussions and reach some conclusions at next Tuesday's meeting. Anyway, thank you very much for your assistance!😊

CharlesQQ commented 1 year ago

Does this issue meet your requirements? https://github.com/karmada-io/karmada/issues/1567

CharlesQQ commented 1 year ago

Or add a pause field to PropagationPolicy (PP); see this issue: https://github.com/karmada-io/karmada/issues/517

mkloveyy commented 1 year ago

Does this issue meet your requirements? #1567

Maybe this plan is more suitable for us. However, as OverridePolicy seems intended for storing static parameters, I'm considering whether dynamic parameters can be applied to the Work directly.

mkloveyy commented 1 year ago

@RainbowMango I think there may be another option: we could introduce a new hook point, perhaps called reviseObject. This hook would run after the OverridePolicy has been applied to the Works.

cops, ops, err := overrideManager.ApplyOverridePolicies(clonedWorkload, targetCluster.Name)
if err != nil {
    klog.Errorf("Failed to apply overrides for %s/%s/%s, err is: %v", clonedWorkload.GetKind(), clonedWorkload.GetNamespace(), clonedWorkload.GetName(), err)
    return err
}

if resourceInterpreter.HookEnabled(clonedWorkload.GroupVersionKind(), configv1alpha1.InterpreterOperationReviseObject) {
    clonedWorkload, err = resourceInterpreter.ReviseObject(clonedWorkload, targetCluster.Name, int64(targetCluster.Replicas))
    if err != nil {
        klog.Errorf("Failed to revise object for %s/%s/%s in cluster %s, err is: %v", workload.GetKind(), workload.GetNamespace(), workload.GetName(), targetCluster.Name, err)
        return err
    }
}

The purpose of this hook is to allow users to modify the Work directly.
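
To make the proposal concrete, here is a rough sketch of what would need to be added; neither the constant nor the interface below exists in Karmada today, and ObjectReviser is only an illustrative name:

package interpreter

import (
    configv1alpha1 "github.com/karmada-io/karmada/pkg/apis/config/v1alpha1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// A new interpreter operation alongside the existing InterpreterOperationReviseReplica.
const InterpreterOperationReviseObject configv1alpha1.InterpreterOperation = "ReviseObject"

// ObjectReviser is the extra capability the resource interpreter would gain.
type ObjectReviser interface {
    // ReviseObject runs after OverridePolicies have been applied and returns a freely
    // modified copy of the workload that will be wrapped into the Work for clusterName.
    ReviseObject(object *unstructured.Unstructured, clusterName string, replicas int64) (*unstructured.Unstructured, error)
}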

XiShanYongYe-Chang commented 1 year ago

I think there may be another option: we could introduce a new hook point, perhaps called reviseObject. This hook would run after the OverridePolicy has been applied to the Works.

Asking some folks to help take a look: @RainbowMango @chaunceyjiang @CharlesQQ @Poor12

mkloveyy commented 1 year ago

@shiyan2016 @snowplayfire @lxtywypc We can have a look together.

chaunceyjiang commented 1 year ago

cops, ops, err := overrideManager.ApplyOverridePolicies(clonedWorkload, targetCluster.Name)
if err != nil {
    klog.Errorf("Failed to apply overrides for %s/%s/%s, err is: %v", clonedWorkload.GetKind(), clonedWorkload.GetNamespace(), clonedWorkload.GetName(), err)
    return err
}

if resourceInterpreter.HookEnabled(clonedWorkload.GroupVersionKind(), configv1alpha1.InterpreterOperationReviseObject) {
    clonedWorkload, err = resourceInterpreter.ReviseObject(workload, clonedWorkload, targetCluster.Name)
    if err != nil {
        klog.Errorf("Failed to revise object for %s/%s/%s in cluster %s, err is: %v", workload.GetKind(), workload.GetNamespace(), workload.GetName(), targetCluster.Name, err)
        return err
    }
}

My suggestion is resourceInterpreter.ReviseObject(workload, clonedWorkload, targetCluster.Name)
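
As an illustration only, that variant would shape the hook like this; presumably passing the original resource template as well lets the interpreter consult the pre-override object while revising the per-cluster clone:

package interpreter

import (
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// ObjectReviser sketches the suggested variant: both the original template (workload) and
// the per-cluster clone already processed by OverridePolicies (clonedWorkload) are passed in.
type ObjectReviser interface {
    ReviseObject(workload, clonedWorkload *unstructured.Unstructured, clusterName string) (*unstructured.Unstructured, error)
}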

CharlesQQ commented 1 year ago

I think there may be another option: we could introduce a new hook point, perhaps called reviseObject. This hook would run after the OverridePolicy has been applied to the Works.

Might reviseObject lead to users not knowing which cluster's Work object has been revised?

mkloveyy commented 1 year ago

Might reviseObject lead to users not knowing which cluster's Work object has been revised?

That's why we need to pass targetCluster.Name to it.