karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

Modifying the Job's PropagationPolicy causes the Job to rebuild unexpectedly #3171

Open 1285yvonne opened 1 year ago

1285yvonne commented 1 year ago

What happened:

We had a batch of Jobs, and the original PropagationPolicy was configured with clusterA and clusterB. The Jobs were successfully created in both clusters, but only executed in clusterA.

# get jobs on clusterA  
$ kubectl get job 
NAME                        COMPLETIONS   DURATION   AGE
test-new-cronjob-27940858   1/1           26s        6m4s
test-new-cronjob-27940859   1/1           26s        5m4s
test-new-cronjob-27940860   1/1           27s        4m4s
test-new-cronjob-27940861   1/1           26s        3m4s
test-new-cronjob-27940862   1/1           26s        2m4s
test-new-cronjob-27940863   1/1           27s        64s
test-new-cronjob-27940864   0/1           4s         4s

# get jobs on clusterB
$ kubectl get job 
NAME                        COMPLETIONS   DURATION   AGE
test-new-cronjob-27940858   0/0           0s         6m3s
test-new-cronjob-27940859   0/0           0s         5m3s
test-new-cronjob-27940860   0/0           0s         4m3s
test-new-cronjob-27940861   0/0           0s         3m3s
test-new-cronjob-27940862   0/0           0s         2m3s
test-new-cronjob-27940863   0/0           0s         63s
test-new-cronjob-27940864   0/0           0s         3s

Then we adjusted the PropagationPolicy configuration, removing clusterA and keeping only clusterB. We found that FULLYAPPLIED of all the Jobs' ResourceBindings (including Jobs that had already been delivered successfully but not yet cleaned up) changed to False.

$ kubectl --kubeconfig=karmada.kubeconfig get rb
NAME                            SCHEDULED   FULLYAPPLIED   AGE
test-new-cronjob-27940858-job   True        False          6m20s
test-new-cronjob-27940859-job   True        False          5m20s
test-new-cronjob-27940860-job   True        False          4m20s
test-new-cronjob-27940861-job   True        False          3m20s
test-new-cronjob-27940862-job   True        False          2m20s
test-new-cronjob-27940863-job   True        False          80s
test-new-cronjob-27940864-job   True        False          20s

The reason for False is that the Kubernetes control plane of clusterB rejected karmada-agent's request to modify the Job configuration. At the same time, the old Jobs that had run in clusterA had been deleted.
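
For reference, the failing condition can be seen on the ResourceBinding with a command like the one below (the binding name is taken from the listing above; this exact invocation is illustrative, not from the original report), which returns the status shown next:

$ kubectl --kubeconfig=karmada.kubeconfig get rb test-new-cronjob-27940860-job -o yaml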

status:
  aggregatedStatus:
  - appliedMessage: 'Failed to apply all manifests (0/1): Job.batch "test-new-cronjob-27940860"
      is invalid: spec.completions: Invalid value: 1: field is immutable'
    clusterName: clusterB
  conditions:
  - lastTransitionTime: "2023-02-15T09:04:16Z"
    message: Failed to apply all works, see status.aggregatedStatus for details
    reason: FullyAppliedFailed
    status: "False"
    type: FullyApplied
  - lastTransitionTime: "2023-02-15T09:00:01Z"
    message: Binding has been scheduled
    reason: BindingScheduled
    status: "True"
    type: Scheduled

Then we adjusted the PropagationPolicy configuration and added clusterA back.

Viewing the ResourceBinding resources on the control plane, you can see that the bindings left over from the failed policy change in the previous step are now applied successfully, and the Jobs have also been recreated and delivered to clusterA for execution again.

$ kubectl --kubeconfig=karmada.kubeconfig get rb 
NAME                            SCHEDULED   FULLYAPPLIED   AGE
test-new-cronjob-27940858-job   True        True           24m
test-new-cronjob-27940859-job   True        True           23m
test-new-cronjob-27940860-job   True        True           22m
test-new-cronjob-27940861-job   True        True           21m
test-new-cronjob-27940862-job   True        True           20m
test-new-cronjob-27940863-job   True        True           19m
test-new-cronjob-27940864-job   True        True           18m

What you expected to happen:

I think this is a bug that needs to be fixed. Once a Job has finished (Completed or Failed), it should not be rescheduled after the PropagationPolicy changes; otherwise the Job will unexpectedly be executed again.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

1285yvonne commented 1 year ago

@RainbowMango @XiShanYongYe-Chang please take a look

RainbowMango commented 1 year ago

and the original PropagationPolicy was configured with clusterA and clusterB.

Can you share the PropagationPolicy here?

1285yvonne commented 1 year ago

just like this:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: cj-policy
spec:
  dependentOverrides:
  - cj-op
  association: true
  propagateDeps: true
  placement:
    clusterAffinity:
      clusterNames:
      - clusterA
      - clusterB
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
  resourceSelectors:
  - apiVersion: batch/v1
    kind: Job

RainbowMango commented 1 year ago

Thanks. By the way, .spec.association is deprecated, but it does no harm here.
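
If I remember correctly it was deprecated in favor of propagateDeps, which this policy already sets, so a minimal sketch of the relevant part of the spec without the deprecated field would simply be (illustrative only):

spec:
  propagateDeps: true
  placement:
    clusterAffinity:
      clusterNames:
      - clusterA
      - clusterB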

cc @jwcesign for help. Does this issue also happen on the latest release, like v1.4?

1285yvonne commented 1 year ago

Does this issue also happen on the latest release, like v1.4?

We don't have a v1.4 cluster yet; our current production environment uses v1.2.2. So I can't say whether it reproduces on v1.4, but if the logic of the relevant controllers has not changed, it should theoretically be reproducible there as well.

RainbowMango commented 1 year ago

OK. No worries. @jwcesign will try to reproduce it against the master branch.

I'm glad to hear that you are using Karmada in a production environment. Is your organization or company on the Adopter list?

1285yvonne commented 1 year ago

I'm glad to hear that you are using Karmada in a production environment. Is your organization or company on the Adopter list?

No, but it doesn't matter.

jwcesign commented 1 year ago

/assign

jwcesign commented 1 year ago

I tried with the following YAML:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: cj-policy
spec:
  association: true
  propagateDeps: true
  placement:
    clusterAffinity:
      clusterNames:
      - member2
      - member1
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
  resourceSelectors:
  - apiVersion: batch/v1
    kind: Job
    name: batch-job

---
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  completions: 40
  parallelism: 3
  template:
    metadata:
      namespace: luksa
      labels:
        app: batch-job
    spec:
      restartPolicy: OnFailure
      containers:
      - name: man
        image: luksa/batch-job

First propagated with member1/member2, then removed member1 from the policy.
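
Concretely, the edited placement after removing member1 looks something like this (a sketch of the change, not pasted verbatim from the test):

  placement:
    clusterAffinity:
      clusterNames:
      - member2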

  status:
    aggregatedStatus:
    - appliedMessage: 'Failed to apply all manifests (0/1): Job.batch "batch-job"
        is invalid: spec.completions: Invalid value: 40: field is immutable'
      clusterName: member2
      health: Unknown
    conditions:
    - lastTransitionTime: "2023-02-21T03:11:14Z"
      message: Failed to apply all works, see status.aggregatedStatus for details
      reason: FullyAppliedFailed
      status: "False"
      type: FullyApplied
    - lastTransitionTime: "2023-02-21T03:10:36Z"

It reproduces on both release-1.2 and release-1.4; I am trying to figure it out.

jwcesign commented 1 year ago

Hi, @1285yvonne, can you tell me why you need to remove clusters from the PropagationPolicy? Failover simulation?

1285yvonne commented 1 year ago

Hi, @1285yvonne, can you tell me why you need to remove clusters from the PropagationPolicy? Failover simulation?

Yes, some components on clusterA went wrong, but the cluster's status was still Ready. Our users simply removed clusterA from the PropagationPolicy to fail over.

jwcesign commented 1 year ago

Yes, some components on clusterA went wrong, but the cluster's status was still Ready. Our users simply removed clusterA from the PropagationPolicy to fail over.

So the behavior you want is: to reschedule the jobs that are not finished to other member clusters?

1285yvonne commented 1 year ago

So the behavior you want is: to reschedule the jobs that are not finished to other member clusters?

Yes, and the Jobs that have finished should not be rescheduled and executed twice.

jwcesign commented 1 year ago

Hi @1285yvonne, I have some other questions:

  1. Why do you need to split jobs across multiple clusters?
  2. What kind of jobs are running? AI? Data Processing? CI?

1285yvonne commented 1 year ago

  1. Why do you need to split jobs across multiple clusters?

Emmm, we expect that multiple clusters can be selected for job scheduling, so that jobs can still be scheduled when a problem occurs in a single cluster. But there is a problem here that bothers us: we actually expect jobs to be scheduled to the cluster with more available resources, but karmada-scheduler always chooses the first one in the cluster list.

  2. What kind of jobs are running? AI? Data Processing? CI?

The jobs are business-related, and some of them must not be executed more than once.

jwcesign commented 1 year ago

But there is a problem here that bothers us: we actually expect jobs to be scheduled to the cluster with more available resources, but karmada-scheduler always chooses the first one in the cluster list.

If you use DynamicWeight, the job will be scheduled to the clusters with more resources. Does this solve the problem?
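
A minimal sketch of what the placement could look like with dynamic weighting, based on my reading of the policy.karmada.io/v1alpha1 API (please adjust to your setup):

  placement:
    clusterAffinity:
      clusterNames:
      - clusterA
      - clusterB
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        dynamicWeight: AvailableReplicas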

1285yvonne commented 1 year ago

If you use DynamicWeight, the job will be scheduled to the clusters with more resources. Does this solve the problem?

We tried that before, but the result was not satisfactory. When both clusters have sufficient resources, the job is always dispatched to the first cluster.

jwcesign commented 1 year ago

the job is always dispatched to the first cluster

That looks unreasonable; it should schedule the replicas across both of them. Can you give more details? How did you set the policy, and what is the status of the member clusters?

1285yvonne commented 1 year ago

That looks unreasonable; it should schedule the replicas across both of them. Can you give more details? How did you set the policy, and what is the status of the member clusters?

Sorry, I forgot to fill in the details. In our case, most of our jobs have a single replica and can only be executed once, so every job ends up dispatched to the first cluster.

1285yvonne commented 1 year ago

Would you provide more suggestions for our scenario?

jwcesign commented 1 year ago

so every job ends up dispatched to the first cluster

I think that means the first cluster has more available resources. If the first cluster has fewer resources than the second one, the job will be scheduled to the second one.

1285yvonne commented 1 year ago

I have read the karmada-scheduler code (v1.2.2). I'm not sure my understanding is correct, but it seems to me that karmada-scheduler only checks whether a cluster has enough resources for the workload to be scheduled; it does not compare which of the two clusters has more resources. In our scenario, both clusterA and clusterB can be considered to have sufficient resources, far exceeding the request.

jwcesign commented 1 year ago

The core code is here; the behavior should be the same as what I said: https://github.com/karmada-io/karmada/blob/02dfe2ec0ef0ff58661d655b69fd845fdb988d13/pkg/scheduler/core/division_algorithm.go#L156

Are you really using DynamicWeight? Can you show me your original PP? And do you deploy karmada-estimator?

1285yvonne commented 1 year ago

I don't know how to quote code, so I'm pasting it here. You can see that divideRemainingReplicas is just a loop that assigns replicas to the cluster list in order. So in our scenario, when a job only needs one replica, it is always distributed to the first cluster.

// divideRemainingReplicas divide remaining Replicas to clusters and calculate desiredReplicaInfos
func divideRemainingReplicas(remainingReplicas int, desiredReplicaInfos map[string]int64, clusterNames []string) {
    if remainingReplicas <= 0 {
        return
    }

    clusterSize := len(clusterNames)
    if remainingReplicas < clusterSize {
        for i := 0; i < remainingReplicas; i++ {
            desiredReplicaInfos[clusterNames[i]]++
        }
    } else {
        avg, residue := remainingReplicas/clusterSize, remainingReplicas%clusterSize
        for i := 0; i < clusterSize; i++ {
            if i < residue {
                desiredReplicaInfos[clusterNames[i]] += int64(avg) + 1
            } else {
                desiredReplicaInfos[clusterNames[i]] += int64(avg)
            }
        }
    }
}
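
To make this concrete, here is a small self-contained sketch (my own illustration, reusing the function above with added comments) showing that a single remaining replica always lands on the first cluster in the list:

package main

import "fmt"

// divideRemainingReplicas is copied from the snippet above.
func divideRemainingReplicas(remainingReplicas int, desiredReplicaInfos map[string]int64, clusterNames []string) {
    if remainingReplicas <= 0 {
        return
    }

    clusterSize := len(clusterNames)
    if remainingReplicas < clusterSize {
        // Fewer remaining replicas than clusters: hand them out in list order,
        // so the clusters at the front of the list always win.
        for i := 0; i < remainingReplicas; i++ {
            desiredReplicaInfos[clusterNames[i]]++
        }
    } else {
        // Otherwise split evenly, giving the residue to the first clusters in the list.
        avg, residue := remainingReplicas/clusterSize, remainingReplicas%clusterSize
        for i := 0; i < clusterSize; i++ {
            if i < residue {
                desiredReplicaInfos[clusterNames[i]] += int64(avg) + 1
            } else {
                desiredReplicaInfos[clusterNames[i]] += int64(avg)
            }
        }
    }
}

func main() {
    // One replica, two candidate clusters in list order.
    desired := map[string]int64{"clusterA": 0, "clusterB": 0}
    divideRemainingReplicas(1, desired, []string{"clusterA", "clusterB"})
    fmt.Println(desired) // map[clusterA:1 clusterB:0]
}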

And is the rescheduling a bug, or a behavior we need to work around ourselves?

jwcesign commented 1 year ago

I tested with the following YAML:

root@karmada-dev-linux-jiangwei [02:31:58 PM] [~/workspace/offcial] [master *]
-> # cat job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-{index}
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
---
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-{index}
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

The jobs will be scheduled to multiple clusters:

root@karmada-dev-linux-jiangwei [02:31:51 PM] [~/workspace/offcial] [master *]
-> # karmadactl get jobs | grep -v 0/0
NAME    CLUSTER   COMPLETIONS   DURATION   AGE   ADOPTION
pi-1    member1   0/1           8s         8s    Y
pi-10   member1   0/1           8s         8s    Y
pi-12   member1   0/1           10s        10s   Y
pi-18   member1   0/1           10s        10s   Y
pi-19   member1   0/1           9s         9s    Y
pi-2    member1   0/1           9s         9s    Y
pi-20   member1   0/1           8s         8s    Y
pi-3    member1   0/1           10s        10s   Y
pi-4    member1   0/1           10s        10s   Y
pi-5    member1   0/1           8s         8s    Y
pi-6    member1   0/1           10s        10s   Y
pi-11   member2   0/1           11s        11s   Y
pi-13   member2   0/1           11s        11s   Y
pi-14   member2   0/1           7s         7s    Y
pi-15   member2   0/1           11s        11s   Y
pi-16   member2   0/1           10s        10s   Y
pi-17   member2   0/1           10s        10s   Y
pi-7    member2   0/1           10s        10s   Y
pi-8    member2   0/1           11s        11s   Y
pi-9    member2   0/1           11s        11s   Y

For the divideRemainingReplicas function, the clusterNames array is sorted by available resources, so other clusters will be chosen once the first cluster has fewer resources available.

1285yvonne commented 1 year ago

Thanks, I got it. It may be because our member clusters are relatively large, so this is harder to trigger. We will find some smaller clusters to test this.

jwcesign commented 1 year ago

Hi, @1285yvonne

Emmm, we expect that multiple clusters can be selected for job scheduling, so that jobs can still be scheduled when a problem occurs in a single cluster.

From what you described, you want to select multiple clusters to avoid scheduling failures, but each job could be scheduled to just one cluster (failing over when that cluster fails), so it isn't strictly necessary to split the jobs across multiple clusters. Do I understand correctly?

Or are there other scenarios? For example, one cluster doesn't have enough resources on its own, but multiple clusters combined do, so you have to split the jobs.

1285yvonne commented 1 year ago

Sorry for replying so late.

Or are there other scenarios? For example, one cluster doesn't have enough resources on its own, but multiple clusters combined do, so you have to split the jobs.

Yes, that's one of our scenarios. There is another: for example, a single member cluster needs to upgrade or downgrade its core services, but we don't want a change in a single cluster to affect the execution of new jobs, that is, we don't want to miss any jobs. So we need to configure multiple clusters in the PropagationPolicy.