1285yvonne opened this issue 1 year ago
@RainbowMango @XiShanYongYe-Chang please take a look
> ...and the original PropagationPolicy is configured as clusterA and clusterB.
Can you share the PropagationPolicy here?
Just like this:
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: cj-policy
spec:
  dependentOverrides:
  - cj-op
  association: true
  propagateDeps: true
  placement:
    clusterAffinity:
      clusterNames:
      - clusterA
      - clusterB
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
  resourceSelectors:
  - apiVersion: batch/v1
    kind: Job
```
Thanks. By the way, `.spec.association` is deprecated, but keeping it does no harm.
cc @jwcesign for help. Does this issue also happen on the latest release? Like v1.4?
> Does this issue also happen on the latest release? Like v1.4?
We don't have a v1.4 cluster yet; our current production environment runs v1.2.2. So I can't say whether it reproduces on v1.4, but if the logic of the relevant controller has not changed, it should theoretically reproduce there as well.
OK. No worries. @jwcesign will try to reproduce it against the master branch.
I'm glad to hear that you are using Karmada in the production environment. Does your organization or company appear on the Adopter list?
> I'm glad to hear that you are using Karmada in the production environment. Does your organization or company appear on the Adopter list?
No, but it doesn't matter.
/assign
I tried with the following yaml:
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: cj-policy
spec:
  association: true
  propagateDeps: true
  placement:
    clusterAffinity:
      clusterNames:
      - member2
      - member1
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
  resourceSelectors:
  - apiVersion: batch/v1
    kind: Job
    name: batch-job
---
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  completions: 40
  parallelism: 3
  template:
    metadata:
      namespace: luksa
      labels:
        app: batch-job
    spec:
      restartPolicy: OnFailure
      containers:
      - name: man
        image: luksa/batch-job
```
First with member1/member2, then delete member1.
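In other words, the clusterNames list is edited down to just member2; a sketch of the updated placement (an illustrative fragment, not the full policy):
```yaml
# Placement after deleting member1 (illustrative):
placement:
  clusterAffinity:
    clusterNames:
    - member2
```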
```yaml
status:
  aggregatedStatus:
  - appliedMessage: 'Failed to apply all manifests (0/1): Job.batch "batch-job"
      is invalid: spec.completions: Invalid value: 40: field is immutable'
    clusterName: member2
    health: Unknown
  conditions:
  - lastTransitionTime: "2023-02-21T03:11:14Z"
    message: Failed to apply all works, see status.aggregatedStatus for details
    reason: FullyAppliedFailed
    status: "False"
    type: FullyApplied
  - lastTransitionTime: "2023-02-21T03:10:36Z"
```
It is reproduced in release-1.2 and release-1.4; I am trying to figure it out.
Hi, @1285yvonne, can you tell me why you need to delete the clusters in PP? Failover Simulation?
> Hi, @1285yvonne, can you tell me why you need to delete the clusters in PP? Failover Simulation?
Yes, some components on clusterA went wrong, but the cluster's status was still Ready. Our users just deleted clusterA in the PP to fail over.
> Yes, some components on clusterA went wrong, but the cluster's status was still Ready. Our users just deleted clusterA in the PP to fail over.
So the behavior you want is: to reschedule the jobs that are not finished to other member clusters?
> So the behavior you want is: to reschedule the jobs that are not finished to other member clusters?
Yes, and jobs that have already finished should not be scheduled and executed twice.
Hi @1285yvonne, I have some other questions:
- Why do you need to split jobs into multi-clusters?
Emmm, we expect that multiple clusters can be selected for job scheduling, to prevent jobs from becoming unschedulable when a problem occurs in a single cluster. But in fact there is a problem here that bothers us: we expect jobs to be scheduled onto clusters with more spare resources, but karmada-scheduler always chooses the first one in the cluster list.
- What kind of jobs is running? AI? Data Processing? CI?
The jobs are business-related, and some of them are not allowed to be executed repeatedly.
> But in fact there is a problem here that bothers us: we expect jobs to be scheduled onto clusters with more spare resources, but karmada-scheduler always chooses the first one in the cluster list.
If you use DynamicWeight, the job will be scheduled to the clusters with more resources. Does this solve the problem?
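For reference, applied to the policy shared earlier, dynamic weighting would look roughly like this (a sketch; `AvailableReplicas` is the dynamic weight factor supported by the API):
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: cj-policy
spec:
  placement:
    clusterAffinity:
      clusterNames:
      - clusterA
      - clusterB
    replicaScheduling:
      replicaDivisionPreference: Weighted
      replicaSchedulingType: Divided
      weightPreference:
        # Weight clusters dynamically by their estimated spare capacity.
        dynamicWeight: AvailableReplicas
  resourceSelectors:
  - apiVersion: batch/v1
    kind: Job
```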
> If you use DynamicWeight, the job will be scheduled to the clusters with more resources. Does this solve the problem?
We tried it before, but the result was not satisfactory. When both clusters have sufficient resources, jobs are always dispatched to the first cluster.
> jobs are always dispatched to the first cluster
Looks unreasonable; it should schedule the replicas to both of them. Can you give more details? How did you set the policy, and what is the status of the member clusters?
> Looks unreasonable; it should schedule the replicas to both of them. Can you give more details? How did you set the policy, and what is the status of the member clusters?
Sorry, I forgot to fill in the details. In our case, most of our jobs have a single replica and can only be executed once. So it turns out that every job is dispatched to the first cluster.
Would you provide more suggestions for our scenario?
> So it turns out that every job is dispatched to the first cluster
I think that means the first cluster has more available resources. If the first cluster has fewer resources than the second one, the job will be scheduled to the second one.
I have read the karmada-scheduler code (v1.2.2). I'm not sure my understanding is correct, but it seems that karmada-scheduler only checks whether a cluster has enough resources for scheduling; it does not compare which of the two clusters has more resources. In our scenario, both cluster A and cluster B have sufficient resources, far exceeding the request.
The core code is here; the behavior should be the same as what I said: https://github.com/karmada-io/karmada/blob/02dfe2ec0ef0ff58661d655b69fd845fdb988d13/pkg/scheduler/core/division_algorithm.go#L156
Also, are you really using dynamicWeight? Can you show me your original PP? And do you deploy the karmada-scheduler-estimator?
I don't know how to quote code, so I paste it here. You can see that divideRemainingReplicas is just a loop that assigns replicas to the cluster list in order. So in our scenario, when a job needs only one replica, it is always distributed to the first cluster.
```go
// divideRemainingReplicas divide remaining Replicas to clusters and calculate desiredReplicaInfos
func divideRemainingReplicas(remainingReplicas int, desiredReplicaInfos map[string]int64, clusterNames []string) {
	if remainingReplicas <= 0 {
		return
	}
	clusterSize := len(clusterNames)
	if remainingReplicas < clusterSize {
		for i := 0; i < remainingReplicas; i++ {
			desiredReplicaInfos[clusterNames[i]]++
		}
	} else {
		avg, residue := remainingReplicas/clusterSize, remainingReplicas%clusterSize
		for i := 0; i < clusterSize; i++ {
			if i < residue {
				desiredReplicaInfos[clusterNames[i]] += int64(avg) + 1
			} else {
				desiredReplicaInfos[clusterNames[i]] += int64(avg)
			}
		}
	}
}
```
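To make the behavior concrete, here is a minimal driver (hypothetical code, assuming the divideRemainingReplicas function above is in scope): a single remaining replica always lands on the first cluster in the list.
```go
package main

import "fmt"

func main() {
	// With one remaining replica and two clusters, only
	// clusterNames[0] is ever incremented, whatever its capacity.
	desired := map[string]int64{"clusterA": 0, "clusterB": 0}
	divideRemainingReplicas(1, desired, []string{"clusterA", "clusterB"})
	fmt.Println(desired) // map[clusterA:1 clusterB:0]
}
```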
And is the rescheduling a bug, or a behavior that we need to avoid on our own?
I tested with the following yaml:
```
root@karmada-dev-linux-jiangwei [02:31:58 PM] [~/workspace/offcial] [master *]
-> # cat job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-{index}
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
---
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-{index}
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
```
The job will be scheduled to multiple clusters:
```
root@karmada-dev-linux-jiangwei [02:31:51 PM] [~/workspace/offcial] [master *]
-> # karmadactl get jobs | grep -v 0/0
NAME    CLUSTER   COMPLETIONS   DURATION   AGE   ADOPTION
pi-1    member1   0/1           8s         8s    Y
pi-10   member1   0/1           8s         8s    Y
pi-12   member1   0/1           10s        10s   Y
pi-18   member1   0/1           10s        10s   Y
pi-19   member1   0/1           9s         9s    Y
pi-2    member1   0/1           9s         9s    Y
pi-20   member1   0/1           8s         8s    Y
pi-3    member1   0/1           10s        10s   Y
pi-4    member1   0/1           10s        10s   Y
pi-5    member1   0/1           8s         8s    Y
pi-6    member1   0/1           10s        10s   Y
pi-11   member2   0/1           11s        11s   Y
pi-13   member2   0/1           11s        11s   Y
pi-14   member2   0/1           7s         7s    Y
pi-15   member2   0/1           11s        11s   Y
pi-16   member2   0/1           10s        10s   Y
pi-17   member2   0/1           10s        10s   Y
pi-7    member2   0/1           10s        10s   Y
pi-8    member2   0/1           11s        11s   Y
pi-9    member2   0/1           11s        11s   Y
```
For the function divideRemainingReplicas, the clusterNames array is sorted by available resources, so other clusters will be chosen once a cluster's resources run lower.
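As a simplified sketch of that ordering (not the actual Karmada code; the availableReplicas numbers are made up):
```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Hypothetical available-replica estimates per member cluster.
	availableReplicas := map[string]int64{"member1": 120, "member2": 300}
	clusters := []string{"member1", "member2"}

	// Sort descending by spare capacity, so the first cluster passed to
	// divideRemainingReplicas is the one with the most room.
	sort.Slice(clusters, func(i, j int) bool {
		return availableReplicas[clusters[i]] > availableReplicas[clusters[j]]
	})
	fmt.Println(clusters) // [member2 member1]
}
```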
Thanks, I got it. It may be because our member clusters are relatively large, which makes this harder to trigger. We will find some smaller-scale clusters to test this.
Hi, @1285yvonne
> Emmm, we expect that multiple clusters can be selected for job scheduling, to prevent jobs from becoming unschedulable when a problem occurs in a single cluster.
From what you described, you want to select multiple clusters to avoid scheduling failure, but the jobs could be scheduled to only one cluster (failing over when that cluster fails, without needing to split the jobs across clusters). Do I understand correctly?
Or are there other scenarios? E.g., one cluster doesn't have enough resources, but multiple clusters together do, so you have to split the jobs.
Sorry for replying so late.
> Or are there other scenarios? E.g., one cluster doesn't have enough resources, but multiple clusters together do, so you have to split the jobs.
Yes, it's one of our scenarios. There is another one: for example, a single member cluster needs to change or downgrade its core services, but we don't expect that single-cluster change to affect the execution of new jobs; that is, we don't want to miss any job. So we need to configure multiple clusters in the pp.
What happened:
We had a batch of jobs, and the original PropagationPolicy was configured with clusterA and clusterB. Jobs were successfully created in both clusters, but only executed in clusterA.
Then we adjusted the pp configuration, removing clusterA and keeping only clusterB. We found that the FULLYAPPLIED condition of all jobs' ResourceBindings (including jobs that had been successfully delivered but not cleaned up) changed to False.
The reason for the False status is that the k8s control plane of clusterB rejected karmada-agent's request to modify the job configuration. At the same time, the old jobs that had run in clusterA were deleted.
Then we adjusted the pp configuration again and added clusterA back. Viewing the rb resources on the control plane, you can find that the rb resources left over from the failed second step have now been applied successfully, and the jobs were recreated and delivered to clusterA for execution again.
What you expected to happen:
I think this is a bug that needs to be fixed. Once a job's status has reached a terminal state (completed or failed), it should not be rescheduled after the pp changes; otherwise the job will unexpectedly be executed again.
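For illustration, a minimal sketch of such a terminal-state check using the standard batch/v1 condition types (the surrounding controller wiring is hypothetical):
```go
package jobcheck

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// isJobFinished reports whether a Job has reached a terminal state
// (Complete or Failed); such Jobs should be skipped on reschedule.
func isJobFinished(job *batchv1.Job) bool {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) &&
			c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}
```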
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- kubectl-karmada or karmadactl version (the result of `kubectl-karmada version` or `karmadactl version`):