Closed dddddai closed 2 years ago
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
To complete the pull request process, please assign rainbowmango after the PR has been reviewed.
You can assign the PR to them by writing /assign @rainbowmango
in a comment when ready.
The full list of commands accepted by this bot can be found here.
We need to do some code refactoring here; this logic fails for the Duplicated policy:
// TODO: should schedule as much as possible?
deltaLen := len(spec.Clusters) - len(reservedClusters)
if len(candidateClusters) < deltaLen {
// for ReplicaSchedulingTypeDivided, we will try to migrate replicas to the other health clusters
if placement.ReplicaScheduling == nil || placement.ReplicaScheduling.ReplicaSchedulingType == policyv1alpha1.ReplicaSchedulingTypeDuplicated {
klog.Warningf("ignore reschedule binding as insufficient available cluster")
return ScheduleResult{}, nil
}
}
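The early return above can be reproduced in isolation. The following is a minimal sketch, not the actual Karmada code: the type and function names are stand-ins, but the control flow mirrors the quoted snippet, showing that with a Duplicated policy and no candidate clusters the stale target list (still containing the failed cluster) is kept:

```go
package main

import "fmt"

// SchedulingType is a stand-in for policyv1alpha1.ReplicaSchedulingType (hypothetical).
type SchedulingType string

const (
	Duplicated SchedulingType = "Duplicated"
	Divided    SchedulingType = "Divided"
)

// reschedule mirrors the early-return logic quoted above: when there are not
// enough candidate clusters and the policy is Duplicated, it gives up and
// keeps the previous scheduling result unchanged.
func reschedule(prev, reserved, candidates []string, t SchedulingType) []string {
	deltaLen := len(prev) - len(reserved)
	if len(candidates) < deltaLen {
		if t == Duplicated {
			// Early return: the unjoined cluster stays in the result.
			return prev
		}
	}
	// The real code would score and pick candidates to fill the gap;
	// elided here, we just take whatever candidates exist.
	return append(append([]string{}, reserved...), candidates...)
}

func main() {
	// Result was [A, B, C]; A was unjoined and no spare cluster exists.
	got := reschedule([]string{"A", "B", "C"}, []string{"B", "C"}, nil, Duplicated)
	fmt.Println(got) // → [A B C]: the failed cluster A is never removed
}
```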
we need to do code refactoring here which will failed for Duplicated policy:

What do you mean by "failed for Duplicated policy"?
If the number of available clusters is less than deltaLen, it would NOT perform a reschedule, and I don't know if that's expected.
For example, suppose a propagation policy propagates a workload to all clusters; should a reschedule be performed when a cluster is unjoined? I guess the answer is YES, so is this the difference between Reschedule and FailoverSchedule?
I'll just switch to Chinese. FailoverSchedule was originally designed to be used together with spread constraints, so its logic is: in the Duplicated scenario, without a spread constraint, no new cluster can be selected, so the original result is kept.
Original scenario: the propagation policy specifies three clusters [A, B, C] and the spread constraint is set to 2, so the scheduling result is [A, B]. If A fails, C is selected to fill in, and the new result is [B, C].
Problem scenario: the propagation policy specifies [A, B, C] without a spread constraint, so the scheduling result is [A, B, C]. If A fails, no new cluster can be selected, and the [A, B, C] result is kept, whereas what we expect is for the unjoined cluster to be removed from the scheduling result.
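The expected behavior in the problem scenario, removing the unjoined cluster from the result even when no replacement exists, could look roughly like this. This is a sketch under assumptions, not Karmada's actual implementation; the function name is hypothetical:

```go
package main

import "fmt"

// removeUnhealthy drops failed/unjoined clusters from a scheduling result,
// keeping whatever healthy clusters remain, even if nothing replaces them.
// (Hypothetical helper illustrating the expected behavior.)
func removeUnhealthy(result []string, unhealthy map[string]bool) []string {
	kept := make([]string, 0, len(result))
	for _, c := range result {
		if !unhealthy[c] {
			kept = append(kept, c)
		}
	}
	return kept
}

func main() {
	// Problem scenario from the discussion: result is [A, B, C], A fails,
	// and no replacement cluster is available.
	fmt.Println(removeUnhealthy([]string{"A", "B", "C"}, map[string]bool{"A": true}))
	// → [B C]: the unjoined cluster is gone instead of lingering in the result
}
```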
Got it. So for Reschedule we don't consider "selecting the same number of replacement clusters from the healthy clusters", right? And is that the only difference between Reschedule and FailoverSchedule?
It's a bit complicated; we need to take into account whether a spread constraint is configured. What was written here before was also incomplete: simply selecting the same number of clusters can only satisfy the SpreadByFieldCluster type of spread constraint.
What about deleting the unhealthy clusters from the scheduling result, keeping the result for the healthy clusters, and then going straight through the normal scheduling logic so the scheduler performs a second pass? In #1051 I have already merged scale scheduling with normal scheduling; I'm wondering whether the failover logic can be merged as well?
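The proposal here, stripping unhealthy clusters from the result and then falling through to a second pass of normal scheduling, can be sketched as follows. All names are hypothetical stand-ins for the real scheduler functions; this only illustrates the shape of the merge:

```go
package main

import "fmt"

// scheduleNormally stands in for the regular scheduling pass; here it simply
// refills the target list from available clusters up to the desired size.
// (Hypothetical; the real pass scores candidates and honors constraints.)
func scheduleNormally(reserved, available []string, want int) []string {
	result := append([]string{}, reserved...)
	for _, c := range available {
		if len(result) >= want {
			break
		}
		result = append(result, c)
	}
	return result
}

// failoverViaNormalPath implements the idea from the comment above: strip
// unhealthy clusters from the previous result, then let normal scheduling
// run again over what remains.
func failoverViaNormalPath(prev []string, unhealthy map[string]bool, available []string, want int) []string {
	reserved := make([]string, 0, len(prev))
	for _, c := range prev {
		if !unhealthy[c] {
			reserved = append(reserved, c)
		}
	}
	return scheduleNormally(reserved, available, want)
}

func main() {
	// A fails and D is a spare cluster: normal scheduling refills the slot.
	fmt.Println(failoverViaNormalPath([]string{"A", "B", "C"}, map[string]bool{"A": true}, []string{"D"}, 3))
	// → [B C D]
	// With no spare clusters, the failed cluster is still removed.
	fmt.Println(failoverViaNormalPath([]string{"A", "B", "C"}, map[string]bool{"A": true}, nil, 3))
	// → [B C]
}
```

This way the Duplicated case needs no special early return: when no replacement exists, the second pass simply returns the healthy subset.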
When all types share the same scheduling logic, the spread constraint will be taken into account automatically. Failover currently has some defects, e.g. it does not take idle resources into consideration when rescheduling. I think we may merge all types into one scheduling process.
Agree, I think this is the best way to solve the problem
Good idea, I'll go take a look at that merge PR. Failover indeed needs a major overhaul.
Please cc me after you made an agreement. :)
Hi @Garrybest, I committed a new patch to merge failover scheduling with normal scheduling.
Thanks @dddddai, I will check it after #1051 gets merged.
/assign
Let's get back to this PR after #1051, is this the plan? Thanks @dddddai, I guess it will be soon. Sorry for letting this sit so long.
Hi @RainbowMango, I don't think there is any logical conflict between this and #1051, so the order doesn't matter. Never mind, glad to help :) Once this and #1051 get merged, all kinds of scheduling will share the same process.
@dddddai Can you help resolve the conflicts?
We don't have to add a merge commit. I suggest a rebase.
There is no merge commit, that's just the commit message :)
@dddddai Let us discuss the PR in the weekly meeting tomorrow; the scenario of "unjoining a target cluster" is a common user behavior.
There is nothing that requeues the RBs related to the deleted cluster in the deleteCluster function; I think the requeue logic should be added: https://github.com/karmada-io/karmada/blob/8ceb9df2b9672d3e50b0d52d6b4e7d8c000eff81/pkg/scheduler/scheduler.go#L639-L660
@huone1 sorry I would be travelling in the next three days, how about discussing it in the slack channel :)
There's no need to requeue RBs on delete, since they are already requeued when the DeletionTimestamp is set:
https://github.com/karmada-io/karmada/blob/a08dd3d7675b008e5e4137f44843d76880d454b3/pkg/scheduler/scheduler.go#L620-L624
OK, have fun!
What type of PR is this? /kind feature
What this PR does / why we need it: Please refer to https://github.com/karmada-io/karmada/issues/829#issuecomment-982235580
Which issue(s) this PR fixes: Part of #829
Special notes for your reviewer:
Does this PR introduce a user-facing change?: