karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

Reschedule bindings on cluster change #829

Closed dddddai closed 2 years ago

dddddai commented 2 years ago

What happened: Unjoined clusters still remain in binding.spec.clusters

What you expected to happen: Unjoined clusters should be deleted from binding.spec.clusters

How to reproduce it (as minimally and precisely as possible):

1. Set up the environment (script v0.8)

root@myserver:~/karmada# hack/local-up-karmada.sh

root@myserver:~/karmada# hack/create-cluster.sh member1 $HOME/.kube/karmada.config

root@myserver:~/karmada# kubectl config use-context karmada-apiserver

root@myserver:~/karmada# karmadactl join member1 --cluster-kubeconfig=$HOME/.kube/karmada.config

root@myserver:~/karmada# kubectl apply -f samples/nginx

root@myserver:~/karmada# kubectl get deploy
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   1/1     1            1           47h

2. Unjoin member1

root@myserver:~/karmada# karmadactl unjoin member1

root@myserver:~/karmada# kubectl get clusters
No resources found

3. Check binding.spec.clusters

root@myserver:~/karmada# kubectl describe rb
......
Spec:
  Clusters:
    Name:  member1
......

Anything else we need to know?: Is this expected behavior? If not, who is supposed to take responsibility for deleting unjoined clusters from the binding? The scheduler, or other controllers (like the cluster controller)?

Environment:

RainbowMango commented 2 years ago

Thanks for opening this; we can track the progress and discuss the solution here. Any ideas? My thought would be to re-schedule the bindings. Two scenarios should be considered:

dddddai commented 2 years ago

@RainbowMango Thanks for your reply

  • the cluster should be removed from the RB after it is unjoined
  • the cluster should be added to the RB after it is joined

Yes this makes sense, but there might be a new problem, let's consider such a case:

  1. RB is scheduled to cluster1 and cluster2
  2. Unjoin cluster1
  3. cluster1 is removed from binding.spec.clusters
  4. Join cluster1
  5. Scheduler should re-schedule the RB, but there is no valid schedule type for it and it would be handled as AvoidSchedule, which doesn't make sense https://github.com/karmada-io/karmada/blob/8e1e16e95081d7431852c6a46ea372ff79a50b5e/pkg/scheduler/scheduler.go#L834-L836
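The fall-through described in step 5 can be sketched with a simplified model. This is illustrative only, not the actual Karmada scheduler code (the type names and fields are assumptions): once a binding already carries an applied placement and neither the placement, the replicas, nor the health of a bound cluster has changed, the re-joined cluster matches no schedule type and the binding falls into AvoidSchedule.

```go
package main

import "fmt"

// ScheduleType and the decision logic below are a simplified, hypothetical
// model of the behavior described in this thread, not Karmada's real code.
type ScheduleType string

const (
	FirstSchedule     ScheduleType = "FirstSchedule"
	ReconcileSchedule ScheduleType = "ReconcileSchedule"
	ScaleSchedule     ScheduleType = "ScaleSchedule"
	FailoverSchedule  ScheduleType = "FailoverSchedule"
	AvoidSchedule     ScheduleType = "AvoidSchedule"
)

type bindingState struct {
	hasAppliedPlacement bool // the "applied placement" annotation is present
	placementChanged    bool // policy placement differs from the applied one
	replicasChanged     bool // desired replicas changed
	targetClusterFailed bool // a bound cluster became unhealthy
}

// getScheduleType mirrors the fall-through problem from step 5: after
// unjoin + re-join, none of the conditions match, so the binding is
// treated as AvoidSchedule and is never re-scheduled.
func getScheduleType(s bindingState) ScheduleType {
	switch {
	case !s.hasAppliedPlacement:
		return FirstSchedule
	case s.placementChanged:
		return ReconcileSchedule
	case s.replicasChanged:
		return ScaleSchedule
	case s.targetClusterFailed:
		return FailoverSchedule
	default:
		return AvoidSchedule
	}
}

func main() {
	// cluster1 was unjoined, removed from spec.clusters, then re-joined:
	// placement and replicas are unchanged and no bound cluster is failing.
	fmt.Println(getScheduleType(bindingState{hasAppliedPlacement: true}))
}
```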
dddddai commented 2 years ago

I have another question: does the scheduler perform Failover after unjoining a target cluster? I don't know because I haven't tried this.

XiShanYongYe-Chang commented 2 years ago

I have another question: does the scheduler perform Failover after unjoining a target cluster? I don't know because I haven't tried this.

No.

dddddai commented 2 years ago

@XiShanYongYe-Chang Thanks for your reply

No.

Is this expected behavior? Shouldn't Failover care about cluster delete events?

XiShanYongYe-Chang commented 2 years ago

When there is a cluster delete event, a rescheduling may need to be triggered.

What do you think, @RainbowMango?

RainbowMango commented 2 years ago

Echoing https://github.com/karmada-io/karmada/issues/829#issuecomment-945744121:

Agree. But no idea how to solve that issue now.

RainbowMango commented 2 years ago

/priority important-soon

@dddddai I added this issue to the v0.10 milestone; let's fix this in this release.

dddddai commented 2 years ago

Agree. But no idea how to solve that issue now.

How about removing the applied placement annotation of the binding on cluster add/delete? It would make the scheduler reschedule the binding as ReconcileSchedule https://github.com/karmada-io/karmada/blob/8e1e16e95081d7431852c6a46ea372ff79a50b5e/pkg/scheduler/scheduler.go#L824-L828
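A rough sketch of that workaround (the annotation key below is an assumption for illustration; the real key lives in the Karmada codebase): deleting the applied-placement annotation makes the next sync see a placement diff, which pushes the binding into the ReconcileSchedule branch.

```go
package main

import "fmt"

// appliedPlacementKey is illustrative; the actual annotation key is
// defined in the Karmada util package.
const appliedPlacementKey = "policy.karmada.io/applied-placement"

// clearAppliedPlacement drops the applied-placement annotation so the
// scheduler no longer sees the binding as already scheduled against the
// current placement.
func clearAppliedPlacement(annotations map[string]string) map[string]string {
	if annotations == nil {
		return nil
	}
	delete(annotations, appliedPlacementKey)
	return annotations
}

func main() {
	ann := map[string]string{
		appliedPlacementKey: `{"clusterAffinity":{"clusterNames":["member1"]}}`,
		"other":             "kept",
	}
	fmt.Println(clearAppliedPlacement(ann))
}
```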

RainbowMango commented 2 years ago

Too tricky I guess.

dddddai commented 2 years ago

IMHO, the scheduler should reschedule the bindings on cluster change (e.g. cluster joined, cluster unjoined, cluster label changed...)

I added a cluster queue in the scheduler to handle cluster events and fix this issue, and it works fine; please see https://github.com/dddddai/karmada/commit/568b8700e5d2c7de5904a421c2f1a7bcc0ff1648
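The idea of a dedicated cluster queue can be sketched as follows. This is an assumed, self-contained shape, not the linked commit: cluster join/unjoin events are enqueued, and a worker drains the queue and collects the bindings that need rescheduling.

```go
package main

import "fmt"

// clusterEvent models a cluster change; kind is "join" or "unjoin".
type clusterEvent struct {
	cluster string
	kind    string
}

// scheduler is a toy stand-in; the queue field plays the role of a
// client-go workqueue, and bindings maps binding name -> bound clusters.
type scheduler struct {
	queue    []clusterEvent
	bindings map[string][]string
}

// onClusterChange is what a cluster informer's event handler would call.
func (s *scheduler) onClusterChange(e clusterEvent) {
	s.queue = append(s.queue, e)
}

// drain processes queued cluster events and returns the bindings that
// need rescheduling.
func (s *scheduler) drain() []string {
	var toReschedule []string
	for len(s.queue) > 0 {
		e := s.queue[0]
		s.queue = s.queue[1:]
		for name, clusters := range s.bindings {
			if e.kind == "join" {
				// a new cluster may now fit the placement: reschedule all
				toReschedule = append(toReschedule, name)
				continue
			}
			for _, c := range clusters {
				if c == e.cluster { // an unjoined cluster is still bound
					toReschedule = append(toReschedule, name)
				}
			}
		}
	}
	return toReschedule
}

func main() {
	s := &scheduler{bindings: map[string][]string{"nginx": {"member1"}}}
	s.onClusterChange(clusterEvent{cluster: "member1", kind: "unjoin"})
	fmt.Println(s.drain()) // [nginx]
}
```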

RainbowMango commented 2 years ago

IMHO, the scheduler should reschedule the bindings on cluster change (e.g. cluster joined, cluster unjoined, cluster label changed...)

+1, but I'm not sure about cluster label changes; being too sensitive isn't a good thing.

I'll take a look and comment on your commit.

dddddai commented 2 years ago

but I'm not sure about cluster label changes; being too sensitive isn't a good thing.

For example, there is a propagation policy whose cluster affinity label selector is foo: bar

Do you mean we should keep the binding scheduled to the cluster though the label foo: bar is removed from that cluster?
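The question can be made concrete with a toy match-labels check (a sketch only; the real scheduler uses apimachinery label selectors for ClusterAffinity):

```go
package main

import "fmt"

// matchesAffinity is a minimal match-labels check: every key/value in
// the selector must be present on the cluster.
func matchesAffinity(selector, clusterLabels map[string]string) bool {
	for k, v := range selector {
		if clusterLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"foo": "bar"}
	cluster := map[string]string{"foo": "bar", "region": "us"}
	fmt.Println(matchesAffinity(selector, cluster)) // true

	// Remove foo: bar from the cluster, as in the example above.
	delete(cluster, "foo")
	// false: should the binding stay scheduled to this cluster anyway?
	fmt.Println(matchesAffinity(selector, cluster))
}
```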

RainbowMango commented 2 years ago

I think the scenario might be handled by descheduler.

dddddai commented 2 years ago

Well, I'm not familiar with the descheduler, but in this picture it seems that the descheduler does not watch cluster events; it just watches workloads and might reschedule them to other clusters due to resource insufficiency on the original cluster. It focuses on ScaleSchedule rather than Reschedule. Please correct me if I'm wrong.

RainbowMango commented 2 years ago

We might discuss the descheduler more at the community meeting. Hope you can come; it would be great to meet you there.

dddddai commented 2 years ago

OK, I'll be there :)

dddddai commented 2 years ago

To be clear, I have 4 questions:

  1. Shall we reschedule bindings when a cluster's fields/labels change? (because the updated cluster might, or might no longer, fit the propagation policy)
  2. Should FailoverSchedule work when unjoining a cluster?
  3. What does SpreadConstraint.MinGroups mean? Does it mean we should not propagate the resource unless the group count >= MinGroups? If yes, shall we delete all of binding.spec.clusters when unjoining a cluster causes the group count to drop below MinGroups?

The key is: shall we always keep the consistency between propagation policy and binding.spec.clusters?

mrlihanbo commented 2 years ago

Looking forward to seeing the progress of this issue

RainbowMango commented 2 years ago

@mrlihanbo The #967 is waiting for your review.

mrlihanbo commented 2 years ago

@mrlihanbo The #967 is waiting for your review.

I will review the pr now

dddddai commented 2 years ago

Looking forward to seeing the progress of this issue

Hi @mrlihanbo, before implementing this we have to answer the 4 questions above

dddddai commented 2 years ago

Hello @RainbowMango, any ideas about these questions?

RainbowMango commented 2 years ago

Shall we reschedule bindings when a cluster's fields/labels change? (because the updated cluster might, or might no longer, fit the propagation policy)

I think we should handle this scenario very carefully; it might bring drastic changes. Take Kubernetes as an example: after a pod has been scheduled to a node, it will not get re-scheduled even in the case of node label changes.

RainbowMango commented 2 years ago

Should FailoverSchedule work when unjoining a cluster?

I think we should re-schedule the workload after one of the bonded clusters is unjoined. It doesn't have to be FailoverSchedule(not sure).

RainbowMango commented 2 years ago

What does SpreadConstraint.MinGroups mean? Does it mean we should not propagate the resource unless the group count >= MinGroups?

Yes, you are right. SpreadConstraint.MinGroups restricts the minimum number of cluster groups; if the scheduler cannot find enough cluster groups, scheduling should fail.
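That semantics can be sketched as a simple feasibility check (illustrative only, not Karmada's group-selection code; MaxGroups handling is an assumption added for symmetry):

```go
package main

import "fmt"

// selectGroups fails when fewer groups than minGroups are available,
// matching the MinGroups semantics described above. A maxGroups of 0
// means "no upper bound".
func selectGroups(available []string, minGroups, maxGroups int) ([]string, error) {
	if len(available) < minGroups {
		return nil, fmt.Errorf("need at least %d cluster groups, have %d",
			minGroups, len(available))
	}
	if maxGroups > 0 && len(available) > maxGroups {
		available = available[:maxGroups]
	}
	return available, nil
}

func main() {
	// Two clusters satisfy MinGroups=2.
	groups, err := selectGroups([]string{"member1", "member2"}, 2, 0)
	fmt.Println(groups, err)

	// After unjoining member1, only one group remains: scheduling fails.
	_, err = selectGroups([]string{"member2"}, 2, 0)
	fmt.Println(err != nil)
}
```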

If yes, shall we delete all binding.spec.clusters when unjoining a cluster which causes group count < MinGroups?

I don't think so, just like the answer above, it's too dangerous, at least for now.

dddddai commented 2 years ago

I see, so I guess we should reschedule workloads only when a cluster is joined/unjoined, right?

RainbowMango commented 2 years ago

Let's focus on the scenario of cluster-unjoin for now. The workload should be re-scheduled after one of the bound clusters is unjoined. If no more clusters fit the propagation policy, just remove the unjoined cluster from the binding object.
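The agreed behavior can be sketched as follows (a simplified model with illustrative names, not Karmada's implementation): drop the unjoined cluster from the bound list, then fill from whatever candidates still fit the policy; if there are none, the result is simply the bound list minus the unjoined cluster.

```go
package main

import "fmt"

// rescheduleOnUnjoin removes the unjoined cluster from the bound list
// and appends any policy-fitting candidates that are not already bound.
// With no candidates, it degenerates to just removing the unjoined
// cluster, as described above.
func rescheduleOnUnjoin(bound []string, unjoined string, candidates []string) []string {
	var kept []string
	for _, c := range bound {
		if c != unjoined {
			kept = append(kept, c)
		}
	}
	for _, cand := range candidates {
		dup := false
		for _, k := range kept {
			if k == cand {
				dup = true
				break
			}
		}
		if !dup {
			kept = append(kept, cand)
		}
	}
	return kept
}

func main() {
	// member1 is unjoined and no other cluster fits the policy:
	fmt.Println(rescheduleOnUnjoin([]string{"member1", "member2"}, "member1", nil))
	// member1 is unjoined and member3 fits the policy:
	fmt.Println(rescheduleOnUnjoin([]string{"member1"}, "member1", []string{"member3"}))
}
```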

mrlihanbo commented 2 years ago

The workload should be re-scheduled after one of the bound clusters is unjoined. If no more clusters fit the propagation policy, just remove the unjoined cluster from the binding object.

There is a scenario to consider if the workload is re-scheduled after one of the bound clusters is unjoined:

we should make sure that the behavior is what we expected.

mrlihanbo commented 2 years ago

I see, so I guess we should reschedule workloads only when a cluster is joined/unjoined, right?

@dddddai I took a glance at the FailoverSchedule func. Maybe we can do it in this func. Just a suggestion.

dddddai commented 2 years ago

@dddddai I took a glance at the FailoverSchedule func. Maybe we can do it in this func. Just a suggestion.

Thanks for digging into it; that's exactly what I did in #1049. The behavior of Reschedule is the same as FailoverSchedule.

mrlihanbo commented 2 years ago

@dddddai I took a glance at the FailoverSchedule func. Maybe we can do it in this func. Just a suggestion.

Thanks for digging into it; that's exactly what I did in #1049. The behavior of Reschedule is the same as FailoverSchedule.

Good job, I will review the pr ASAP.

RainbowMango commented 2 years ago

/assign @RainbowMango @huone1

karmada-bot commented 2 years ago

@RainbowMango: GitHub didn't allow me to assign the following users: huone1.

Note that only karmada-io members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/karmada-io/karmada/issues/829#issuecomment-1015038188):

> /assign @RainbowMango @huone1

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
huone1 commented 2 years ago

Let me work on it with you @dddddai

dddddai commented 2 years ago

@huone1 Thank you!

duanmengkk commented 2 years ago

Just as issue #1644 mentioned: in the scenario of a new cluster joining, should the RB be rescheduled? @dddddai

dddddai commented 2 years ago

I was thinking so. Would ask @RainbowMango for comments.

duanmengkk commented 2 years ago

I think it's similar to Kubernetes. Just as the descheduler (https://github.com/kubernetes-sigs/descheduler) docs suggest, the descheduler should be responsible for handling rescheduling on cluster status changes, cluster join, and cluster unjoin.