karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

Reschedule ResourceBinding when adding a cluster #2301

Closed chaunceyjiang closed 1 year ago

chaunceyjiang commented 1 year ago

Signed-off-by: chaunceyjiang chaunceyjiang@gmail.com

What type of PR is this? /kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes: Fixes #2261

Special notes for your reviewer: When a new cluster joins, if the Placement is empty or the replicaSchedulingType is Duplicated, resources will be propagated to the new cluster.

Does this PR introduce a user-facing change?:

`karmada-scheduler`: Now the scheduler starts to re-schedule in case of cluster state changes.
RainbowMango commented 1 year ago

Is it ready for review now? Please assign it to me once it gets ready.

chaunceyjiang commented 1 year ago

Is it ready for review now? Please assign it to me once it gets ready.

Ready!

chaunceyjiang commented 1 year ago

/assign @Garrybest @XiShanYongYe-Chang @RainbowMango

RainbowMango commented 1 year ago

Got it. Thanks. I can look at it tomorrow.

chaunceyjiang commented 1 year ago
This E2E test failed in this case, but in my own GitHub Actions run it passes.

RainbowMango commented 1 year ago

I see your commit is based on the commits from 7.30. Rebase your code and try again?

Actually, I re-triggered the E2E once this morning.

chaunceyjiang commented 1 year ago

Rebase your code and try again?

ok

Maybe my code is not robust enough. Let me think it through again.

chaunceyjiang commented 1 year ago

/assign @Garrybest @XiShanYongYe-Chang @RainbowMango @dddddai

chaunceyjiang commented 1 year ago

Can we have a talk at the next meeting?

ok

Garrybest commented 1 year ago

Hi @chaunceyjiang, this PR is closely related to #2391; I hope these two PRs can work together to solve the rescheduling issues.

We have decided to move Failover from the scheduler to the controller-manager, so when a cluster becomes not-ready or unreachable for a period, the controller-manager will add a NoExecute taint on it, which triggers an eviction procedure from that cluster. #2391 removes Failover from the scheduler; however, we still need a rescheduling procedure to make Failover complete.

In conclusion, I hope this PR covers not only adding a cluster but also other circumstances. Here are some ideas (see the sketch after this list).

  1. More trigger events should be considered.

    • When a new cluster is added, we reschedule all RB/CRBs whose PP/CPP corresponds to the new cluster.
    • When cluster labels or other cluster fields change, we reschedule all RB/CRBs whose PP/CPP corresponds to the updated cluster.
  2. We could schedule an RB/CRB every time in doScheduleBinding and doScheduleClusterBinding if its ReplicaSchedulingType is Duplicated.

    • When the scheduler pops an RB/CRB from the queue, it does not know whether a Duplicated RB/CRB should be rescheduled, because the trigger events vary: cluster labels changed, some clusters were evicted from the RB/CRB, a new cluster was just added, and so on. When Failover happens, the controller-manager now evicts clusters from the RB/CRB, but if the SpreadConstraints no longer match, we still need the scheduler to reschedule. So always scheduling an RB/CRB with the Duplicated policy whenever an event is received may be a simple and efficient solution.
    • BTW, we don't need to care about RB/CRBs with the Divided policy: when spec.replicas equals the sum of the target scheduled replicas, we will never reschedule that RB/CRB.
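A minimal illustrative sketch of idea 1 (enqueue the RB/CRBs whose placement corresponds with an added or updated cluster). The Cluster/Binding types and the label-selector matching below are simplified stand-ins, not Karmada's real API:

```go
// Illustrative sketch only: simplified stand-in types, not Karmada's real API.
package main

import "fmt"

type Cluster struct {
	Name   string
	Labels map[string]string
}

type Binding struct {
	Name string
	// ClusterAffinity of the matched PP/CPP, reduced to a label selector for this sketch.
	LabelSelector map[string]string
}

// affinityMatches reports whether the binding's placement could select the cluster.
func affinityMatches(b Binding, c Cluster) bool {
	for k, v := range b.LabelSelector {
		if c.Labels[k] != v {
			return false
		}
	}
	return true
}

// onClusterEvent enqueues every RB/CRB whose placement corresponds with the
// added or updated cluster, so the scheduler can reconsider it.
func onClusterEvent(c Cluster, bindings []Binding, enqueue func(Binding)) {
	for _, b := range bindings {
		if affinityMatches(b, c) {
			enqueue(b)
		}
	}
}

func main() {
	newCluster := Cluster{Name: "member3", Labels: map[string]string{"region": "us"}}
	bindings := []Binding{
		{Name: "nginx-rb", LabelSelector: map[string]string{"region": "us"}},
		{Name: "redis-rb", LabelSelector: map[string]string{"region": "eu"}},
	}
	onClusterEvent(newCluster, bindings, func(b Binding) { fmt.Println("enqueue", b.Name) })
}
```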

/cc @RainbowMango @chaunceyjiang @dddddai @XiShanYongYe-Chang

karmada-bot commented 1 year ago

@Garrybest: GitHub didn't allow me to request PR reviews from the following users: chaunceyjiang.

Note that only karmada-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to [this](https://github.com/karmada-io/karmada/pull/2301#issuecomment-1219554645).
RainbowMango commented 1 year ago

We could schedule an RB/CRB every time in doScheduleBinding and doScheduleClusterBinding if its ReplicaSchedulingType is Duplicated.

BTW, we don't need to care about RB/CRBs with the Divided policy: when spec.replicas equals the sum of the target scheduled replicas, we will never reschedule that RB/CRB.

Can we just put all related RB/CRBs into the queue and not care whether the ReplicaSchedulingType is Duplicated or not? For the Divided scenario, does that mean it's a no-op?

Garrybest commented 1 year ago

Putting them into the queue does not trigger scheduling; currently we only reschedule when the placement or replicas change, please see here.
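A minimal sketch of the decision described above, with simplified stand-in types rather than Karmada's real ResourceBinding/Placement structs; it contrasts the current triggers (placement or replicas changed) with the proposal to always schedule Duplicated bindings:

```go
// Illustrative sketch: stand-in types, not Karmada's real structs.
package main

import "fmt"

type SchedulingType string

const (
	Duplicated SchedulingType = "Duplicated"
	Divided    SchedulingType = "Divided"
)

type Binding struct {
	PlacementChanged bool           // placement differs from the last observed one
	ReplicasChanged  bool           // spec.replicas differs from the scheduled sum
	SchedulingType   SchedulingType // from the matched PP/CPP
}

// shouldSchedule mirrors the idea in the discussion: today only placement or
// replica changes trigger scheduling; the proposal is to additionally always
// schedule a Duplicated binding whenever it is popped from the queue.
func shouldSchedule(b Binding) bool {
	if b.PlacementChanged || b.ReplicasChanged {
		return true
	}
	return b.SchedulingType == Duplicated
}

func main() {
	fmt.Println(shouldSchedule(Binding{SchedulingType: Duplicated})) // true
	fmt.Println(shouldSchedule(Binding{SchedulingType: Divided}))    // false
}
```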

chaunceyjiang commented 1 year ago

When cluster labels or other cluster fields change, we reschedule all RB/CRBs whose PP/CPP corresponds to the updated cluster.

Do you mean to reschedule RB/CRBs when the label of the member cluster changes?

Garrybest commented 1 year ago

Right, based on #2391, this PR could be simplified. When a cluster is added or cluster labels change, we only need to add the corresponding RB/CRBs into the queue; Duplicated RB/CRBs will always be scheduled.

Garrybest commented 1 year ago

/assign

XiShanYongYe-Chang commented 1 year ago

@chaunceyjiang @RainbowMango @Garrybest

The CI E2E test failed with:

2022-08-25T04:54:23.435597696Z stderr F I0825 04:54:23.435529       1 api_enablement.go:38] Cluster(member-e2e-7xk) not fit as missing API(apps/v1, kind=Deployment)
2022-08-25T04:54:23.435998391Z stderr F I0825 04:54:23.435956       1 generic_scheduler.go:108] cluster "member-e2e-7xk" is not fit, reason: no such API resource

When the scheduler detects the cluster-create event, the cluster's API information has not been collected yet. As a result, the scheduler filters out the new cluster.

Currently, the modified logic does not watch cluster status updates, so rescheduling is not triggered after the cluster's API information has been collected.

Therefore, we need to adjust which cluster events we react to; a sketch follows below.
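A rough sketch of one possible adjustment, assuming a simplified cluster status: treat a change in the collected API enablements as a scheduling-relevant update event. The types and field names are illustrative, not Karmada's real ones:

```go
// Illustrative sketch: simplified cluster status, not Karmada's real types.
package main

import (
	"fmt"
	"reflect"
)

type Cluster struct {
	Name           string
	APIEnablements []string // e.g. "apps/v1/Deployment", simplified
}

// apiEnablementChanged reports whether a cluster status update added or removed
// API enablements; such an update should re-enqueue the affected RB/CRBs.
func apiEnablementChanged(old, new Cluster) bool {
	return !reflect.DeepEqual(old.APIEnablements, new.APIEnablements)
}

func main() {
	before := Cluster{Name: "member-e2e-7xk"}
	after := Cluster{Name: "member-e2e-7xk", APIEnablements: []string{"apps/v1/Deployment"}}
	fmt.Println(apiEnablementChanged(before, after)) // true -> trigger rescheduling
}
```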

RainbowMango commented 1 year ago

Thanks @XiShanYongYe-Chang, that's the root cause.

chaunceyjiang commented 1 year ago

Thank you very much @XiShanYongYe-Chang

But why does #2427 also make this PR pass the E2E test?

https://github.com/chaunceyjiang/karmada/actions/runs/2925842870

RainbowMango commented 1 year ago

But why does https://github.com/karmada-io/karmada/pull/2427 also make this PR pass the E2E test?

It seems the failing E2E is in this patch, not in #2427.

XiShanYongYe-Chang commented 1 year ago

But why does #2427 also make this PR pass the E2E test?

Because the taint is removed (after the cluster status collection is complete), the scheduler's re-scheduling is triggered, which produces the desired effect.

But adding taints should be done cautiously and reasonably: the cluster is reachable when it has just joined, but its status has not been synchronized yet.

chaunceyjiang commented 1 year ago

Because the taint is removed (after the cluster status collection is complete), the scheduler's re-scheduling is triggered, which produces the desired effect.

Yes, this is what happens when I test locally.

but its status has not been synchronized yet.

https://github.com/karmada-io/karmada/blob/2daaa2d8c7984fe96f224d7440da6b2fb04111c3/pkg/controllers/cluster/cluster_controller.go#L608-L610

That's why I think that when currentReadyCondition is nil, we should add a NotReady taint to the cluster.
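A hedged sketch of that suggestion, using simplified stand-in types (the taint key shown is illustrative): while the Ready condition has not been collected yet, keep a not-ready taint on the cluster so that TaintToleration, rather than APIEnablement, filters it out.

```go
// Illustrative sketch: stand-in types; the taint key is an assumption, not
// necessarily Karmada's real constant.
package main

import "fmt"

type Taint struct {
	Key    string
	Effect string
}

type Cluster struct {
	Name           string
	ReadyCondition *bool // nil means the status has not been collected yet
	Taints         []Taint
}

// ensureNotReadyTaint adds a NoSchedule not-ready taint while the cluster's
// Ready condition is still unknown, so the scheduler skips it via TaintToleration.
func ensureNotReadyTaint(c *Cluster) {
	if c.ReadyCondition != nil {
		return
	}
	for _, t := range c.Taints {
		if t.Key == "cluster.karmada.io/not-ready" {
			return
		}
	}
	c.Taints = append(c.Taints, Taint{Key: "cluster.karmada.io/not-ready", Effect: "NoSchedule"})
}

func main() {
	c := &Cluster{Name: "member3"}
	ensureNotReadyTaint(c)
	fmt.Println(c.Taints)
}
```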

chaunceyjiang commented 1 year ago

In other words, newly joined clusters should be filtered out by the scheduling plugin TaintToleration instead of the APIEnablement plugin.

RainbowMango commented 1 year ago

In other words, newly joined clusters should be filtered out by the scheduling plugin TaintToleration instead of the APIEnablement plugin.

+1.

Can we continue the discussion on #2427? I'm afraid we can't reach a consensus quickly. The question would be: should a new cluster get a default taint until it becomes ready?

RainbowMango commented 1 year ago

Given that the new release is coming soon, I hope this patch can be included; can we move forward with it?

chaunceyjiang commented 1 year ago

should a new cluster get a default taint until it becomes ready?

Yes, that's what I meant to say.

RainbowMango commented 1 year ago

@chaunceyjiang Do you want to wait for #2427?

Can you update the patch in the tradeoff way like https://github.com/karmada-io/karmada/pull/2301#discussion_r954940371 and have a try?

chaunceyjiang commented 1 year ago

@chaunceyjiang Do you want to wait for https://github.com/karmada-io/karmada/pull/2427?

yes.

Can you update the patch in the tradeoff way like https://github.com/karmada-io/karmada/pull/2301#discussion_r954940371 and have a try?

I can try.

RainbowMango commented 1 year ago

@chaunceyjiang Now that #2427 has been merged, could you please revise and rebase your code?

karmada-bot commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/karmada-io/karmada/blob/master/OWNERS)~~ [RainbowMango]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.