Closed chaunceyjiang closed 1 year ago
Is it ready for review now? Please assign it to me once it gets ready.
Ready!
/assign @Garrybest @XiShanYongYe-Chang @RainbowMango
Got it. Thanks. I can look at it tomorrow.
This E2E test failed in the CI.
But in my GitHub Actions run it succeeds.
I see your commit is based on commits from 7.30. Could you rebase your code and try again?
Actually, I re-triggered the E2E once this morning.
Rebase your code and try again?
ok
Maybe my code is not robust enough. Let me think again
/assign @Garrybest @XiShanYongYe-Chang @RainbowMango @dddddai
can we have a talk at next meeting?
ok
Hi @chaunceyjiang, this PR has a close relation with #2391, I hope these 2 PRs make a cooperation together to solve the rescheduling issues.
Now we have decided to move Failover from the scheduler to the controller-manager, so when a cluster becomes not-ready or unreachable for a period, the controller-manager adds a NoExecute taint on it, which triggers an eviction procedure from this cluster. Now #2391 tries to remove Failover from the scheduler; however, we still need a rescheduling procedure to make Failover complete.
In conclusion, I hope this PR is not only for adding a cluster but also for more circumstances. Here are some ideas.
More trigger events should be considered. We could schedule RB/CRB every time in `doScheduleBinding` and `doScheduleClusterBinding` if `ReplicaSchedulingType` is `Duplicated`, because the trigger events are various, e.g. cluster label changes, some clusters are evicted from RB/CRB, a new cluster has just been added, etc. When Failover happens, the controller-manager now evicts clusters from RB/CRB, but if `SpreadConstraints` does not match, we still need the scheduler to reschedule again. So it may be a simple and efficient solution that we always schedule RB/CRB with the `Duplicated` policy when events are received. BTW, we don't need to care about RB/CRB with the `Divided` policy: when `spec.replicas` equals the sum of target scheduled replicas, we will never reschedule this RB/CRB.

/cc @RainbowMango @chaunceyjiang @dddddai @XiShanYongYe-Chang
@Garrybest: GitHub didn't allow me to request PR reviews from the following users: chaunceyjiang.
Note that only karmada-io members and repo collaborators can review this PR, and authors cannot review their own PRs.
We could schedule RB/CRB every time in doScheduleBinding and doScheduleClusterBinding if ReplicaSchedulingType is Duplicated.
BTW, we don't need to care about RB/CRB with Divided policy. When spec.replicas equals to sum of target scheduled replicas, we will never reschedule this RB/CRB.
Can we just put all related RB/CRB into the queue, and not care whether their `ReplicaSchedulingType` is `Duplicated` or not? For the `Divided` scenario, does it mean a no-op?
Putting them into the queue does not trigger scheduling; currently we only reschedule when placement or replicas change, please see here.
When labels or other cluster fields change, we reschedule all RB/CRBs whose PP/CPP corresponds to the updated cluster.
Do you mean to reschedule RB/CRBs when the label of the member cluster changes?
Right, based on #2391, this PR could be simplified. When a cluster is added or cluster labels change, we only need to add the corresponding RB/CRB into the queue. A Duplicated RB/CRB will always be scheduled every time.
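The enqueue reaction described here can be sketched as follows. This is an illustrative sketch only: `bindingsToEnqueue`, the policy map shape, and exact-match label semantics are assumptions, not the real PP/CPP matching code.

```go
package main

import "fmt"

// labelsMatch reports whether the cluster's labels satisfy a policy's
// label selector (simplified exact-match semantics, for illustration).
func labelsMatch(selector, clusterLabels map[string]string) bool {
	for k, v := range selector {
		if clusterLabels[k] != v {
			return false
		}
	}
	return true
}

// bindingsToEnqueue sketches the reaction to a cluster add/label-change
// event: collect the bindings whose policy selects the updated cluster so
// they can be pushed into the scheduling queue.
func bindingsToEnqueue(policies map[string]map[string]string, clusterLabels map[string]string) []string {
	var out []string
	for binding, selector := range policies {
		if labelsMatch(selector, clusterLabels) {
			out = append(out, binding)
		}
	}
	return out
}

func main() {
	// Hypothetical policies: binding name -> cluster label selector.
	policies := map[string]map[string]string{
		"nginx-rb": {"region": "us"},
		"redis-rb": {"region": "eu"},
	}
	newCluster := map[string]string{"region": "us", "zone": "us-1"}
	fmt.Println(bindingsToEnqueue(policies, newCluster)) // [nginx-rb]
}
```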
/assign
@chaunceyjiang @RainbowMango @Garrybest
The CI e2e test failed for:
2022-08-25T04:54:23.435597696Z stderr F I0825 04:54:23.435529 1 api_enablement.go:38] Cluster(member-e2e-7xk) not fit as missing API(apps/v1, kind=Deployment)
2022-08-25T04:54:23.435998391Z stderr F I0825 04:54:23.435956 1 generic_scheduler.go:108] cluster "member-e2e-7xk" is not fit, reason: no such API resource
When the scheduler detects the cluster create event, the cluster's API information has not been collected yet. As a result, the scheduler filters out the new cluster.
Currently, the modified logic does not watch for cluster status updates, so rescheduling is not triggered after the cluster API collection completes.
Therefore, we need to adjust which cluster events the scheduler reacts to.
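The event adjustment suggested above can be sketched as an update-event predicate. This is a simplified illustration: `ClusterSnapshot` and `worthReschedule` are hypothetical names, not Karmada's real event-handler code.

```go
package main

import "fmt"

// ClusterSnapshot is a hypothetical snapshot of the cluster fields the
// scheduler cares about; not the real Cluster API type.
type ClusterSnapshot struct {
	Ready          bool
	APIEnablements []string // e.g. "apps/v1, kind=Deployment"
}

// worthReschedule sketches the adjustment described above: a plain Add event
// arrives before the cluster's API list is collected, so the scheduler must
// also react to Update events where the ready condition or the collected API
// enablements changed.
func worthReschedule(oldC, newC ClusterSnapshot) bool {
	if oldC.Ready != newC.Ready {
		return true
	}
	if len(oldC.APIEnablements) != len(newC.APIEnablements) {
		return true
	}
	for i := range oldC.APIEnablements {
		if oldC.APIEnablements[i] != newC.APIEnablements[i] {
			return true
		}
	}
	return false
}

func main() {
	before := ClusterSnapshot{Ready: false}
	after := ClusterSnapshot{Ready: true, APIEnablements: []string{"apps/v1, kind=Deployment"}}
	fmt.Println(worthReschedule(before, after)) // API collection finished: reschedule
}
```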
Thanks @XiShanYongYe-Chang, That's the root cause.
Thank you very much @XiShanYongYe-Chang
But why can #2427 also make this PR pass the E2E test?
https://github.com/chaunceyjiang/karmada/actions/runs/2925842870
Seems the failing E2E is in this patch, not in #2427.
But why can #2427 also make this PR pass the E2E test?
Due to the removal of the taint (after the cluster status collection is complete), the re-scheduling of the scheduler will be triggered, which will have the desired effect.
But taints should be added cautiously and reasonably: the cluster is reachable when it has just joined, but its status has not been synchronized yet.
yes, this is what happens when I test locally
but the status is not synchronized.
That's why I think when `currentReadyCondition` is `nil`, we should add a `NotReady` taint to the cluster.
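The taint-management idea above can be sketched as an idempotent ensure/remove helper. This is a hedged sketch: `ensureNotReadyTaint` and the taint key are illustrative assumptions, not Karmada's actual controller code.

```go
package main

import "fmt"

// Taint is a minimal stand-in; the real type lives in the Karmada API packages.
type Taint struct {
	Key    string
	Effect string
}

// ensureNotReadyTaint sketches the idea above: while the cluster's Ready
// condition has not been reported yet (currentReadyCondition == nil), keep a
// NotReady NoSchedule taint on the cluster so newly joined clusters are
// filtered by TaintToleration rather than failing in APIEnablement; once the
// condition is known, drop the placeholder taint.
func ensureNotReadyTaint(taints []Taint, readyConditionKnown bool) []Taint {
	const key = "cluster.karmada.io/not-ready" // illustrative key name
	if readyConditionKnown {
		// Ready condition reported: remove the placeholder taint.
		var out []Taint
		for _, t := range taints {
			if t.Key != key {
				out = append(out, t)
			}
		}
		return out
	}
	for _, t := range taints {
		if t.Key == key {
			return taints // already tainted; keep idempotent
		}
	}
	return append(taints, Taint{Key: key, Effect: "NoSchedule"})
}

func main() {
	taints := ensureNotReadyTaint(nil, false)
	fmt.Println(len(taints)) // 1: new cluster gets the NotReady taint
	taints = ensureNotReadyTaint(taints, true)
	fmt.Println(len(taints)) // 0: taint removed once status is collected
}
```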
In other words, newly joined clusters should be filtered out by the scheduling plugin `TaintToleration` instead of the `APIEnablement` plugin.
In other words, newly joined clusters should be filtered out by the scheduling plugin TaintToleration instead of the APIEnablement plugin.
+1.
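The filtering side of this agreement can be sketched as the core check of a taint-toleration plugin. This is an illustrative simplification (key-only matching; `tolerates` is a hypothetical name), not the real `TaintToleration` plugin.

```go
package main

import "fmt"

// Taint and Toleration are minimal stand-ins for the real API types.
type Taint struct {
	Key    string
	Effect string
}

type Toleration struct {
	Key string
}

// tolerates sketches the TaintToleration filter's core check: a cluster
// passes only if every taint on it is matched by some toleration on the
// binding (simplified to key-only matching for illustration).
func tolerates(taints []Taint, tolerations []Toleration) bool {
	for _, taint := range taints {
		matched := false
		for _, tol := range tolerations {
			if tol.Key == taint.Key {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative taint key; a newly joined cluster carries it until ready.
	notReady := []Taint{{Key: "cluster.karmada.io/not-ready", Effect: "NoSchedule"}}
	fmt.Println(tolerates(notReady, nil)) // false: new cluster filtered out
	fmt.Println(tolerates(nil, nil))      // true: ready cluster passes
}
```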
Can we continue the discussion on #2427? I'm afraid we can't reach a consensus quickly.
The discussion would be: should a new cluster get a default taint until it becomes ready?
Given the new release is coming soon, I hope this patch can be included; can we move forward with it?
should a new cluster get a default taint until it becomes ready?
Yes, that's what I want to say
@chaunceyjiang Do you want to wait for #2427?
Can you update the patch in the tradeoff way like https://github.com/karmada-io/karmada/pull/2301#discussion_r954940371 and have a try?
@chaunceyjiang Do you want to wait for https://github.com/karmada-io/karmada/pull/2427?
yes.
Can you update the patch in the tradeoff way like https://github.com/karmada-io/karmada/pull/2301#discussion_r954940371 and have a try?
I can try.
@chaunceyjiang Now #2427 has been merged. Could you please revise and rebase your code?
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: RainbowMango
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Signed-off-by: chaunceyjiang chaunceyjiang@gmail.com
What type of PR is this? /kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes: Fixes #2261
Special notes for your reviewer: When a new cluster is joined, if the `Placement` is empty or the `replicaSchedulingType` is `Duplicated`, resources will be propagated to the new cluster.

Does this PR introduce a user-facing change?: