/cc @chaunceyjiang @lxtywypc @zach593 Can you help take a look?
How about letting the execution-controller watch the cluster object's unready -> ready transition? If we are worried about enqueuing too often, lastTransitionTime
could help determine how long the cluster was unavailable.
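For illustration, here is a minimal Go sketch of that idea, assuming the Cluster object exposes a standard Ready condition; the condition type string and the helper below are assumptions, not Karmada's actual code. It only reacts when the condition flips to True and reads LastTransitionTime to estimate how long the cluster was down.

```go
package sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clusterReadyCondition is an assumption about the condition type on the Cluster object.
const clusterReadyCondition = "Ready"

// becameReady reports whether the Ready condition flipped from unready to ready in
// this update, and roughly how long the cluster was unavailable based on the old
// condition's LastTransitionTime.
func becameReady(oldConds, newConds []metav1.Condition) (bool, time.Duration) {
	oldReady := meta.FindStatusCondition(oldConds, clusterReadyCondition)
	newReady := meta.FindStatusCondition(newConds, clusterReadyCondition)

	if newReady == nil || newReady.Status != metav1.ConditionTrue {
		return false, 0 // still not ready
	}
	if oldReady != nil && oldReady.Status == metav1.ConditionTrue {
		return false, 0 // was already ready, no transition
	}

	var downFor time.Duration
	if oldReady != nil {
		downFor = time.Since(oldReady.LastTransitionTime.Time)
	}
	return true, downFor
}
```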
After the cluster is recovered, will all resources on the cluster be resynchronized? If so, will the pressure of the controller suddenly surge?
What does this strategy specifically refer to? https://github.com/karmada-io/karmada/pull/3808#discussion_r1296588633
> After the cluster is recovered, will all resources on the cluster be resynchronized? If so, will the pressure of the controller suddenly surge?
I have been thinking for a while that we could make objectWatcher use DeepEqual()
to check whether the content has actually changed (mutating webhooks and API compatibility between Karmada and member clusters might be a problem here); that would reduce the frequency of updates to member clusters while the execution-controller is reconciling.
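A rough sketch of the DeepEqual idea (a hypothetical helper, not the real objectWatcher): compare only the managed fields of the desired and observed objects, so that defaulting and mutating webhooks on the member cluster do not make every comparison report a change.

```go
package sketch

import (
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// needsUpdate compares only the "spec" of the desired and observed objects,
// ignoring metadata and status that the member cluster's defaulting or mutating
// webhooks may have filled in. Real code would need to cover exactly the fields
// Karmada manages, which is the tricky part mentioned above.
func needsUpdate(desired, observed *unstructured.Unstructured) bool {
	desiredSpec, _, _ := unstructured.NestedMap(desired.Object, "spec")
	observedSpec, _, _ := unstructured.NestedMap(observed.Object, "spec")
	return !apiequality.Semantic.DeepEqual(desiredSpec, observedSpec)
}
```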
> What does this strategy specifically refer to? #3808 (comment)
This rate-limits every affected item, and clearing items from the rate limiter may take a long time. Also, one day we might let users control the rate limiter options and the number of async worker retries (or simply switch to controller-runtime); in that case, a recovery mechanism that relies on such dynamic parameters is unreliable.
On the other hand, there is not much difference between watching cluster objects and the async worker / controller-runtime retry mechanism: both of them trigger the execution-controller to reconcile.
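For context, this is roughly the standard client-go retry pattern under discussion; maxRetries and reconcile are placeholders, not Karmada's actual AsyncWorker code. Once NumRequeues passes the cutoff the item is dropped, which is why recovery ends up depending on how these parameters are tuned.

```go
package sketch

import "k8s.io/client-go/util/workqueue"

const maxRetries = 15 // placeholder cutoff

// processNextItem shows how a max-retry cutoff interacts with the rate limiter:
// a failing item is requeued with growing backoff until the cutoff, then dropped.
func processNextItem(queue workqueue.RateLimitingInterface, reconcile func(key interface{}) error) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(key)

	if err := reconcile(key); err != nil {
		if queue.NumRequeues(key) < maxRetries {
			queue.AddRateLimited(key) // retry later with per-item exponential backoff
		} else {
			queue.Forget(key) // cutoff reached: the item is dropped and never retried
		}
		return true
	}
	queue.Forget(key) // success: clear the item's failure history from the rate limiter
	return true
}
```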
/cc @XiShanYongYe-Chang @lxtywypc @RainbowMango
A new version is about to be released. If this issue is not resolved, I think we may need to revert #3808.
@XiShanYongYe-Chang @lxtywypc What's your opinion?
My thoughts may not be correct, but my view is that the scope of this issue can be controlled. Can we consider incorporating it into the next version and addressing the problems mentioned in the current issue? Please correct me if I'm wrong.
According to the description in https://github.com/karmada-io/karmada/issues/3999, the finalizer of the Work was not removed correctly, which causes the ExecutionSpace
not to be deleted and thus prevents the cluster from being removed.
I am most concerned about the inability to remove the cluster.
OK, thanks. /cc @lxtywypc What do you think?
Hi @RainbowMango @chaunceyjiang @XiShanYongYe-Chang
I'm sorry for the late reply.
After a discussion within our team, we think the core issue is why we need max retry in AsyncWorker. We could have a further discussion on this later.
As for #3808, I think it's okay to revert it for the new version coming soon, but we still think its core idea is right. We hope it can be brought back once we solve the max retry problem.
Great thanks @lxtywypc
I totally agree that the work in #3808 is incredible; yes, definitely, we can bring it back in the next release (1.8). I also have some ideas about it and will leave my thoughts on #3807.
Since we all agree to revert it, who will help to do it?
I'll do it as soon as possible
I think we could discuss this further now.
First, we wonder why we need max retry in AsyncWorker. Kindly ping @XiShanYongYe-Chang, could you help explain it again with more details or examples?
Hi @lxtywypc, I may not be able to explain why AsyncWorker
needs to set a maximum retry count; we need @RainbowMango's help to answer this.
IMO, maybe we don't need the max retry; directly using the configuration of RateLimitingInterface
should be enough.
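A minimal sketch of that suggestion, assuming a plain client-go workqueue (the backoff values are illustrative, not Karmada's defaults): configure the rate limiter and always requeue on failure, with no separate max-retry cutoff.

```go
package sketch

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newQueue builds a queue whose per-item exponential backoff is the only throttle:
// 5ms, 10ms, 20ms, ... capped at 1000s (illustrative values).
func newQueue() workqueue.RateLimitingInterface {
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)
	return workqueue.NewRateLimitingQueue(limiter)
}

// handleResult always requeues on failure instead of dropping after N attempts,
// so an item stuck behind an unavailable cluster keeps being retried until it recovers.
func handleResult(queue workqueue.RateLimitingInterface, key interface{}, err error) {
	if err != nil {
		queue.AddRateLimited(key)
		return
	}
	queue.Forget(key) // reset the backoff once the sync succeeds
}
```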
Hi @lxtywypc, I have talked with @RainbowMango, and we agree that we don't need this max retry. Would you like to update it?
@XiShanYongYe-Chang @RainbowMango Thanks for your replies.
I'm glad to help update it, and I will try to bring #3808 back after that. :)
/assign
What happened:
As described in comment: https://github.com/karmada-io/karmada/pull/3808#discussion_r1295679640
What you expected to happen:
After the failed cluster is recovered, these residual resources can be deleted normally.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Karmada version (kubectl-karmada version or karmadactl version):