karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io

After the failed cluster recovers, the residual resources are not deleted #3959

Closed XiShanYongYe-Chang closed 1 year ago

XiShanYongYe-Chang commented 1 year ago

What happened:

As described in this comment: https://github.com/karmada-io/karmada/pull/3808#discussion_r1295679640

What you expected to happen:

After the failed cluster recovers, these residual resources can be deleted normally.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

XiShanYongYe-Chang commented 1 year ago

/cc @chaunceyjiang @lxtywypc @zach593 Can you help take a look?

zach593 commented 1 year ago

How about letting the execution-controller watch the cluster object's unready -> ready transition? If we're worried about enqueuing too often, lastTransitionTime could help determine how long the cluster was unavailable.
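
For illustration, a minimal sketch of that idea (the condition shape and the "Ready" condition type name here are assumptions, not Karmada's actual types):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// readyConditionType is assumed to be the condition that reports member-cluster
// health; the real condition type used by Karmada may differ.
const readyConditionType = "Ready"

// shouldRequeueOnRecovery returns true when the cluster just flipped from
// NotReady to Ready, along with how long it was unavailable (derived from the
// old condition's lastTransitionTime), so callers can ignore clusters that were
// only briefly unreachable.
func shouldRequeueOnRecovery(oldConds, newConds []metav1.Condition, minOutage time.Duration) (bool, time.Duration) {
	oldReady := meta.FindStatusCondition(oldConds, readyConditionType)
	newReady := meta.FindStatusCondition(newConds, readyConditionType)
	if oldReady == nil || newReady == nil {
		return false, 0
	}
	// Only react to a NotReady -> Ready transition.
	if oldReady.Status == metav1.ConditionTrue || newReady.Status != metav1.ConditionTrue {
		return false, 0
	}
	outage := newReady.LastTransitionTime.Sub(oldReady.LastTransitionTime.Time)
	return outage >= minOutage, outage
}

func main() {
	past := metav1.NewTime(time.Now().Add(-10 * time.Minute))
	now := metav1.NewTime(time.Now())
	old := []metav1.Condition{{Type: readyConditionType, Status: metav1.ConditionFalse, LastTransitionTime: past}}
	cur := []metav1.Condition{{Type: readyConditionType, Status: metav1.ConditionTrue, LastTransitionTime: now}}
	ok, outage := shouldRequeueOnRecovery(old, cur, time.Minute)
	fmt.Printf("requeue=%v outage=%s\n", ok, outage.Round(time.Second))
}
```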

XiShanYongYe-Chang commented 1 year ago

After the cluster recovers, will all resources on the cluster be resynchronized? If so, will the load on the controller suddenly spike?

What does this strategy specifically refer to? https://github.com/karmada-io/karmada/pull/3808#discussion_r1296588633

zach593 commented 1 year ago

After the cluster recovers, will all resources on the cluster be resynchronized? If so, will the load on the controller suddenly spike?

I've been mulling over an idea for a while: make objectWatcher use a DeepEqual() check to see whether the content has actually changed (mutating webhooks and API compatibility between karmada and member clusters might be a problem here), so that the execution-controller updates member clusters less frequently while reconciling.
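
A rough sketch of that idea (illustrative only, not Karmada's objectWatcher code; limiting the comparison to spec, labels, and annotations is an assumption made to sidestep fields added by webhooks and defaulting):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// needsUpdate reports whether the object in the member cluster differs from the
// desired state. Comparing only spec, labels, and annotations is an assumption:
// mutating webhooks and defaulting in the member cluster add fields, so a naive
// whole-object DeepEqual would report a diff on every reconcile.
func needsUpdate(desired, observed *unstructured.Unstructured) bool {
	desiredSpec, _, _ := unstructured.NestedMap(desired.Object, "spec")
	observedSpec, _, _ := unstructured.NestedMap(observed.Object, "spec")
	if !equality.Semantic.DeepEqual(desiredSpec, observedSpec) {
		return true
	}
	return !equality.Semantic.DeepEqual(desired.GetLabels(), observed.GetLabels()) ||
		!equality.Semantic.DeepEqual(desired.GetAnnotations(), observed.GetAnnotations())
}

func main() {
	desired := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "apps/v1",
		"kind":       "Deployment",
		"spec":       map[string]interface{}{"replicas": int64(2)},
	}}
	observed := desired.DeepCopy()
	fmt.Println(needsUpdate(desired, observed)) // false -> the Update call can be skipped
}
```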

What does this strategy specifically refer to? #3808 (comment)

This makes every affected item rate-limited, and clearing items from the rate limiter may take a long time. Also, one day we might let users control the rate limiter options and the number of async worker retries (or just switch to controller-runtime); in that case, a recovery mechanism that relies on dynamic parameters is unreliable.

On the other hand, there is not much difference between watching cluster objects and relying on the async worker / controller-runtime retry mechanism; both end up triggering execution-controller reconciliation.

chaunceyjiang commented 1 year ago

/cc @XiShanYongYe-Chang @lxtywypc @RainbowMango

A new version is about to be released. If this issue is not resolved, I think we may need to revert #3808.

RainbowMango commented 1 year ago

@XiShanYongYe-Chang @lxtywypc What's your opinion?

XiShanYongYe-Chang commented 1 year ago

My thoughts may not be correct, but my view is that the scope of this issue can be contained. Could we consider moving it to the next version and addressing the problems mentioned in the current issue there? Please correct me if I'm wrong.

chaunceyjiang commented 1 year ago

According to the description in https://github.com/karmada-io/karmada/issues/3999, the finalizer of the Work was not removed correctly, which prevents the ExecutionSpace from being deleted and thus prevents the cluster from being removed.
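
For context, the usual controller-runtime pattern is that the finalizer is removed only after cleanup in the member cluster succeeds; if that step never runs, the Work keeps its finalizer and the ExecutionSpace cannot be deleted. A generic sketch (not Karmada's actual code; the finalizer name is illustrative):

```go
package sketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Illustrative finalizer name; the actual finalizer used on Work objects may differ.
const executionFinalizer = "example.io/execution-controller"

// finishDeletion removes the finalizer once cleanup in the member cluster has
// succeeded. If this step never runs (for example, the cluster stays unreachable
// until retries are exhausted), the Work keeps its finalizer, so the Work, its
// execution namespace, and ultimately the Cluster cannot be deleted.
func finishDeletion(ctx context.Context, c client.Client, work client.Object) error {
	if controllerutil.RemoveFinalizer(work, executionFinalizer) {
		return c.Update(ctx, work)
	}
	return nil
}
```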

I am most concerned about the inability to remove the cluster.

XiShanYongYe-Chang commented 1 year ago

OK, thanks. /cc @lxtywypc What do you think?

lxtywypc commented 1 year ago

Hi @RainbowMango @chaunceyjiang @XiShanYongYe-Chang
I'm sorry for the late reply.

After discussing within our team, we think the core question is why we need a max retry in AsyncWorker at all. We could have a further discussion on this later.

As for #3808, I think it's okay to revert it for the new version coming soon, but we believe its core idea is still right. We hope it can be brought back once we solve the problem of the max retry.

RainbowMango commented 1 year ago

Great thanks @lxtywypc

I totally agree that the work in #3808 is incredible; yes, definitely, we can bring it back in the next release (1.8). I also have some ideas about it and will leave my thoughts on #3807.

Since we all agree to revert it, who will help to do it?

lxtywypc commented 1 year ago

Since we all agree to revert it, who will help to do it?

I'll do it as soon as possible

lxtywypc commented 1 year ago

I think we could discuss this a bit further now.

First, we wonder why we need a max retry in AsyncWorker at all. Kindly pinging @XiShanYongYe-Chang: could you help explain it again with more details or examples?

XiShanYongYe-Chang commented 1 year ago

Hi @lxtywypc, I may not be able to explain why AsyncWorker needs a maximum retry count; we need @RainbowMango's help to answer this.

IMO, maybe we don't need the max retry; directly using the configuration of the RateLimitingInterface should be enough.
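
For illustration, a sketch of that approach with a plain client-go rate-limited workqueue (not Karmada's AsyncWorker code): retries are governed entirely by the rate limiter's backoff, and no max-retry cap ever drops an item.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Per-item exponential backoff, 5ms base delay capped at 1000s; the rate
	// limiter alone decides when a failed item comes back, not a retry counter.
	limiter := workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second)
	queue := workqueue.NewRateLimitingQueue(limiter)
	defer queue.ShutDown()

	queue.Add("cluster-a/work-foo")

	for attempt := 1; attempt <= 3; attempt++ {
		key, shutdown := queue.Get()
		if shutdown {
			return
		}
		// Pretend the member cluster is still unreachable: requeue with backoff
		// instead of dropping the item after a fixed number of attempts.
		fmt.Printf("attempt %d for %v (requeues so far: %d)\n", attempt, key, queue.NumRequeues(key))
		queue.AddRateLimited(key)
		queue.Done(key)
		// When a sync finally succeeds, call queue.Forget(key) to reset the backoff.
	}
}
```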

XiShanYongYe-Chang commented 1 year ago

Hi @lxtywypc, I have talked to @RainbowMango, and we agree that this max retry is not needed. Would you like to update it?

lxtywypc commented 1 year ago

@XiShanYongYe-Chang @RainbowMango Thanks for your replies.

I'd be glad to help update it, and I will try to bring #3808 back after that. :)

lxtywypc commented 1 year ago

/assign