karmada-io / karmada

Open, Multi-Cloud, Multi-Cluster Kubernetes Orchestration
https://karmada.io
Apache License 2.0

graceful eviction tasks in resource binding never cleaned #3747

Open tedli opened 1 year ago

tedli commented 1 year ago

What happened:

The gracefulEvictionTasks field in the spec of a resource binding keeps growing and is never cleaned up.

What you expected to happen:

Once a task has finished, it should be removed from gracefulEvictionTasks.

How to reproduce it (as minimally and precisely as possible):

Patch gracefulEvictionTasks to add an item and trigger an eviction, wait until the task has finished, then check the gracefulEvictionTasks field: the task is still there.

Anything else we need to know?:

It may be because of line 73: the Patch behaves as a merge, which won't remove the tasks that are not kept:

https://github.com/karmada-io/karmada/blob/b01cf50caee8c895c808c2ba7d7dbb75eff2a5b8/pkg/controllers/gracefuleviction/rb_graceful_eviction_controller.go#L64-L76

Environment:

Poor12 commented 1 year ago

Could you please share the details of how to reproduce this issue? Did you edit the ResourceBinding directly?

tedli commented 1 year ago

Hi @Poor12 ,

Thanks for the reply. Yes, editing the resource binding directly reproduces this.

However, I did it from a controller (an out-of-tree, third-party controller), as mentioned in #3540.

The controller watches Cluster objects, checks for cluster label changes, and evicts resources that no longer match the placement's cluster selector.

I have already changed Patch to Put in my environment and run it for a week; with Put, gracefulEvictionTasks is updated correctly.

Poor12 commented 1 year ago

From the comments, MergeFrom has replace behavior. I wonder why it merges rather than replaces. (image)
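For reference, controller-runtime's client.MergeFrom produces a JSON merge patch, and under RFC 7386 semantics an array in the patch replaces the target array wholesale. The following is only a minimal sketch of those semantics using the standard library; applyMergePatch and mergePatchJSON are hypothetical helpers written for illustration, not code from Karmada or the API server, and the sketch does not establish the root cause of this issue. It also shows one case worth ruling out: a patch that does not mention the field at all leaves the existing list untouched.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// applyMergePatch applies an RFC 7386 JSON merge patch to target.
// Hypothetical helper for illustration only.
func applyMergePatch(target, patch any) any {
	patchObj, ok := patch.(map[string]any)
	if !ok {
		// Any non-object patch value (including an array) replaces the target.
		return patch
	}
	targetObj, ok := target.(map[string]any)
	if !ok {
		targetObj = map[string]any{}
	}
	for key, value := range patchObj {
		if value == nil {
			// A null member deletes the key from the target.
			delete(targetObj, key)
			continue
		}
		targetObj[key] = applyMergePatch(targetObj[key], value)
	}
	return targetObj
}

// mergePatchJSON is a convenience wrapper that works on raw JSON strings.
func mergePatchJSON(doc, patch string) (string, error) {
	var d, p any
	if err := json.Unmarshal([]byte(doc), &d); err != nil {
		return "", err
	}
	if err := json.Unmarshal([]byte(patch), &p); err != nil {
		return "", err
	}
	out, err := json.Marshal(applyMergePatch(d, p))
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	doc := `{"spec":{"gracefulEvictionTasks":[{"fromCluster":"member1"},{"fromCluster":"member2"}]}}`

	// The patch lists only the task to keep: the whole array is replaced.
	kept, _ := mergePatchJSON(doc, `{"spec":{"gracefulEvictionTasks":[{"fromCluster":"member2"}]}}`)
	fmt.Println(kept)

	// A null value removes the field entirely.
	cleared, _ := mergePatchJSON(doc, `{"spec":{"gracefulEvictionTasks":null}}`)
	fmt.Println(cleared)

	// A patch that omits the field leaves the existing list as it was.
	unchanged, _ := mergePatchJSON(doc, `{"spec":{}}`)
	fmt.Println(unchanged)
}
```

If the list really is not shrinking, comparing the patch body that is actually sent against these three cases may help narrow down where the behavior diverges.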

liangyuanpeng commented 1 year ago

@tedli

I can check it if you can share some code to reproduce it.

tedli commented 1 year ago

Hi @liangyuanpeng ,

Just edit the gracefulEvictionTasks field of a resource binding, either with kubectl or through the API, as I described in my previous comment.

I found this issue when I checked a resource binding with kubectl -o yaml and it printed a really huge output containing hundreds of eviction task items. After changing Patch to Put, the problem went away.

All my environments now run a modified version of the scheduler that replaces Patch with Put, and with Put this issue is fixed. I don't currently have time to set up a new environment to reproduce it.

Feel free to close this issue if you can't reproduce it.

SerenaTiede-Zen commented 2 months ago

+1 Hey, we had the same issue: a resource was not propagating because a gracefulEvictionTask was not removed, even though the cluster in question reported healthy. I manually deleted the gracefulEvictionTasks and that fixed the issue. Is there a better way of cleaning out stale tasks?

RainbowMango commented 2 months ago

Yes, I guess we can try to reproduce it and figure out the root cause. Like @Poor12, I'm curious why MergeFrom does not replace the whole list.

@SerenaTiede-Zen would you like to have a try?

@liangyuanpeng are you still interested in this issue?

also cc the author here @XiShanYongYe-Chang

chaosi-zju commented 2 months ago

There is a similar issue: #4951


> if the task already finished, task should removed from gracefulEvictionTasks

Actually, if a task has finished, it is removed from gracefulEvictionTasks.

However, the gracefulEvictionTasks of non-workload resources do not finish so quickly. As described in https://github.com/karmada-io/karmada/issues/4951#issuecomment-2116568837, we don't have a default InterpretHealth resource interpretation behavior for ClusterRole/ConfigMap resources, so the cluster in gracefulEvictionTasks waits for the timeout.


> Hey we had the same issue, where a resource was not propagating due to gracefulEvictionTask not being removed and the cluster in question reports healthy.

Hi @SerenaTiede-Zen, what type of resource did you use in this case? Normally, the gracefulEvictionTask of a Deployment would not cause this trouble, while a non-workload resource may indeed do so.

When the cluster becomes healthy, the gracefulEvictionTask of a Deployment should finish and be removed, while the gracefulEvictionTask of a non-workload resource keeps existing until the task times out.