koordinator-sh / koordinator

A QoS-based scheduling system brings optimal layout and status to workloads such as microservices, web services, big data jobs, AI jobs, etc.
https://koordinator.sh
Apache License 2.0
1.37k stars 333 forks

[BUG] Pods are deleted from gang.Children when recreated with the same name. #2148

Open KunWuLuan opened 4 months ago

KunWuLuan commented 4 months ago

What happened:

In my environment, a pod in a gang may be recreated with the same name. When the add event for the new pod arrives before the delete event for the old pod, the late delete event removes the new pod from gang.Children. This can result in "No enough children in gang", and no pods can be scheduled.

What you expected to happen:

gang.Children should be set correctly even when pod events arrive out of order.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

KunWuLuan commented 4 months ago

We can record the pod UID when calling GetId() to get the podId, so that out-of-order pod events are handled correctly.
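
A minimal sketch of this idea, not the actual Koordinator code: the `pod` and `gang` types, `podID`, `addChild`, and `deleteChild` are hypothetical names, and the namespace/name key only approximates what GetId() returns. It assumes the gang tracks one UID per pod ID and ignores a delete event whose UID does not match the tracked one.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for the Pod fields used in this sketch.
type pod struct {
	Namespace, Name, UID string
}

// podID builds a namespace/name key (an assumption; the real GetId() may differ).
func podID(p pod) string { return p.Namespace + "/" + p.Name }

// gang is a simplified, hypothetical model of a gang's child set.
type gang struct {
	children map[string]string // pod ID -> UID of the pod currently tracked
}

func newGang() *gang { return &gang{children: map[string]string{}} }

// addChild records the pod and its UID, replacing any older incarnation.
func (g *gang) addChild(p pod) { g.children[podID(p)] = p.UID }

// deleteChild removes the pod only if its UID matches the tracked UID,
// so a late delete event for an old incarnation is ignored.
func (g *gang) deleteChild(p pod) bool {
	uid, ok := g.children[podID(p)]
	if !ok || uid != p.UID {
		return false
	}
	delete(g.children, podID(p))
	return true
}

func main() {
	g := newGang()
	oldPod := pod{"default", "worker-0", "uid-old"}
	newPod := pod{"default", "worker-0", "uid-new"}

	g.addChild(oldPod)
	g.addChild(newPod)               // add event for the recreated pod arrives first
	removed := g.deleteChild(oldPod) // delete event for the old pod arrives late

	fmt.Println(removed, len(g.children)) // prints: false 1
}
```

With a key that is only namespace/name, the late delete would remove the recreated pod. Comparing UIDs keeps the new pod in the child set.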

ZiMengSheng commented 3 months ago

Nice suggestion! Could you add details about why the pod events arrive out of order?

stale[bot] commented 4 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. This bot triages issues and PRs according to the following rules: