heyfey / vodascheduler

GPU scheduler for elastic/distributed deep learning workloads in Kubernetes clusters (IC2E'23)
Apache License 2.0

[placement manager] missing toleration in some migrated pods #1

Closed heyfey closed 2 years ago

heyfey commented 2 years ago

The informer may deliver an Update event in which the old and new pod have different UIDs if a delete is immediately followed by a create.

The placement manager currently handles only pod create events, not pod update events. Therefore, if a delete-then-create of a pod is reported as a single pod update, the newly created pod will be missing the proper toleration and will be stuck in the Pending phase.

ref: https://github.com/microsoft/hivedscheduler/blob/66f26ac47b2ec74bd14278c767e4f9d779ff1682/pkg/scheduler/scheduler.go#L265
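
One possible way to handle this is to decompose an update whose UID changed into a delete of the old pod followed by a create of the new one. The sketch below only illustrates that idea and is not the actual fix in this repo: `Pod`, `PlacementManager`, `addPod`, `deletePod`, `updatePod`, and the toleration string are all hypothetical stand-ins (a real handler would work with `corev1.Pod` objects from client-go).

```go
package main

// Pod is a hypothetical, minimal stand-in for a Kubernetes pod object.
// Only the fields needed for this sketch are modeled.
type Pod struct {
	Name        string
	UID         string
	Tolerations []string
}

// PlacementManager is a hypothetical stand-in for the placement manager.
type PlacementManager struct {
	// tolerated records the UIDs of pods that were given a toleration.
	tolerated map[string]bool
}

// NewPlacementManager returns a manager with no tracked pods.
func NewPlacementManager() *PlacementManager {
	return &PlacementManager{tolerated: make(map[string]bool)}
}

// addPod handles a pod create: it attaches a (hypothetical) toleration
// and remembers the pod by UID.
func (pm *PlacementManager) addPod(pod *Pod) {
	pod.Tolerations = append(pod.Tolerations, "hypothetical-toleration")
	pm.tolerated[pod.UID] = true
}

// deletePod handles a pod delete: it forgets the pod's UID.
func (pm *PlacementManager) deletePod(pod *Pod) {
	delete(pm.tolerated, pod.UID)
}

// updatePod handles a pod update. If the UID changed, the event is really
// a delete of the old pod plus a create of a new one, so it is decomposed
// into those two steps. Updates to the same pod are ignored in this sketch.
func (pm *PlacementManager) updatePod(oldPod, newPod *Pod) {
	if oldPod.UID != newPod.UID {
		pm.deletePod(oldPod)
		pm.addPod(newPod)
		return
	}
}
```

With this shape, a pod that is re-created during migration goes through the same `addPod` path as any other new pod, even when the informer reports the delete and create as one update, so it no longer depends on a separate create event to receive its toleration.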