spidernet-io / spiderpool

Underlay and RDMA network solution of the Kubernetes, for bare metal, VM and any public cloud
https://spidernet-io.github.io/spiderpool/
Apache License 2.0
523 stars 75 forks

The job application was rebuilt and the same endpoint name was kept, which made it impossible to assign an IP address. #3699

Closed by ty-dc 2 weeks ago

ty-dc commented 2 months ago

Spiderpool Version

v0.9.3

Bug Type

IPAM

Main CNI

macvlan

What happened?

Warning  FailedCreatePodSandBox  31s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d03d1785bad07f92d23677169acc40ecdd3ff90658d18c39ead55b010438fb4b": plugin type="multus" name="multus-cni-network" failed (add): [drun-test/llama2-master-0/f699f414-842c-40ab-8379-71710eac15c0:sriov-gpu20-enp40s0np0]: error adding container to network "sriov-gpu20-enp40s0np0": failed to set up IPAM plugin type "spiderpool" from the device "enp40s0np0": spiderpool IP allocation error: [POST /ipam/ip][500] postIpamIpFailure  failed to allocate IP addresses in standard mode: failed to patch IP allocation results to Endpoint: Operation cannot be fulfilled on spiderendpoints.spiderpool.spidernet.io "llama2-master-0": the object has been modified; please apply your changes to the latest version and try again
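
For context, "the object has been modified" is the Kubernetes API server's optimistic-concurrency conflict: a write that carries an out-of-date resource version is rejected. The sketch below only models that rule with an in-memory store. The `Store` class, its methods, and the IP values are hypothetical and are not Spiderpool or Kubernetes client code.

```python
from dataclasses import dataclass, field


class ConflictError(Exception):
    """Raised when an update carries an out-of-date resource version."""


@dataclass
class Store:
    """Hypothetical in-memory stand-in for the API server."""
    objects: dict = field(default_factory=dict)  # name -> (resource_version, data)

    def get(self, name):
        return self.objects[name]

    def create(self, name, data):
        self.objects[name] = (1, dict(data))

    def update(self, name, seen_version, data):
        current_version, _ = self.objects[name]
        if seen_version != current_version:
            # Same rule the API server applies: stale writers are rejected.
            raise ConflictError(f'"{name}": the object has been modified')
        self.objects[name] = (current_version + 1, dict(data))


store = Store()
store.create("llama2-master-0", {"ips": []})

# Two writers read the same version of the endpoint.
version_a, _ = store.get("llama2-master-0")
version_b, _ = store.get("llama2-master-0")

# Writer A updates first, which bumps the stored version.
store.update("llama2-master-0", version_a, {"ips": ["10.0.0.5"]})

# Writer B still holds the old version, so its update is rejected.
try:
    store.update("llama2-master-0", version_b, {"ips": ["10.0.0.6"]})
    conflict = False
except ConflictError:
    conflict = True
```

In this model the second writer has to re-read the object and retry. Whether a retry alone is enough in the reported case depends on why the endpoint kept changing, which the log does not show.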

What did you expect to happen?

The pod sandbox is created and the pod is assigned an IP address successfully.

How to reproduce it (as minimally and precisely as possible)

PyTorch jobs are created in batches, and their pods are named with ordinal suffixes, like the pods of a stateful application (for example, llama2-master-0). If the administrator creates a set of jobs, quickly cancels them, and then creates a new set, the new pods reuse the names of the old ones. Occasionally, an endpoint with the same name left over from the previous pod remains, and the IP address cannot be allocated to the new pod.

Additional Context

Proposed solution: when the pod UID recorded in an endpoint no longer corresponds to an existing pod, detect this and delete or update the endpoint object, and let GC clean up the old data.
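
A minimal sketch of that check, assuming each endpoint records the UID of the pod it was created for. The `Endpoint` class, the `find_stale_endpoints` helper, the `llama2-worker-0` name, and the UID values are hypothetical illustrations, not Spiderpool's actual types or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical, simplified view of an endpoint record."""
    namespace: str
    name: str
    pod_uid: str  # UID of the pod the endpoint was created for


def find_stale_endpoints(endpoints, live_pod_uids):
    """Return endpoints whose recorded pod UID matches no live pod."""
    return [ep for ep in endpoints if ep.pod_uid not in live_pod_uids]


endpoints = [
    # Left over from a cancelled job; its pod no longer exists.
    Endpoint("drun-test", "llama2-master-0", "uid-old"),
    # Belongs to a pod that is still running.
    Endpoint("drun-test", "llama2-worker-0", "uid-live"),
]

# UIDs of pods that currently exist, including the recreated llama2-master-0.
live_pod_uids = {"uid-live", "uid-new"}

stale = find_stale_endpoints(endpoints, live_pod_uids)
```

Here only the old `llama2-master-0` endpoint is reported as stale, so it could be deleted or updated before an IP is allocated to the recreated pod with the same name.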

ty-dc commented 2 weeks ago

fix #3778