Warning FailedCreatePodSandBox 31s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d03d1785bad07f92d23677169acc40ecdd3ff90658d18c39ead55b010438fb4b": plugin type="multus" name="multus-cni-network" failed (add): [drun-test/llama2-master-0/f699f414-842c-40ab-8379-71710eac15c0:sriov-gpu20-enp40s0np0]: error adding container to network "sriov-gpu20-enp40s0np0": failed to set up IPAM plugin type "spiderpool" from the device "enp40s0np0": spiderpool IP allocation error: [POST /ipam/ip][500] postIpamIpFailure failed to allocate IP addresses in standard mode: failed to patch IP allocation results to Endpoint: Operation cannot be fulfilled on spiderendpoints.spiderpool.spidernet.io "llama2-master-0": the object has been modified; please apply your changes to the latest version and try again
What did you expect to happen?
The pod sandbox is created and the IP address is allocated successfully.
How to reproduce it (as minimally and precisely as possible)
PyTorch jobs are created in batches, and their pods get sequentially numbered names, like the pods of a stateful application (for example, llama2-master-0). After creating a set of jobs, the administrator quickly cancels them and creates a new set with the same names. Occasionally, an endpoint with the same name is left over from the earlier pod, and the IP address for the new pod cannot be allocated.
Additional Context
Proposed solution: when the pod UID recorded in an endpoint does not belong to any existing pod, the endpoint is stale. Detect this case, delete or update the endpoint object, and let GC clean up the old data.
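The staleness check could look roughly like the sketch below. It is only an illustration of the idea, not Spiderpool's actual code: the `Pod` and `Endpoint` classes, the `pod_uid` field, and `find_stale_endpoints` are hypothetical names, and it assumes the endpoint records the UID of the pod it was created for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    namespace: str
    name: str
    uid: str


@dataclass(frozen=True)
class Endpoint:
    # Hypothetical, simplified view of an endpoint object: only the
    # fields needed for the staleness check (assumed, not the real CRD).
    namespace: str
    name: str
    pod_uid: str


def find_stale_endpoints(endpoints, pods):
    """Return endpoints whose recorded pod UID matches no live pod."""
    live_uids = {pod.uid for pod in pods}
    return [ep for ep in endpoints if ep.pod_uid not in live_uids]


# The job was recreated: the pod keeps its name but has a new UID,
# while the leftover endpoint still points at the deleted pod.
pods = [Pod("drun-test", "llama2-master-0",
            "f699f414-842c-40ab-8379-71710eac15c0")]
endpoints = [Endpoint("drun-test", "llama2-master-0",
                      "uid-of-the-deleted-pod")]
stale = find_stale_endpoints(endpoints, pods)
```

Here `stale` contains the leftover `llama2-master-0` endpoint, which would then be deleted or updated before IP allocation is retried. Comparing by UID rather than by name matters because the recreated pod reuses the name.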
Spiderpool Version
v0.9.3
Bug Type
IPAM
Main CNI
macvlan