Open midnattsol opened 4 months ago
I've seen this PR, which modifies the behaviour for switchdev: https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/643
Could this potentially help with the problem?
There is a global lock in the ib-sriov-cni that should prevent this.
@e0ne @ykulazhenkov is this something you will be able to take a look at?
Hi,
Can you provide the SriovIbNetwork you defined, as well as the SriovPolicy?
Is the only problem that the RDMA device changes (i.e. mlx5_19 gets recreated/renamed to mlx5_24 after the pod was deleted)? When the new pod starts, does it have the correct mounts? And is UCX unable to cope with RDMA device mlx5_24 having ULPs with a different index (e.g. uverbs19)?
Environment
Problem Description
When I create the statefulset in parallel, or when it terminates (all the pods terminate at the same time), some interfaces randomly end up pointing to a different PCI device.
So when I check the devices associated on the host, they are totally mixed up.
So the pods cannot recognize the mlx interfaces to use them with UCX.
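For anyone trying to reproduce this, here is a minimal sketch of one way to record the RDMA-device-to-PCI mapping on the host, so it can be compared before and after the pods are deleted. It assumes the usual Linux sysfs layout, where each entry under `/sys/class/infiniband` has a `device` symlink pointing at its PCI device. The function name `rdma_to_pci` and the `sysfs_root` parameter are hypothetical, chosen only for this illustration; this is not how the operator or the CNI checks the mapping.

```python
import os
from pathlib import Path


def rdma_to_pci(sysfs_root="/sys/class/infiniband"):
    """Return {rdma_device_name: pci_address} as read from sysfs.

    Assumption: every RDMA device directory contains a 'device' symlink
    whose resolved target ends in the PCI address (e.g. 0000:3b:00.0).
    """
    mapping = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No RDMA devices exposed on this host (or wrong path).
        return mapping
    for dev in sorted(root.iterdir()):
        link = dev / "device"
        if link.exists():
            # Resolve the symlink and keep only its last path component.
            mapping[dev.name] = os.path.basename(os.path.realpath(link))
    return mapping


if __name__ == "__main__":
    for name, pci in rdma_to_pci().items():
        print(f"{name} -> {pci}")
```

Running it on the node once before deleting the pods and once after, then comparing the two outputs, should show whether a name such as mlx5_19 still resolves to the same PCI address.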
Workaround so far
Creating the cluster sequentially and scaling to 0 before terminating the statefulset helps, because the race condition is not triggered, but I guess this is not the expected behaviour.