k8snetworkplumbingwg / ib-sriov-cni

InfiniBand SR-IOV CNI

GUID not restored to "F"s on pod deletion #51

Closed DmytroLinkin closed 3 years ago

DmytroLinkin commented 3 years ago

The current cmdDel behavior is to restore the VF's GUID from the cache. For some reason the GUID isn't restored, and on subsequent Adds and Dels the GUID set by ib-sriov-cni becomes the "cached" one.

adrianchiris commented 3 years ago

ib-sriov-cni does not always need to restore the VF's GUID to 0xfff...; it does so only when the original GUID was all zeroes. The issue observed here, IMO, is that cmdDel either is not invoked or fails to restore the GUID to its original state, causing the unrestored value to be cached.

DmytroLinkin commented 3 years ago

The root cause is as follows. Two pods were created with VFs named, for example, ib2 and ib4 (the host has 4 VFs: ib2 .. ib5). On pod deletion the plugin first releases the VF (renames it to its initial name and moves it to the default netns) and then resets the VF's configuration, which rebinds the VF to apply the GUID changes. When the VF is bound back, it gets the name with the lowest available number. So if the pod with VF ib4 is deleted, on "bind" the VF gets the name ib2. On deletion of the second pod the plugin then fails to move its VF to the default netns, since a VF with that name already exists.

It is also possible that a user has assigned specific names to the VFs, but after deletion the VFs get different ones.

I see here 2 solutions:

adrianchiris commented 3 years ago

I think we should restore the name (again) after the VF rebind in sriovManager.ResetVFConfig(). The end goal is to avoid the driver rebind altogether, which requires a kernel feature.