k8snetworkplumbingwg / sriov-network-device-plugin

SRIOV network device plugin for Kubernetes
Apache License 2.0

Device plugin does not detect new or updated VF resources, must kill Pod or redeploy #276

Open · xagent003 opened this issue 4 years ago

xagent003 commented 4 years ago

The device plugin only discovers devices once, at startup. VFs need to be created and drivers bound (if using something other than the default kernel driver) BEFORE we create the sriov device plugin's ConfigMap and spin up the device plugin DaemonSet. I accidentally did it the other way around, and the device plugin did not discover the newly created resource. The capacity/allocatable was reported as zero.

The only way to fix this was to kill each pod (it's easier to just kubectl delete the DaemonSet and re-apply it).

IMO, the device plugin should periodically re-scan the devices covered by its ConfigMap.

There may be cases where the VFs created under a NIC change.

Also, this would remove the strict ordering requirement on node configuration. We could spin up the device plugin and its ConfigMap and configure the host networking later, as in my case, where I deployed multus/sriov/deviceplugin and then realized I had forgotten to actually write to the "sriov_numvfs" file.

adrianchiris commented 4 years ago

I would say there are two cases that need to be handled:

  1. node state has changed (driver bound, sriov created), causing a change to a resource
  2. config map changed

Today, in both cases a restart of the device plugin is required to properly report updated resources to the kubelet. I'm +1 on adding support for this.

@zshi-redhat having a periodic monitor for 1 and 2, and updating the sriov device plugin resources when needed, would also benefit sriov-network-operator, which today restarts (deletes the pods of) the sriov device plugin on configuration change. WDYT?
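
A rough sketch of what such a periodic monitor could look like, counting the virtfn* links that sysfs exposes for each VF of a PF; the PF path, poll interval, and refresh callback are illustrative assumptions, not the plugin's actual code:

```go
// Hypothetical sketch only: poll a PF's sysfs directory and call a refresh
// callback when the VF count changes. The interface name, poll interval and
// refresh hook are assumptions for illustration.
package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// countVFs counts the virtfn* symlinks that sysfs creates for each VF of a PF.
func countVFs(pfDevicePath string) (int, error) {
	matches, err := filepath.Glob(filepath.Join(pfDevicePath, "virtfn*"))
	if err != nil {
		return 0, err
	}
	return len(matches), nil
}

// watchVFs polls the PF every interval and calls refresh when the VF count changes.
func watchVFs(pfDevicePath string, interval time.Duration, refresh func(numVFs int)) {
	last := -1
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		n, err := countVFs(pfDevicePath)
		if err != nil {
			fmt.Println("scan error:", err)
			continue
		}
		if n != last {
			last = n
			refresh(n)
		}
	}
}

func main() {
	// Example PF sysfs path; adjust for the actual interface name on the node.
	watchVFs("/sys/class/net/ens1f0/device", 30*time.Second, func(n int) {
		fmt.Printf("VF count is now %d; resource pools should be refreshed\n", n)
	})
}
```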

amorenoz commented 4 years ago

How would we handle already allocated resources?

adrianchiris commented 4 years ago

@amorenoz can you elaborate? Do you mean keeping the IDs of existing devices?

amorenoz commented 4 years ago

@adrianchiris I mean that the Device Plugin does not know which of its resources have already been allocated to pods. It is called on Allocate() but not on pod teardown. This may make handling of some changes challenging, e.g.:

- A ConfigMap change removes an allocated resource.
- A ConfigMap change modifies the way an allocated device is exposed (e.g. rdma, vdpa, vhost-net, etc.), but the device (and its information file) is already being used.
- A ConfigMap change moves a device to a different pool that the pod currently using the device is not entitled to use.

Maybe if a change in the node/ConfigMap is detected, the DP could query the kubelet and only proceed with the resource pool reconfiguration if none of its resources are currently being used (also, during the reconfiguration process, it must return an error on Allocate()).
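
A minimal sketch of that idea, assuming the kubelet PodResources gRPC API is enabled and served at the default socket path; the resource name and helper function are examples, not part of the plugin today:

```go
// Hypothetical sketch only: ask the kubelet PodResources API which device IDs
// are currently allocated to pods for a given resource, so a rescan could be
// skipped or deferred while any of them are in use.
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

const kubeletSocket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

// allocatedDevices returns the device IDs kubelet reports as allocated for resourceName.
func allocatedDevices(resourceName string) ([]string, error) {
	conn, err := grpc.Dial(kubeletSocket, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	resp, err := client.List(ctx, &podresourcesv1.ListPodResourcesRequest{})
	if err != nil {
		return nil, err
	}

	var ids []string
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				if dev.GetResourceName() == resourceName {
					ids = append(ids, dev.GetDeviceIds()...)
				}
			}
		}
	}
	return ids, nil
}

func main() {
	// Example resource name; use whatever the ConfigMap actually defines.
	ids, err := allocatedDevices("intel.com/sriov_netdevice")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Printf("devices currently allocated to pods: %v\n", ids)
}
```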

adrianchiris commented 4 years ago

I understand your point. But should it be the device plugin's responsibility to handle that? You will essentially end up in the same place if you make the changes and restart the device plugin without draining the node.

Now, assuming the Operator (human or otherwise) knows what it is doing, does it make sense that the sriov device plugin reports and allocates resources based on the current configuration and node state instead of what it discovered on startup? If both the ConfigMap and the node state remain unchanged, the behaviour would be the same.

zshi-redhat commented 4 years ago

I would say there are two cases that need to be handled:

1. node state has changed (driver bound, sriov created), causing a change to a resource

I think there are two cases wrt node state changes: 1) changes that affect existing VFs, and 2) changes that do not affect existing VFs.

For 1), we need to first drain the node and then apply the change, otherwise it may impact the running workload; so dynamic monitoring won't be useful here, as we would need to restart the device plugin pod or daemon anyway. For 2), dynamic monitoring in the device plugin would allow discovery of a new resource pool.

We would want to see 2) supported, without breaking the 1) scenario.

2. config map changed

Same for a ConfigMap change: we need to consider adding dynamic monitoring without breaking the use of existing resources.

Another aspect: if the ConfigMap change splits an existing resource pool (whose resources have been allocated to pods) into different pools, we might need to consider adding a device health check and reporting the allocated device as unhealthy in the new resource pool via the DP API. This way it won't break the running workload, and it allows the allocated resource to be returned to the new pool once the previous pod is deleted.
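
A minimal sketch of that idea using the device plugin API's health field; the helper names and the source of the allocated IDs are assumptions, not existing plugin code:

```go
// Hypothetical sketch only: when a ConfigMap change moves still-allocated devices
// into a new pool, advertise them as unhealthy via the device plugin API so the
// scheduler cannot hand them out again until the previous pod is gone.
package main

import (
	"fmt"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// buildDeviceList marks any device whose ID appears in allocatedIDs as unhealthy.
func buildDeviceList(poolDeviceIDs []string, allocatedIDs map[string]bool) []*pluginapi.Device {
	devs := make([]*pluginapi.Device, 0, len(poolDeviceIDs))
	for _, id := range poolDeviceIDs {
		health := pluginapi.Healthy
		if allocatedIDs[id] {
			// Still in use by a pod created under the old pool layout: expose it
			// in the new pool, but keep it unschedulable for now.
			health = pluginapi.Unhealthy
		}
		devs = append(devs, &pluginapi.Device{ID: id, Health: health})
	}
	return devs
}

func main() {
	pool := []string{"0000:3b:02.0", "0000:3b:02.1"}
	allocated := map[string]bool{"0000:3b:02.0": true}
	// The resulting slice would be sent in a ListAndWatchResponse to the kubelet.
	for _, d := range buildDeviceList(pool, allocated) {
		fmt.Printf("%s -> %s\n", d.ID, d.Health)
	}
}
```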

Today, in both cases a restart of the device plugin is required to properly report updated resources to the kubelet. I'm +1 on adding support for this.

@zshi-redhat having a periodic monitor for 1 and 2, and updating the sriov device plugin resources when needed, would also benefit sriov-network-operator, which today restarts (deletes the pods of) the sriov device plugin on configuration change. WDYT?

If the device plugin can keep track of allocated vs. unallocated devices, then the DP can intelligently decide whether it needs to restart or can dynamically expose newly created resources. Meanwhile, the device plugin could provide a callback function for the operator to get the restart decision, since restarting is triggered by the operator.

xagent003 commented 4 years ago

You will essentially end up in the same place if you make the changes and restart the device plugin without draining the node.

I agree with that. We understand that it is disruptive to reconfigure VFs under a NIC. For example, to change the number of VFs on a NIC, you have to clear it by setting it to 0 first; you can't just write a new non-zero number to the sriov_numvfs file. At least for us, we make it clear that if you do this, you had better have already scheduled Pods/VMs off this node, or expect to lose connectivity. Is there even any other way to reconfigure VFs without disrupting allocated resources?
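
For reference, a minimal sketch of that reconfiguration sequence via sysfs; the interface name and function are illustrative, and this destroys the existing VFs:

```go
// Hypothetical sketch only: the kernel typically rejects writing a non-zero VF
// count over another non-zero count (EBUSY), so the PF must be cleared to 0 first.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setNumVFs resets a PF's VF count to zero and then creates numVFs new VFs.
func setNumVFs(pfName string, numVFs int) error {
	path := filepath.Join("/sys/class/net", pfName, "device", "sriov_numvfs")
	if err := os.WriteFile(path, []byte("0"), 0o644); err != nil {
		return fmt.Errorf("clearing VFs on %s: %w", pfName, err)
	}
	if err := os.WriteFile(path, []byte(fmt.Sprintf("%d", numVFs)), 0o644); err != nil {
		return fmt.Errorf("creating %d VFs on %s: %w", numVFs, pfName, err)
	}
	return nil
}

func main() {
	// Example interface name; requires root and drains nothing by itself.
	if err := setNumVFs("ens1f0", 8); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```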

zshi-redhat commented 3 years ago

Is there even any other way to reconfigure VFs without disrupting allocated resources?

My understanding is no; @martinkennelly @adrianchiris may have more input.

martinkennelly commented 3 years ago

For Intel NICs (X700 & E800 series) right now, you would have to reset the VF number to zero and then set it to your desired amount, so it disrupts allocated resources.

adrianchiris commented 3 years ago

For Intel NICs (X700 & E800 series) right now, you would have to reset the VF number to zero and then set it to your desired amount, so it disrupts allocated resources.

Same for Nvidia (Mellanox) NICs

However, it is possible to modify a VF's "capabilities"; e.g. when you load RDMA drivers, a VF will then have an RDMA device associated with it.

There are plans in the kernel to create "top level" devices for a PF/VF with devlink (e.g. rdma, sf).

ipatrykx commented 3 years ago

While this is not really a solution for the original issue (which can be really annoying, I can confirm that), I think it might be beneficial to create something like an sriovdp-ctl tool that could trigger some kind of rescan command instead of deleting the pod and recreating it. The main benefits I can think of are:

  1. Improved UX - instead of finding the SRIOV DP pod, deleting it and recreating it (which additionally takes some time), the user could just type something like kubectl sriovdp rescan <NODE_NAME> to refresh the resource list.
  2. This would not break the current approach, as it would still be executed manually (that's why it's not really an answer to the original issue here).
  3. Also, if the DP exposes some kind of API for this, then maybe operators (like sriov-network-operator) could use this approach instead of deleting the pods, like @adrianchiris said (although I'm not sure yet how such communication would look).
  4. We could add additional commands if necessary. Maybe something like kubectl sriovdp get <NODE_NAME> to get a list of the resources available on the node, or kubectl sriovdp logs <NODE_NAME> to get logs (getting logs from the DP is also a bit annoying for me, as you still have to find the name of the pod running on a particular node, and if you delete the pod the name changes; that can of course be overcome by defining some bash functions, but it's still inconvenient).
  5. This would take advantage of the fact that, even though the DP scan runs just once, the pod is still a long-living process.

This would also be somewhat similar to the approach that might soon be implemented in multus, according to this issue: https://github.com/k8snetworkplumbingwg/multus-cni/issues/488

And if so, maybe some of the codebase for this could be reused for similar NPWG tools, and maybe it could become part of the common utils repository discussed here: https://github.com/k8snetworkplumbingwg/sriov-cni/issues/187

But of course, this is just an idea, so I would love to hear your thoughts.
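
For illustration, a minimal skeleton of such a kubectl plugin: kubectl picks up any executable named kubectl-<name> on PATH, so a kubectl-sriovdp binary would provide the commands above. The rescan/get/logs actions here are placeholders; no such API exists in the device plugin today:

```go
// Hypothetical skeleton only: a binary installed as kubectl-sriovdp on PATH
// would make "kubectl sriovdp rescan <NODE_NAME>" work. The actions are stubs.
package main

import (
	"fmt"
	"os"
)

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: kubectl sriovdp <rescan|get|logs> <NODE_NAME>")
		os.Exit(1)
	}
	cmd, node := os.Args[1], os.Args[2]
	switch cmd {
	case "rescan":
		// A real implementation would signal the device plugin pod on the node,
		// e.g. via a node annotation or a small API exposed by the DP.
		fmt.Printf("would trigger a resource rescan on node %s\n", node)
	case "get":
		fmt.Printf("would list sriov resources reported by node %s\n", node)
	case "logs":
		fmt.Printf("would fetch device plugin logs from node %s\n", node)
	default:
		fmt.Fprintf(os.Stderr, "unknown command %q\n", cmd)
		os.Exit(1)
	}
}
```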

adrianchiris commented 3 years ago

Adding a kubectl plugin would definitely be nice IMO!

Regarding the rescan, I think the sriov dp should be reactive (to some extent) to both the system state and the config map state.

I'll bring this up in Tuesday's meeting; let's see if we can get additional input.

zshi-redhat commented 3 years ago

Adding a kubectl plugin would definitely be nice IMO!

+1

Regarding the rescan, I think the sriov dp should be reactive (to some extent) to both the system state and the config map state.

One question is how we can get the state of a VF that is attached to an sriov pod.

I'll bring this up in Tuesday's meeting; let's see if we can get additional input.

adrianchiris commented 3 years ago

This has been discussed in today's community meeting.