xagent003 opened this issue 4 years ago
I would say there are cases that need to be handled:
1. node state has changed (driver bound, sriov created) that caused a change to a resource
2. config map changed

Today, in both cases, a restart of the device plugin is required to properly report updated resources to kubelet. I'm +1 on adding support for it.
@zshi-redhat having a periodic monitor for 1 and 2, and updating sriov device plugin resources when needed, would also benefit sriov-network-operator, which today restarts (deletes the pods of) the sriov device plugin on configuration change. WDYT?
How would we handle already allocated resources?
@amorenoz can you elaborate? Do you mean keeping the IDs of existing devices?
@adrianchiris I mean the Device Plugin does not know which of its resources have already been allocated to pods. It is called on Allocate() but not on pod teardown. This may make handling of some changes challenging, e.g.:
- A configMap change removes an allocated resource.
- A configMap change modifies the way an allocated device is exposed (e.g. rdma, vdpa, vhost-net, etc.), but the device (and its information file) is already being used.
- A configMap change moves a device to a different pool which the pod that is currently using the device is not entitled to use.
Maybe if a change in the node/configMap is detected, the DP could query kubelet and only proceed with the resource pool reconfiguration if none of its resources are currently being used (and during the reconfiguration process, it must return an error on Allocate()).
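As a rough sketch of what that kubelet query could look like, assuming the kubelet PodResources gRPC endpoint at `/var/lib/kubelet/pod-resources/kubelet.sock` is available; the socket path and the resource-name prefix used in `main()` are illustrative assumptions, not existing plugin code:

```go
// Sketch: ask kubelet which devices it has already handed out to pods, via the
// PodResources List API. The socket path and the "intel.com/" prefix used in
// main() are illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

const kubeletSocket = "unix:///var/lib/kubelet/pod-resources/kubelet.sock"

// allocatedDeviceIDs returns the device IDs kubelet reports as allocated for
// resources whose name starts with the given prefix.
func allocatedDeviceIDs(prefix string) (map[string]bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, kubeletSocket,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, fmt.Errorf("dial kubelet: %w", err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		return nil, fmt.Errorf("list pod resources: %w", err)
	}

	inUse := map[string]bool{}
	for _, pod := range resp.GetPodResources() {
		for _, container := range pod.GetContainers() {
			for _, dev := range container.GetDevices() {
				if !strings.HasPrefix(dev.GetResourceName(), prefix) {
					continue
				}
				for _, id := range dev.GetDeviceIds() {
					inUse[id] = true
				}
			}
		}
	}
	return inUse, nil
}

func main() {
	ids, err := allocatedDeviceIDs("intel.com/")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Println("allocated device IDs:", ids)
}
```

With such a view of allocated devices, the plugin could hold off reconfiguring any pool whose devices are still in use.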
I understand your point. But should it be the device plugin's responsibility to handle that? You will essentially end up the same if you do the changes and restart the device plugin without draining the node.
Now, assuming the Operator (human or otherwise) knows what he/it is doing, does it make sense that the sriov device plugin reports and allocates resources based on the current configuration and node state instead of what it discovered on startup? In case both the configMap and node state remain unchanged, it would be the same.
> I would say there are cases that need to be handled:
> 1. node state has changed (driver bound, sriov created) that caused a change to a resource
I think there are two cases w.r.t. node state changes: 1) changes are applied to existing VFs, 2) changes are not applied to existing VFs.
For 1), it needs to first drain the node and then apply the change, otherwise it may impact the running workload, so dynamic monitoring won't be useful here as we would anyway need to restart the device plugin pod or daemon. For 2), dynamic monitoring in the device plugin would allow discovery of a new resource pool.
We would want to see 2) supported, without breaking the 1) scenario.
> 2. config map changed
Same for a configmap change: we need to consider adding dynamic monitoring without breaking the use of existing resources.
Another aspect is that if the change of configmap splits existing resource pools (whose resources have been allocated to pods) into different pools, we might need to consider adding a device health check and reporting the allocated device as unhealthy in the new resource pool via the DP API (this way, it won't break the running workload and allows the allocated resource to be returned to the new pool once the previous pod is deleted).
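For reference, a minimal sketch of what reporting such a device as unhealthy could look like through the device plugin API; the helper names and the way the stream is obtained are assumptions, not the plugin's actual code:

```go
// Sketch: mark an already-allocated device as Unhealthy in the pool it was
// moved into, so kubelet won't hand it out again until the old pod is gone.
// The stream comes from a ListAndWatch implementation; helpers are illustrative.
package sketch

import (
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// buildDeviceList exposes all discovered devices, flagging the ones that are
// still held by running pods as Unhealthy instead of dropping them.
func buildDeviceList(discovered []string, stillAllocated map[string]bool) []*pluginapi.Device {
	devs := make([]*pluginapi.Device, 0, len(discovered))
	for _, id := range discovered {
		health := pluginapi.Healthy
		if stillAllocated[id] {
			health = pluginapi.Unhealthy
		}
		devs = append(devs, &pluginapi.Device{ID: id, Health: health})
	}
	return devs
}

// sendUpdate pushes the current device view to kubelet over the
// ListAndWatch stream of the device plugin gRPC server.
func sendUpdate(stream pluginapi.DevicePlugin_ListAndWatchServer, devs []*pluginapi.Device) error {
	return stream.Send(&pluginapi.ListAndWatchResponse{Devices: devs})
}
```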
> Today, in both cases, a restart of the device plugin is required to properly report updated resources to kubelet. I'm +1 on adding support for it.
> @zshi-redhat having a periodic monitor for 1 and 2, and updating sriov device plugin resources when needed, would also benefit sriov-network-operator, which today restarts (deletes the pods of) the sriov device plugin on configuration change. WDYT?
If we can have the device plugin keep track of allocated devices vs. un-allocated devices, then the DP can intelligently decide whether it needs to restart or can dynamically expose newly created resources. Meanwhile, the device plugin could provide a callback function for the operator to get the decision on restarting, because restarting is triggered by the operator.
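As an illustration of that tracking idea (the type and callback names below are hypothetical, not existing plugin code):

```go
// Sketch: keep track of which device IDs have been handed out via Allocate(),
// so the plugin can decide whether a detected change is safe to apply in place
// or a restart should be signalled to the operator. Note: the device plugin is
// only told about Allocate(), not about pod teardown, so this view only grows
// unless it is reconciled against kubelet's PodResources API.
package sketch

import "sync"

// AllocationTracker is a hypothetical helper; it does not exist in the plugin.
type AllocationTracker struct {
	mu        sync.Mutex
	allocated map[string]bool
	// OnRestartNeeded is a hypothetical callback the operator could register
	// to learn that an in-place update is not possible.
	OnRestartNeeded func(reason string)
}

func NewAllocationTracker() *AllocationTracker {
	return &AllocationTracker{allocated: map[string]bool{}}
}

// MarkAllocated would be called from the plugin's Allocate() handler.
func (t *AllocationTracker) MarkAllocated(ids ...string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, id := range ids {
		t.allocated[id] = true
	}
}

// SafeToReconfigure reports whether none of the devices affected by a change
// are currently allocated; if some are, the restart callback is invoked.
func (t *AllocationTracker) SafeToReconfigure(affected []string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, id := range affected {
		if t.allocated[id] {
			if t.OnRestartNeeded != nil {
				t.OnRestartNeeded("device " + id + " is still allocated")
			}
			return false
		}
	}
	return true
}
```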
> You will essentially end up the same if you do the changes and restart the device plugin without draining the node.
I agree with that. We understand that it is disruptive to reconfigure VFs under a NIC. For example, to change the number of VFs on a NIC, you have to clear it by setting it to 0 first; you can't just write a new non-zero number to the sriov_numvfs file. At least for us, we make it clear that if you do this, you had better have already scheduled Pods/VMs off this node, or expect to lose connectivity. Is there even any other way to reconfigure VFs without already disrupting allocated resources?
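For concreteness, a minimal sketch of that reset-then-set sequence via sysfs; the PF name is a placeholder and this would need to run as root on the node:

```go
// Sketch: change the VF count on a PF. The kernel rejects writing a new
// non-zero value directly, so the count must be reset to 0 first.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func setNumVFs(pf string, count int) error {
	path := filepath.Join("/sys/class/net", pf, "device", "sriov_numvfs")
	// Reset to 0 first; this destroys the existing VFs (disruptive!).
	if err := os.WriteFile(path, []byte("0"), 0644); err != nil {
		return fmt.Errorf("reset VFs on %s: %w", pf, err)
	}
	if err := os.WriteFile(path, []byte(fmt.Sprintf("%d", count)), 0644); err != nil {
		return fmt.Errorf("set %d VFs on %s: %w", count, pf, err)
	}
	return nil
}

func main() {
	// "ens1f0" is a placeholder PF name.
	if err := setNumVFs("ens1f0", 8); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```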
> Is there even any other way to reconfigure VFs without already disrupting allocated resources?
My understanding is no, @martinkennelly @adrianchiris may have more input.
For Intel NICs (X700 & E800 series) right now, you would have to reset the VF number to zero and then set it to your desired amount - therefore it disrupts allocated resources.
> For Intel NICs (X700 & E800 series) right now, you would have to reset the VF number to zero and then set it to your desired amount - therefore it disrupts allocated resources.
Same for Nvidia (Mellanox) NICs
However, it is possible to modify a VF's "capabilities", e.g. when you load rdma drivers, a VF will then have an RDMA device associated with it.
There are also plans in the kernel to create "top level" devices for a PF/VF with devlink (e.g. rdma, sf).
As it is not really a solution for the original issue (which can be really annoying, I can confirm that), I think it might be beneficial to create something like a `sriovdp-ctl` tool that could trigger some kind of `rescan` command instead of deleting the pod and recreating it (a rough sketch of such a plugin follows this comment). Main benefits I can think of are:
- `kubectl sriovdp rescan <NODE_NAME>` to refresh the resource list,
- `kubectl sriovdp get <NODE_NAME>` to get a list of the resources that are available on the node, or
- `kubectl sriovdp logs <NODE_NAME>` to get logs (getting logs from the DP is also a bit annoying for me, as you still have to find the name of the pod that runs on a particular node, and if you delete the pod the name will change; that can of course be overcome by defining some bash functions, but it is still inconvenient).

This would also be somewhat similar to the approach that might soon be implemented in multus, according to this issue: https://github.com/k8snetworkplumbingwg/multus-cni/issues/488
And if so, maybe some of the codebase for this could be reused for similar tools of the NPWG, and maybe that could become part of the common utils repository discussed here: https://github.com/k8snetworkplumbingwg/sriov-cni/issues/187
But of course, this is just an idea, so I would love to hear your thoughts.
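A minimal sketch of how such a `kubectl-sriovdp` plugin could be bootstrapped, assuming only the standard kubectl plugin mechanism (any executable named `kubectl-<name>` on PATH) plus a purely hypothetical convention of annotating the node to request a rescan; none of this exists in the device plugin today:

```go
// Sketch: entry point for a hypothetical kubectl-sriovdp plugin.
// "kubectl sriovdp rescan <node>" would set an annotation on the node that a
// future, reactive device plugin could watch for. The annotation key is made up.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const rescanAnnotation = "sriovdp.example.org/rescan-requested" // hypothetical key

func main() {
	if len(os.Args) != 3 || os.Args[1] != "rescan" {
		fmt.Fprintln(os.Stderr, "usage: kubectl sriovdp rescan <NODE_NAME>")
		os.Exit(1)
	}
	node := os.Args[2]

	// A real plugin would also honor the KUBECONFIG env var; kept simple here.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		fmt.Fprintln(os.Stderr, "load kubeconfig:", err)
		os.Exit(1)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		fmt.Fprintln(os.Stderr, "create client:", err)
		os.Exit(1)
	}

	// Patch the node with a timestamped annotation as the rescan trigger.
	patch := fmt.Sprintf(`{"metadata":{"annotations":{"%s":"%d"}}}`,
		rescanAnnotation, time.Now().Unix())
	_, err = clientset.CoreV1().Nodes().Patch(context.Background(), node,
		types.MergePatchType, []byte(patch), metav1.PatchOptions{})
	if err != nil {
		fmt.Fprintln(os.Stderr, "patch node:", err)
		os.Exit(1)
	}
	fmt.Printf("rescan requested on node %s\n", node)
}
```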
Adding a kubectl plugin would definitely be nice IMO!
Regarding the rescan, I think the sriov dp should be reactive (to some extent) to both the system state and the config map state.
I'll bring this up in Tuesday's meeting; let's see if we can get additional input.
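As one illustration of reacting to the config map state, a minimal sketch using fsnotify to watch the mounted config file; the `/etc/pcidp/config.json` path and the reload hook are assumptions about the deployment, not existing plugin code:

```go
// Sketch: watch the mounted ConfigMap file and trigger a resource re-sync
// when it changes. Uses fsnotify; the path and reload hook are illustrative.
package main

import (
	"log"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

const configPath = "/etc/pcidp/config.json" // assumed mount path

func watchConfig(reload func()) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer watcher.Close()

	// Watch the directory rather than the file itself: ConfigMap updates are
	// applied via symlink swaps, which show up as Create/Remove events.
	if err := watcher.Add(filepath.Dir(configPath)); err != nil {
		return err
	}

	for {
		select {
		case ev, ok := <-watcher.Events:
			if !ok {
				return nil
			}
			if ev.Op&(fsnotify.Write|fsnotify.Create) != 0 {
				log.Printf("config change detected (%s), re-syncing resources", ev.Name)
				reload()
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return nil
			}
			log.Printf("watch error: %v", err)
		}
	}
}

func main() {
	if err := watchConfig(func() { log.Println("TODO: rebuild resource pools") }); err != nil {
		log.Fatal(err)
	}
}
```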
> Adding a kubectl plugin would definitely be nice IMO!
+1
> Regarding the rescan, I think the sriov dp should be reactive (to some extent) to both the system state and the config map state.
One question is how we can get the state of a VF that is attached to a sriov pod.
> I'll bring this up in Tuesday's meeting; let's see if we can get additional input.
This has been discussed in today's community meeting.
The device plugin only detects device info as a one-shot when it starts up. VFs need to be created and drivers bound (if using something besides the default kernel driver) BEFORE we create the sriov device-plugin's ConfigMap and spin up the device-plugin daemonset. I accidentally did it the other way around and did not see the device plugin discovering the newly created resource; the capacity/allocatable was reported as zero.
The only way to fix this was to kill each pod (it is easier to just kubectl delete the daemonset and then re-apply it).
IMO, it should periodically monitor devices under its ConfigMap.
There may be cases where the VFs created under a NIC change.
Also, periodic monitoring wouldn't place a strict requirement on the workflow of node configuration. We could spin up the DevicePlugin and its configMap and configure the host networking later, as in my case, where I deployed multus/sriov/deviceplugin and then realized I forgot to actually write to the "sriov_numvfs" file.
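A minimal sketch of what such periodic monitoring could look like inside a device plugin, assuming a `discoverDevices()` helper that re-scans the host and a ListAndWatch stream to push updates through; both are placeholders, not existing plugin code:

```go
// Sketch: periodically re-discover devices and push updates to kubelet when
// the set changes. discoverDevices() is a placeholder for the plugin's own
// sysfs/ConfigMap-driven discovery logic.
package sketch

import (
	"reflect"
	"time"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// periodicRescan re-runs discovery on every tick and sends a new device list
// over the ListAndWatch stream whenever it differs from the last one sent.
func periodicRescan(stream pluginapi.DevicePlugin_ListAndWatchServer,
	discoverDevices func() []*pluginapi.Device, interval time.Duration, stop <-chan struct{}) error {

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var last []*pluginapi.Device
	for {
		select {
		case <-stop:
			return nil
		case <-ticker.C:
			current := discoverDevices()
			if reflect.DeepEqual(current, last) {
				continue // nothing changed, no update needed
			}
			if err := stream.Send(&pluginapi.ListAndWatchResponse{Devices: current}); err != nil {
				return err
			}
			last = current
		}
	}
}
```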