k8snetworkplumbingwg / sriov-network-operator

Operator for provisioning and configuring SR-IOV CNI plugin and device plugin
Apache License 2.0

Is there a way to automatically assign a VF with GPU affinity to pods? #736

Open cyclinder opened 1 month ago

cyclinder commented 1 month ago

[image: topology matrix showing the PCIe placement of GPUs (GPU0, ...) and NICs (mlx5_*)]

If the GPU and NIC are on the same PCIe bridge, or their topology distance is at least PHB, then communication between them can be accelerated by enabling GPUDirect RDMA.

SchSeba commented 1 month ago

That is a Kubernetes feature: you can configure the device manager and the Topology Manager, and check the topology policy.

https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/#policy-single-numa-node
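For reference, the `single-numa-node` policy from that page is enabled through the kubelet configuration. A minimal sketch (field names as documented for `KubeletConfiguration`; the `pod` scope is optional and shown only as an example):

```yaml
# Fragment of the kubelet's KubeletConfiguration file.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reject pods whose requested devices/CPUs cannot all be placed on one NUMA node.
topologyManagerPolicy: single-numa-node
# Align all containers of the pod together rather than container-by-container.
topologyManagerScope: pod
```

Note this only guarantees NUMA-node alignment, not alignment at the PCIe-bridge level.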

cyclinder commented 1 month ago

Thanks for your reply. I think even if the GPU and NIC are on the same NUMA node, they may still be separated by a PCIe bridge (as shown in the figure above with GPU0 and mlx5_3), so in that case we cannot enable GPUDirect RDMA. Devices on the same NUMA node can still be a large topology distance apart; we need a smaller distance.
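To make the distinction concrete: the sysfs path of a PCI device encodes the chain of bridges between it and the root complex, so the shared path prefix of two devices tells you roughly how far apart they are. Below is a hypothetical helper (not part of sriov-network-operator) that maps two sysfs device paths to an nvidia-smi-style label; the function name and labels are my own.

```python
def pcie_relation(path_a: str, path_b: str) -> str:
    """Rough topology label for two PCI devices from their sysfs paths.

    A path like /sys/devices/pci0000:00/0000:00:03.0/0000:05:00.0 lists the
    root complex (pci0000:00) and each bridge above the device.
    """
    # Drop the device itself; compare only the chains of ancestors.
    a = path_a.strip("/").split("/")[:-1]
    b = path_b.strip("/").split("/")[:-1]

    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1

    if common <= 2:       # only "sys/devices" shared: different root complexes
        return "NODE_OR_SYS"
    if common == 3:       # only the host bridge (pciXXXX:XX) is shared
        return "PHB"
    return "PIX_OR_PXB"   # a common PCIe bridge below the root is shared


gpu = "/sys/devices/pci0000:00/0000:00:03.0/0000:05:00.0"
nic = "/sys/devices/pci0000:00/0000:00:02.0/0000:04:00.0"
print(pcie_relation(gpu, nic))  # same root complex, different bridges
```

A scheduler-side component would need something like this per node to prefer VFs whose relation to the allocated GPU is `PIX_OR_PXB` rather than merely same-NUMA.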

adrianchiris commented 1 month ago

Currently there is no solution that I'm aware of which takes PCIe topology into account.

DRA (Dynamic Resource Allocation) aims to solve that, but there is still a way to go....
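For readers unfamiliar with DRA: instead of requesting an opaque extended resource count, a pod references a ResourceClaim that a vendor driver satisfies, which is what would let a driver co-allocate a GPU and a VF behind the same PCIe switch. A sketch of the shape of such a claim (the API is alpha and has changed across releases; `resourceClassName` and the class name here are illustrative only):

```yaml
# Illustrative ResourceClaim shape under the v1alpha2 DRA API.
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: gpu-with-aligned-vf   # hypothetical name
spec:
  # A hypothetical class whose driver understands PCIe-topology constraints.
  resourceClassName: gpu-rdma-aligned
```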