Open jslouisyou opened 3 weeks ago
Hi @jslouisyou, can you share the device plugin configmap please?
Hi @SchSeba,
I found there are 2 configmaps in the `sriov-network-operator` namespace, named `device-plugin-config` and `supported-nic-ids`, and here are their contents.
* `device-plugin-config`
apiVersion: v1
data:
  worker01: '{"resourceList":null}'
  worker02: '{"resourceList":null}'
  worker03: '{"resourceList":null}'
  gpu-003: '{"resourceList":[{"resourceName":"gpu_mlnx_ib0","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs17"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib1","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs18"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib2","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs19"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib3","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs20"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib4","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs21f1"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib5","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs22f1"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib0","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp27s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib1","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp103s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib2","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp157s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib3","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp211s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib4","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibp193s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}'
  gpu-008: '{"resourceList":[{"resourceName":"gpu_mlnx_ib0","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs17"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib1","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs18"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib2","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs19"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib3","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs20"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib4","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs21f1"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib5","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs22f1"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib0","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp27s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib1","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp103s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib2","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp157s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib3","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp211s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib4","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibp193s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}'
kind: ConfigMap
metadata:
  creationTimestamp: "2024-06-05T05:29:32Z"
  name: device-plugin-config
  namespace: sriov-network-operator
  resourceVersion: "12170"
  uid: 5e9848bd-2ba7-4205-a4be-4bbeff894fec
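Each per-node value in this ConfigMap is a JSON string, which is hard to read in one line. As a rough sketch (not part of the operator), the following Python snippet lists the resource names with their `devices` and `pfNames` selectors. The `node_config` string is an abbreviated copy of the `gpu-003` value above (two of its eleven entries), and `summarize_resources` is a hypothetical helper name.

```python
import json

# Abbreviated copy of the gpu-003 value from the ConfigMap above
# (two of its eleven resourceList entries).
node_config = (
    '{"resourceList":[{"resourceName":"gpu_mlnx_ib0","selectors":{"vendors":["15b3"],'
    '"devices":["101c"],"pfNames":["ibs17"],"linkTypes":["infiniband"],"IsRdma":true,'
    '"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib0",'
    '"selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp27s0"],'
    '"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}'
)


def summarize_resources(raw):
    """Return (resourceName, devices, pfNames) tuples for one node entry."""
    # resourceList is null for the worker nodes, so fall back to an empty list.
    resource_list = json.loads(raw).get("resourceList") or []
    return [
        (r["resourceName"], r["selectors"]["devices"], r["selectors"]["pfNames"])
        for r in resource_list
    ]


for name, devices, pf_names in summarize_resources(node_config):
    print(f"{name}: devices={devices} pfNames={pf_names}")
```

Passing a worker value such as `'{"resourceList":null}'` returns an empty list.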
* `supported-nic-ids`
apiVersion: v1
data:
  Broadcom_bnxt_BCM57414_2x25G: 14e4 16d7 16dc
  Broadcom_bnxt_BCM75508_2x100G: 14e4 1750 1806
  Intel_i40e_10G_X710_SFP: 8086 1572 154c
  Intel_i40e_25G_SFP28: 8086 158b 154c
  Intel_i40e_40G_XL710_QSFP: 8086 1583 154c
  Intel_i40e_XXV710: 8086 158a 154c
  Intel_i40e_XXV710_N3000: 8086 0d58 154c
  Intel_ice_Columbiaville_E810: 8086 1591 1889
  Intel_ice_Columbiaville_E810-CQDA2_2CQDA2: 8086 1592 1889
  Intel_ice_Columbiaville_E810-XXVDA2: 8086 159b 1889
  Intel_ice_Columbiaville_E810-XXVDA4: 8086 1593 1889
  Nvidia_mlx5_ConnectX-4: 15b3 1013 1014
  Nvidia_mlx5_ConnectX-4LX: 15b3 1015 1016
  Nvidia_mlx5_ConnectX-5: 15b3 1017 1018
  Nvidia_mlx5_ConnectX-5_Ex: 15b3 1019 101a
  Nvidia_mlx5_ConnectX-6: 15b3 101b 101c
  Nvidia_mlx5_ConnectX-6_Dx: 15b3 101d 101e
  Nvidia_mlx5_ConnectX-7: 15b3 1021 101e
  Nvidia_mlx5_MT42822_BlueField-2_integrated_ConnectX-6_Dx: 15b3 a2d6 101e
  Qlogic_qede_QL45000_50G: 1077 1654 1664
  Red_Hat_Virtio_network_device: 1af4 1000 1000
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: sriov-network-operator
    meta.helm.sh/release-namespace: sriov-network-operator
  creationTimestamp: "2024-06-05T05:29:22Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: supported-nic-ids
  namespace: sriov-network-operator
  resourceVersion: "10770"
  uid: 15d5826e-2e56-4094-8a60-1567beda154b
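To relate the `devices` selectors in `device-plugin-config` (`101c`, `101e`) to this table, here is a small Python sketch. It assumes each value is ordered as vendor ID, PF device ID, VF device ID, which is my reading of the entries and is not stated in this thread. `nics_for_vf_device` is a hypothetical helper, and only a few entries are copied from the ConfigMap.

```python
# A few entries copied from the supported-nic-ids ConfigMap above.
supported_nic_ids = {
    "Nvidia_mlx5_ConnectX-6": "15b3 101b 101c",
    "Nvidia_mlx5_ConnectX-6_Dx": "15b3 101d 101e",
    "Nvidia_mlx5_ConnectX-7": "15b3 1021 101e",
    "Nvidia_mlx5_MT42822_BlueField-2_integrated_ConnectX-6_Dx": "15b3 a2d6 101e",
}


def nics_for_vf_device(vendor, vf_device, table):
    """Return NIC names whose entry matches the vendor ID and VF device ID."""
    matches = []
    for name, ids in table.items():
        # Assumed field order: vendor ID, PF device ID, VF device ID.
        entry_vendor, _pf_device, entry_vf = ids.split()
        if entry_vendor == vendor and entry_vf == vf_device:
            matches.append(name)
    return matches


print(nics_for_vf_device("15b3", "101c", supported_nic_ids))
print(nics_for_vf_device("15b3", "101e", supported_nic_ids))
```

Under that assumption, `101e` is shared by several entries, so the device ID alone does not identify a single NIC model.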
What happened?
The Node `Capacity` and `Allocatable` numbers are wrong after restarting `sriov-network-device-plugin` if any pods have SR-IOV IB VFs attached.
What did you expect to happen?
`openshift.io/gpu_mlnx_ib#` should be 8 in all VFs.
What are the minimal steps needed to reproduce the bug?
* `sriov-network-operator` version v1.2.0
* `sriov-device-plugin` daemonset
* `Capacity` and `Allocatable` shows full capacity or not
Anything else we need to know?
Several issues have already been raised and commits pushed, but it seems this issue has not been fixed yet. xref: https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/issues/276, https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/issues/521
After restarting `sriov-device-plugin`, `kubelet` says that `sriov-device-plugin` pushed its state like below:
Even though I changed the image version of all components to `latest`, this issue still occurs.
I'm using A100 and H100 nodes.
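The capacity check from the reproduction steps can be sketched in Python as follows. The expected count of 8 and the `openshift.io/gpu_mlnx_ib` prefix come from this report. `find_mismatches` is a hypothetical helper, and `sample_node` is made-up illustration data, not output from the affected cluster.

```python
EXPECTED_VFS = 8                     # expected count, taken from the report above
PREFIX = "openshift.io/gpu_mlnx_ib"  # resource-name prefix, taken from the report above


def find_mismatches(node, expected=EXPECTED_VFS, prefix=PREFIX):
    """Return {resource: (capacity, allocatable)} for entries that differ from expected."""
    status = node.get("status", {})
    capacity = status.get("capacity", {})
    allocatable = status.get("allocatable", {})
    mismatches = {}
    for name in sorted(set(capacity) | set(allocatable)):
        if not name.startswith(prefix):
            continue  # skip cpu, memory and other unrelated resources
        # Extended resource quantities are integer strings in the Node object.
        cap = int(capacity.get(name, "0"))
        alloc = int(allocatable.get(name, "0"))
        if (cap, alloc) != (expected, expected):
            mismatches[name] = (cap, alloc)
    return mismatches


# Hypothetical node status for illustration only; these numbers are NOT from the report.
sample_node = {
    "status": {
        "capacity": {
            "cpu": "128",
            "openshift.io/gpu_mlnx_ib0": "8",
            "openshift.io/gpu_mlnx_ib1": "8",
        },
        "allocatable": {
            "cpu": "128",
            "openshift.io/gpu_mlnx_ib0": "8",
            "openshift.io/gpu_mlnx_ib1": "7",
        },
    }
}

print(find_mismatches(sample_node))
```

To run it against a real node, the output of `kubectl get node <node-name> -o json` can be loaded with `json.load` and passed to `find_mismatches` in place of `sample_node`.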
Component Versions
Please fill in the below table with the version numbers of components used.
(`sriovCni`: v2.6.3 and `ibSriovCni`: v1.0.2)
Config Files
Config file locations may be config dependent.
Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')
CNI config (Try '/etc/cni/net.d/')
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
Kubeconfig file
SR-IOV Network Custom Resource Definition
Logs
SR-IOV Network Device Plugin Logs (use `kubectl logs $PODNAME`)
Multus logs (If enabled. Try '/var/log/multus.log')
Kubelet logs (journalctl -u kubelet)