Hi @jslouisyou, can you share the device plugin ConfigMap, please?
Hi @SchSeba,
I found two ConfigMaps in the `sriov-network-operator` namespace, named `device-plugin-config` and `supported-nic-ids`. Here are their contents.
* `device-plugin-config`
apiVersion: v1
data:
  worker01: '{"resourceList":null}'
  worker02: '{"resourceList":null}'
  worker03: '{"resourceList":null}'
  gpu-003: '{"resourceList":[{"resourceName":"gpu_mlnx_ib0","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs17"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib1","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs18"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib2","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs19"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib3","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs20"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib4","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs21f1"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib5","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs22f1"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib0","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp27s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib1","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp103s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib2","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp157s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib3","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp211s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib4","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibp193s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}'
  gpu-008: '{"resourceList":[{"resourceName":"gpu_mlnx_ib0","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs17"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib1","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs18"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib2","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs19"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib3","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs20"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib4","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs21f1"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu_mlnx_ib5","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibs22f1"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib0","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp27s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib1","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp103s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib2","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp157s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib3","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp211s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib4","selectors":{"vendors":["15b3"],"devices":["101c"],"pfNames":["ibp193s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}'
kind: ConfigMap
metadata:
  creationTimestamp: "2024-06-05T05:29:32Z"
  name: device-plugin-config
  namespace: sriov-network-operator
  resourceVersion: "12170"
  uid: 5e9848bd-2ba7-4205-a4be-4bbeff894fec
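Each per-node value in this ConfigMap is a JSON string, which is hard to read at a glance. As a rough sketch (not part of the operator; the helper name `summarize` is made up for illustration), the following Python turns one entry into a short list of resource name, PF names, and device IDs. The sample string is a shortened copy of a node entry, keeping only the two resources that appear in the device plugin logs in this thread.

```python
import json

# Shortened copy of one node entry from the ConfigMap (two of the eleven resources).
NODE_ENTRY = (
    '{"resourceList":[{"resourceName":"gpu2_mlnx_ib2","selectors":{"vendors":["15b3"],'
    '"devices":["101e"],"pfNames":["ibp157s0"],"linkTypes":["infiniband"],"IsRdma":true,'
    '"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib3","selectors":'
    '{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp211s0"],"linkTypes":["infiniband"],'
    '"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}'
)


def summarize(entry: str) -> list[tuple[str, list[str], list[str]]]:
    """Return (resourceName, pfNames, devices) for every resource in one node entry."""
    # Nodes without SR-IOV resources carry {"resourceList": null}, so fall back to [].
    resources = json.loads(entry).get("resourceList") or []
    return [
        (
            res["resourceName"],
            res["selectors"].get("pfNames", []),
            res["selectors"].get("devices", []),
        )
        for res in resources
    ]


if __name__ == "__main__":
    for name, pf_names, devices in summarize(NODE_ENTRY):
        print(f"{name}: pfNames={pf_names} devices={devices}")
```

Running the same helper on the `worker01` value (`{"resourceList":null}`) yields an empty list, which matches the nodes that expose no SR-IOV resources.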
* `supported-nic-ids`
apiVersion: v1
data:
  Broadcom_bnxt_BCM57414_2x25G: 14e4 16d7 16dc
  Broadcom_bnxt_BCM75508_2x100G: 14e4 1750 1806
  Intel_i40e_10G_X710_SFP: 8086 1572 154c
  Intel_i40e_25G_SFP28: 8086 158b 154c
  Intel_i40e_40G_XL710_QSFP: 8086 1583 154c
  Intel_i40e_XXV710: 8086 158a 154c
  Intel_i40e_XXV710_N3000: 8086 0d58 154c
  Intel_ice_Columbiaville_E810: 8086 1591 1889
  Intel_ice_Columbiaville_E810-CQDA2_2CQDA2: 8086 1592 1889
  Intel_ice_Columbiaville_E810-XXVDA2: 8086 159b 1889
  Intel_ice_Columbiaville_E810-XXVDA4: 8086 1593 1889
  Nvidia_mlx5_ConnectX-4: 15b3 1013 1014
  Nvidia_mlx5_ConnectX-4LX: 15b3 1015 1016
  Nvidia_mlx5_ConnectX-5: 15b3 1017 1018
  Nvidia_mlx5_ConnectX-5_Ex: 15b3 1019 101a
  Nvidia_mlx5_ConnectX-6: 15b3 101b 101c
  Nvidia_mlx5_ConnectX-6_Dx: 15b3 101d 101e
  Nvidia_mlx5_ConnectX-7: 15b3 1021 101e
  Nvidia_mlx5_MT42822_BlueField-2_integrated_ConnectX-6_Dx: 15b3 a2d6 101e
  Qlogic_qede_QL45000_50G: 1077 1654 1664
  Red_Hat_Virtio_network_device: 1af4 1000 1000
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: sriov-network-operator
    meta.helm.sh/release-namespace: sriov-network-operator
  creationTimestamp: "2024-06-05T05:29:22Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: supported-nic-ids
  namespace: sriov-network-operator
  resourceVersion: "10770"
  uid: 15d5826e-2e56-4094-8a60-1567beda154b
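To relate the `devices` selectors (`101c`, `101e`) to this table, here is a small Python sketch that looks up which NIC models list a given device ID. It assumes each value is laid out as "vendor ID, PF device ID, VF device ID"; that layout is my reading of the data and is not stated anywhere in this thread. The function name `models_for_device` is hypothetical, and only the four relevant Nvidia entries are copied in.

```python
# Subset of the supported-nic-ids data, copied from the ConfigMap.
SUPPORTED_NIC_IDS = {
    "Nvidia_mlx5_ConnectX-6": "15b3 101b 101c",
    "Nvidia_mlx5_ConnectX-6_Dx": "15b3 101d 101e",
    "Nvidia_mlx5_ConnectX-7": "15b3 1021 101e",
    "Nvidia_mlx5_MT42822_BlueField-2_integrated_ConnectX-6_Dx": "15b3 a2d6 101e",
}


def models_for_device(nic_ids: dict[str, str], vendor: str, device: str) -> list[str]:
    """Return the model names whose entry contains `device` for the given vendor."""
    matches = []
    for name, value in nic_ids.items():
        # Assumed layout: "<vendor> <PF device ID> <VF device ID>".
        entry_vendor, pf_id, vf_id = value.split()
        if entry_vendor == vendor and device in (pf_id, vf_id):
            matches.append(name)
    return sorted(matches)


if __name__ == "__main__":
    for device_id in ("101c", "101e"):
        print(device_id, models_for_device(SUPPORTED_NIC_IDS, "15b3", device_id))
```

Under that assumption, `101e` is shared by the ConnectX-6 Dx, ConnectX-7, and BlueField-2 entries, so the `devices` selector alone does not identify a single card model; the `pfNames` selector is what narrows each resource to one PF.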
Hi @jslouisyou, if I remember right, we introduced a check for `"linkTypes":["infiniband"]` that runs on the PF, which should fix the problem of the wrong number of devices after the reboot. Can you please try the latest device plugin and let us know?
Hi @SchSeba, I upgraded `sriov-network-device-plugin` to the latest version and tested again, but the issue still occurs.
Please let me know if I missed something, such as a configuration setting.
Hi @jslouisyou can you please provide logs from
Thanks!
Hi @SchSeba,
Before this test, I changed all image tags to `latest` and set `imagePullPolicy` to `Always` in order to pull the latest images.
<Internal Mirror Repository>/k8snetworkplumbingwg/sriov-cni:latest
<Internal Mirror Repository>/k8snetworkplumbingwg/ib-sriov-cni:latest
<Internal Mirror Repository>/k8snetworkplumbingwg/sriov-network-device-plugin:latest
<Internal Mirror Repository>/k8snetworkplumbingwg/network-resources-injector:latest
<Internal Mirror Repository>/k8snetworkplumbingwg/sriov-network-operator-config-daemon:latest
<Internal Mirror Repository>/k8snetworkplumbingwg/sriov-network-operator-webhook:latest
<Internal Mirror Repository>/k8snetworkplumbingwg/sriov-network-operator:latest
First, here's the log from `sriov-device-plugin` when it starts without any pods.
I0820 02:23:13.046505 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0820 02:23:13.046555 1 main.go:46] resource manager reading configs
I0820 02:23:13.046575 1 manager.go:86] raw ResourceList: {"resourceList":[{"resourceName":"gpu2_mlnx_ib2","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp157s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib3","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp211s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}
I0820 02:23:13.046649 1 factory.go:211] *types.NetDeviceSelectors for resource gpu2_mlnx_ib2 is [0xc0004d6120]
I0820 02:23:13.046660 1 factory.go:211] *types.NetDeviceSelectors for resource gpu2_mlnx_ib3 is [0xc0004d6480]
I0820 02:23:13.046663 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:gpu2_mlnx_ib2 DeviceType:netDevice ExcludeTopology:false Selectors:0xc000324318 AdditionalInfo:map[] SelectorObjs:[0xc0004d6120]} {ResourcePrefix: ResourceName:gpu2_mlnx_ib3 DeviceType:netDevice ExcludeTopology:false Selectors:0xc000324330 AdditionalInfo:map[] SelectorObjs:[0xc0004d6480]}]
I0820 02:23:13.046698 1 manager.go:217] validating resource name "openshift.io/gpu2_mlnx_ib2"
I0820 02:23:13.046709 1 manager.go:217] validating resource name "openshift.io/gpu2_mlnx_ib3"
I0820 02:23:13.046712 1 main.go:62] Discovering host devices
I0820 02:23:13.129037 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:2a:00.0 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:23:13.129264 1 utils.go:494] excluding interface eno12399: default route found: {Ifindex: 2 Dst: <nil> Src: <nil> Gw: 10.113.240.1 Flags: [] Table: 254 Realm: 0}
I0820 02:23:13.129298 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:2a:00.1 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:23:13.129407 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:2a:00.2 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:23:13.129507 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:2a:00.3 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:23:13.129591 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:41:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.129699 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:54:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.129823 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.131493 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.1 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.131590 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.2 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.131673 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.3 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.131769 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.4 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.131862 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.5 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.131945 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.6 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.132031 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.7 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.132121 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:01.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.132208 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:c1:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.132298 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.133917 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.1 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134009 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.2 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134099 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.3 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134184 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.4 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134264 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.5 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134338 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.6 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134420 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.7 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134480 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:01.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134545 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:e5:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.134644 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:2a:00.0 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:23:13.134651 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:2a:00.1 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:23:13.134654 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:2a:00.2 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:23:13.134656 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:2a:00.3 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:23:13.134658 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:41:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.134661 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:54:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.134665 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.134668 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.1 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134670 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.2 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134672 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.3 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134675 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.4 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134678 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.5 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134681 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.6 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134683 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.7 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134686 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:01.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134688 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:c1:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.134691 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.134694 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.1 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134697 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.2 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134699 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.3 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134702 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.4 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134705 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.5 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134708 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.6 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134710 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.7 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134712 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:01.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:23:13.134714 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:e5:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:23:13.134718 1 main.go:68] Initializing resource servers
I0820 02:23:13.134723 1 manager.go:117] number of config: 2
I0820 02:23:13.134732 1 manager.go:121] Creating new ResourcePool: gpu2_mlnx_ib2
I0820 02:23:13.134736 1 manager.go:122] DeviceType: netDevice
W0820 02:23:13.134750 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.1 not found. Are RDMA modules loaded?
W0820 02:23:13.134933 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.2 not found. Are RDMA modules loaded?
W0820 02:23:13.135054 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.3 not found. Are RDMA modules loaded?
I0820 02:23:13.136791 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.1. <nil>
I0820 02:23:13.137550 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.2. <nil>
I0820 02:23:13.138329 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.3. <nil>
I0820 02:23:13.138973 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.4. <nil>
I0820 02:23:13.139728 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.5. <nil>
I0820 02:23:13.140412 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.6. <nil>
I0820 02:23:13.141141 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.7. <nil>
I0820 02:23:13.141829 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:01.0. <nil>
I0820 02:23:13.143062 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.1. <nil>
I0820 02:23:13.143575 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.2. <nil>
I0820 02:23:13.144136 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.3. <nil>
I0820 02:23:13.144803 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.4. <nil>
I0820 02:23:13.145521 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.5. <nil>
I0820 02:23:13.146206 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.6. <nil>
I0820 02:23:13.147007 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.7. <nil>
I0820 02:23:13.147747 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:01.0. <nil>
I0820 02:23:13.148651 1 manager.go:138] initServers(): selector index 0 will register 8 devices
I0820 02:23:13.148659 1 factory.go:124] device added: [identifier: 0000:9d:00.1, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.148662 1 factory.go:124] device added: [identifier: 0000:9d:00.2, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.148665 1 factory.go:124] device added: [identifier: 0000:9d:00.3, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.148667 1 factory.go:124] device added: [identifier: 0000:9d:00.4, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.148669 1 factory.go:124] device added: [identifier: 0000:9d:00.5, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.148671 1 factory.go:124] device added: [identifier: 0000:9d:00.6, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.148673 1 factory.go:124] device added: [identifier: 0000:9d:00.7, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.148675 1 factory.go:124] device added: [identifier: 0000:9d:01.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.148687 1 manager.go:156] New resource server is created for gpu2_mlnx_ib2 ResourcePool
I0820 02:23:13.148692 1 manager.go:121] Creating new ResourcePool: gpu2_mlnx_ib3
I0820 02:23:13.148694 1 manager.go:122] DeviceType: netDevice
W0820 02:23:13.148705 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.1 not found. Are RDMA modules loaded?
W0820 02:23:13.148848 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.2 not found. Are RDMA modules loaded?
W0820 02:23:13.148968 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.3 not found. Are RDMA modules loaded?
I0820 02:23:13.150473 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.1. <nil>
I0820 02:23:13.151119 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.2. <nil>
.....
I0820 02:23:13.160741 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.7. <nil>
I0820 02:23:13.161432 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:01.0. <nil>
I0820 02:23:13.162404 1 manager.go:138] initServers(): selector index 0 will register 8 devices
I0820 02:23:13.162414 1 factory.go:124] device added: [identifier: 0000:d3:00.1, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.162419 1 factory.go:124] device added: [identifier: 0000:d3:00.2, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.162422 1 factory.go:124] device added: [identifier: 0000:d3:00.3, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.162424 1 factory.go:124] device added: [identifier: 0000:d3:00.4, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.162426 1 factory.go:124] device added: [identifier: 0000:d3:00.5, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.162427 1 factory.go:124] device added: [identifier: 0000:d3:00.6, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.162429 1 factory.go:124] device added: [identifier: 0000:d3:00.7, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.162431 1 factory.go:124] device added: [identifier: 0000:d3:01.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.162446 1 manager.go:156] New resource server is created for gpu2_mlnx_ib3 ResourcePool
I0820 02:23:13.162451 1 main.go:74] Starting all servers...
I0820 02:23:13.162705 1 server.go:254] starting gpu2_mlnx_ib2 device plugin endpoint at: openshift.io_gpu2_mlnx_ib2.sock
I0820 02:23:13.162997 1 server.go:254] starting gpu2_mlnx_ib3 device plugin endpoint at: openshift.io_gpu2_mlnx_ib3.sock
I0820 02:23:13.163019 1 main.go:79] All servers started.
I0820 02:23:13.163023 1 main.go:80] Listening for term signals
I0820 02:23:13.871003 1 server.go:116] Plugin: openshift.io_gpu2_mlnx_ib3.sock gets registered successfully at Kubelet
I0820 02:23:13.870993 1 server.go:116] Plugin: openshift.io_gpu2_mlnx_ib2.sock gets registered successfully at Kubelet
I0820 02:23:13.871008 1 server.go:157] ListAndWatch(gpu2_mlnx_ib3) invoked
I0820 02:23:13.870984 1 server.go:157] ListAndWatch(gpu2_mlnx_ib2) invoked
I0820 02:23:13.871026 1 server.go:170] ListAndWatch(gpu2_mlnx_ib3): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:d3:00.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:01.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}
I0820 02:23:13.871061 1 server.go:170] ListAndWatch(gpu2_mlnx_ib2): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:9d:00.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:01.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:00.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:00.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:00.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}
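Since the issue is about the number of devices changing across restarts, it can help to reduce a log like the one above to a per-pool device list and compare two runs. The following Python is only a sketch: it relies on the `Creating new ResourcePool:` and `device added:` message formats exactly as they appear in this log, and the function name `devices_per_pool` is made up for illustration. The sample is a few lines copied from the log.

```python
import re
from collections import defaultdict

# Patterns based on the message text seen in the log in this thread.
POOL_RE = re.compile(r"Creating new ResourcePool: (\S+)")
ADDED_RE = re.compile(r"device added: \[identifier: ([0-9a-f:.]+),")

SAMPLE_LOG = """\
I0820 02:23:13.134732 1 manager.go:121] Creating new ResourcePool: gpu2_mlnx_ib2
I0820 02:23:13.148659 1 factory.go:124] device added: [identifier: 0000:9d:00.1, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.148662 1 factory.go:124] device added: [identifier: 0000:9d:00.2, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:23:13.148692 1 manager.go:121] Creating new ResourcePool: gpu2_mlnx_ib3
I0820 02:23:13.162414 1 factory.go:124] device added: [identifier: 0000:d3:00.1, vendor: 15b3, device: 101e, driver: mlx5_core]
"""


def devices_per_pool(log_text: str) -> dict[str, list[str]]:
    """Group 'device added' PCI addresses under the most recent ResourcePool line."""
    pools: dict[str, list[str]] = defaultdict(list)
    current = None
    for line in log_text.splitlines():
        pool_match = POOL_RE.search(line)
        if pool_match:
            current = pool_match.group(1)
            pools[current]  # record the pool even if no device is added to it
            continue
        added_match = ADDED_RE.search(line)
        if added_match and current is not None:
            pools[current].append(added_match.group(1))
    return dict(pools)


if __name__ == "__main__":
    for pool, devices in devices_per_pool(SAMPLE_LOG).items():
        print(f"{pool}: {len(devices)} device(s) {devices}")
```

Running it on the full log from before and after a restart and diffing the two dictionaries would show directly which pool lost or gained VFs.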
Second, here's the log from `sriov-device-plugin` when a pod that uses a device is started.
I0820 02:24:47.022655 1 server.go:125] Allocate() called with &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[0000:9d:00.7],},},}
I0820 02:24:47.022684 1 pool_stub.go:108] GetEnvs(): for devices: [0000:9d:00.7]
I0820 02:24:47.022729 1 netResourcePool.go:49] GetDeviceSpecs(): for devices: [0000:9d:00.7]
I0820 02:24:47.022737 1 pool_stub.go:141] GetMounts(): for devices: [0000:9d:00.7]
I0820 02:24:47.022740 1 server.go:151] AllocateResponse send: &AllocateResponse{ContainerResponses:[]*ContainerAllocateResponse{&ContainerAllocateResponse{Envs:map[string]string{PCIDEVICE_OPENSHIFT_IO_GPU2_MLNX_IB2: 0000:9d:00.7,PCIDEVICE_OPENSHIFT_IO_GPU2_MLNX_IB2_INFO: {"0000:9d:00.7":{"generic":{"deviceID":"0000:9d:00.7"},"rdma":{"issm":"/dev/infiniband/issm12","rdma_cm":"/dev/infiniband/rdma_cm","umad":"/dev/infiniband/umad12","uverbs":"/dev/infiniband/uverbs12"}}},},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/infiniband/issm12,HostPath:/dev/infiniband/issm12,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/infiniband/umad12,HostPath:/dev/infiniband/umad12,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/infiniband/uverbs12,HostPath:/dev/infiniband/uverbs12,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/infiniband/rdma_cm,HostPath:/dev/infiniband/rdma_cm,Permissions:rw,},},Annotations:map[string]string{},CDIDevices:[]*CDIDevice{},},},}
I0820 02:24:47.023028 1 server.go:125] Allocate() called with &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[0000:d3:01.0],},},}
I0820 02:24:47.023070 1 pool_stub.go:108] GetEnvs(): for devices: [0000:d3:01.0]
I0820 02:24:47.023106 1 netResourcePool.go:49] GetDeviceSpecs(): for devices: [0000:d3:01.0]
I0820 02:24:47.023115 1 pool_stub.go:141] GetMounts(): for devices: [0000:d3:01.0]
I0820 02:24:47.023121 1 server.go:151] AllocateResponse send: &AllocateResponse{ContainerResponses:[]*ContainerAllocateResponse{&ContainerAllocateResponse{Envs:map[string]string{PCIDEVICE_OPENSHIFT_IO_GPU2_MLNX_IB3: 0000:d3:01.0,PCIDEVICE_OPENSHIFT_IO_GPU2_MLNX_IB3_INFO: {"0000:d3:01.0":{"generic":{"deviceID":"0000:d3:01.0"},"rdma":{"issm":"/dev/infiniband/issm21","rdma_cm":"/dev/infiniband/rdma_cm","umad":"/dev/infiniband/umad21","uverbs":"/dev/infiniband/uverbs21"}}},},Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/infiniband/issm21,HostPath:/dev/infiniband/issm21,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/infiniband/umad21,HostPath:/dev/infiniband/umad21,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/infiniband/uverbs21,HostPath:/dev/infiniband/uverbs21,Permissions:rw,},&DeviceSpec{ContainerPath:/dev/infiniband/rdma_cm,HostPath:/dev/infiniband/rdma_cm,Permissions:rw,},},Annotations:map[string]string{},CDIDevices:[]*CDIDevice{},},},}
Third, here's the log from `sriov-device-plugin` when it restarts.
I0820 02:25:23.763525 1 main.go:87] Received signal "terminated", shutting down.
I0820 02:25:23.764141 1 server.go:308] stopping gpu2_mlnx_ib2 device plugin server...
I0820 02:25:23.764178 1 server.go:182] ListAndWatch(gpu2_mlnx_ib2): terminate signal received
I0820 02:25:23.764504 1 server.go:308] stopping gpu2_mlnx_ib3 device plugin server...
I0820 02:25:23.764859 1 server.go:182] ListAndWatch(gpu2_mlnx_ib3): terminate signal received
--- restarts ---
I0820 02:25:24.945536 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0820 02:25:24.945578 1 main.go:46] resource manager reading configs
I0820 02:25:24.945598 1 manager.go:86] raw ResourceList: {"resourceList":[{"resourceName":"gpu2_mlnx_ib2","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp157s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"gpu2_mlnx_ib3","selectors":{"vendors":["15b3"],"devices":["101e"],"pfNames":["ibp211s0"],"linkTypes":["infiniband"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}
I0820 02:25:24.945656 1 factory.go:211] *types.NetDeviceSelectors for resource gpu2_mlnx_ib2 is [0xc0004a17a0]
I0820 02:25:24.945666 1 factory.go:211] *types.NetDeviceSelectors for resource gpu2_mlnx_ib3 is [0xc0004a1b00]
I0820 02:25:24.945669 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:gpu2_mlnx_ib2 DeviceType:netDevice ExcludeTopology:false Selectors:0xc00032e348 AdditionalInfo:map[] SelectorObjs:[0xc0004a17a0]} {ResourcePrefix: ResourceName:gpu2_mlnx_ib3 DeviceType:netDevice ExcludeTopology:false Selectors:0xc00032e360 AdditionalInfo:map[] SelectorObjs:[0xc0004a1b00]}]
I0820 02:25:24.945701 1 manager.go:217] validating resource name "openshift.io/gpu2_mlnx_ib2"
I0820 02:25:24.945712 1 manager.go:217] validating resource name "openshift.io/gpu2_mlnx_ib3"
I0820 02:25:24.945714 1 main.go:62] Discovering host devices
I0820 02:25:25.031561 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:2a:00.0 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:25:25.031592 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:2a:00.1 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:25:25.031596 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:2a:00.2 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:25:25.031599 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:2a:00.3 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:25:25.031602 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:41:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.031605 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:54:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.031611 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.031614 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.1 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031617 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.2 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031619 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.3 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031621 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.4 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031623 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.5 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031625 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.6 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031629 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:00.7 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031630 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:9d:01.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031633 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:c1:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.031635 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.031637 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.1 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031639 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.2 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031641 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.3 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031644 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.4 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031645 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.5 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031648 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.6 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031650 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:00.7 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031653 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d3:01.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.031655 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:e5:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.031660 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:2a:00.0 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:25:25.031871 1 utils.go:494] excluding interface eno12399: default route found: {Ifindex: 2 Dst: <nil> Src: <nil> Gw: 10.113.240.1 Flags: [] Table: 254 Realm: 0}
I0820 02:25:25.031893 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:2a:00.1 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:25:25.032001 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:2a:00.2 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:25:25.032102 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:2a:00.3 02 Intel Corporation Ethernet Controller X710 for 10GBASE-T
I0820 02:25:25.032184 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:41:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.032296 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:54:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.032404 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.034023 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.1 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.034118 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.2 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.034209 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.3 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.034309 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.4 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.034408 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.5 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.034497 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.6 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.034584 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:00.7 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.034599 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:9d:01.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.034688 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:c1:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.034782 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.036327 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.1 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.036415 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.2 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.036514 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.3 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.036596 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.4 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.036673 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.5 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.036757 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.6 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.036838 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:00.7 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.036907 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d3:01.0 02 Mellanox Technolo... ConnectX Family mlx5Gen Virtual Function
I0820 02:25:25.036922 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:e5:00.0 02 Mellanox Technolo... MT2910 Family [ConnectX-7]
I0820 02:25:25.037021 1 main.go:68] Initializing resource servers
I0820 02:25:25.037026 1 manager.go:117] number of config: 2
I0820 02:25:25.037033 1 manager.go:121] Creating new ResourcePool: gpu2_mlnx_ib2
I0820 02:25:25.037036 1 manager.go:122] DeviceType: netDevice
W0820 02:25:25.037049 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.1 not found. Are RDMA modules loaded?
W0820 02:25:25.037222 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.2 not found. Are RDMA modules loaded?
W0820 02:25:25.037341 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.3 not found. Are RDMA modules loaded?
I0820 02:25:25.039103 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.1. <nil>
I0820 02:25:25.039869 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.2. <nil>
I0820 02:25:25.040615 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.3. <nil>
I0820 02:25:25.041274 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.4. <nil>
I0820 02:25:25.042018 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.5. <nil>
I0820 02:25:25.042695 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.6. <nil>
W0820 02:25:25.042971 1 pciNetDevice.go:74] RDMA resources for 0000:9d:00.7 not found. Are RDMA modules loaded?
I0820 02:25:25.043028 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.7. <nil>
I0820 02:25:25.044580 1 pciNetDevice.go:106] getPKey(): unable to get PKey for device 0000:9d:00.7 : "infiniband directory is empty for device: 0000:9d:00.7"
I0820 02:25:25.044978 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:01.0. <nil>
I0820 02:25:25.046205 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.1. <nil>
I0820 02:25:25.046720 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.2. <nil>
I0820 02:25:25.047275 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.3. <nil>
I0820 02:25:25.047925 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.4. <nil>
I0820 02:25:25.048634 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.5. <nil>
I0820 02:25:25.049280 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.6. <nil>
I0820 02:25:25.050058 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.7. <nil>
W0820 02:25:25.050348 1 pciNetDevice.go:74] RDMA resources for 0000:d3:01.0 not found. Are RDMA modules loaded?
I0820 02:25:25.050405 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:01.0. <nil>
I0820 02:25:25.052074 1 pciNetDevice.go:106] getPKey(): unable to get PKey for device 0000:d3:01.0 : "infiniband directory is empty for device: 0000:d3:01.0"
I0820 02:25:25.052681 1 manager.go:138] initServers(): selector index 0 will register 7 devices
I0820 02:25:25.052688 1 factory.go:124] device added: [identifier: 0000:9d:00.1, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.052691 1 factory.go:124] device added: [identifier: 0000:9d:00.2, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.052694 1 factory.go:124] device added: [identifier: 0000:9d:00.3, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.052696 1 factory.go:124] device added: [identifier: 0000:9d:00.4, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.052698 1 factory.go:124] device added: [identifier: 0000:9d:00.5, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.052700 1 factory.go:124] device added: [identifier: 0000:9d:00.6, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.052702 1 factory.go:124] device added: [identifier: 0000:9d:01.0, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.052713 1 manager.go:156] New resource server is created for gpu2_mlnx_ib2 ResourcePool
I0820 02:25:25.052717 1 manager.go:121] Creating new ResourcePool: gpu2_mlnx_ib3
I0820 02:25:25.052719 1 manager.go:122] DeviceType: netDevice
W0820 02:25:25.052728 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.1 not found. Are RDMA modules loaded?
W0820 02:25:25.052875 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.2 not found. Are RDMA modules loaded?
W0820 02:25:25.052991 1 pciNetDevice.go:74] RDMA resources for 0000:2a:00.3 not found. Are RDMA modules loaded?
I0820 02:25:25.054482 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.1. <nil>
I0820 02:25:25.055114 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.2. <nil>
.....
W0820 02:25:25.058073 1 pciNetDevice.go:74] RDMA resources for 0000:9d:00.7 not found. Are RDMA modules loaded?
I0820 02:25:25.058129 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:9d:00.7. <nil>
I0820 02:25:25.059696 1 pciNetDevice.go:106] getPKey(): unable to get PKey for device 0000:9d:00.7 : "infiniband directory is empty for device: 0000:9d:00.7"
.....
I0820 02:25:25.066066 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.6. <nil>
I0820 02:25:25.066898 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:00.7. <nil>
W0820 02:25:25.067208 1 pciNetDevice.go:74] RDMA resources for 0000:d3:01.0 not found. Are RDMA modules loaded?
I0820 02:25:25.067271 1 utils.go:82] Devlink query for eswitch mode is not supported for device 0000:d3:01.0. <nil>
I0820 02:25:25.069033 1 pciNetDevice.go:106] getPKey(): unable to get PKey for device 0000:d3:01.0 : "infiniband directory is empty for device: 0000:d3:01.0"
I0820 02:25:25.069706 1 manager.go:138] initServers(): selector index 0 will register 7 devices
I0820 02:25:25.069716 1 factory.go:124] device added: [identifier: 0000:d3:00.1, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.069720 1 factory.go:124] device added: [identifier: 0000:d3:00.2, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.069723 1 factory.go:124] device added: [identifier: 0000:d3:00.3, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.069725 1 factory.go:124] device added: [identifier: 0000:d3:00.4, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.069727 1 factory.go:124] device added: [identifier: 0000:d3:00.5, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.069730 1 factory.go:124] device added: [identifier: 0000:d3:00.6, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.069733 1 factory.go:124] device added: [identifier: 0000:d3:00.7, vendor: 15b3, device: 101e, driver: mlx5_core]
I0820 02:25:25.069751 1 manager.go:156] New resource server is created for gpu2_mlnx_ib3 ResourcePool
I0820 02:25:25.069756 1 main.go:74] Starting all servers...
I0820 02:25:25.069999 1 server.go:254] starting gpu2_mlnx_ib2 device plugin endpoint at: openshift.io_gpu2_mlnx_ib2.sock
I0820 02:25:25.070288 1 server.go:254] starting gpu2_mlnx_ib3 device plugin endpoint at: openshift.io_gpu2_mlnx_ib3.sock
I0820 02:25:25.070310 1 main.go:79] All servers started.
I0820 02:25:25.070315 1 main.go:80] Listening for term signals
I0820 02:25:25.953494 1 server.go:157] ListAndWatch(gpu2_mlnx_ib3) invoked
I0820 02:25:25.953515 1 server.go:116] Plugin: openshift.io_gpu2_mlnx_ib3.sock gets registered successfully at Kubelet
I0820 02:25:25.953527 1 server.go:157] ListAndWatch(gpu2_mlnx_ib2) invoked
I0820 02:25:25.953533 1 server.go:116] Plugin: openshift.io_gpu2_mlnx_ib2.sock gets registered successfully at Kubelet
I0820 02:25:25.953523 1 server.go:170] ListAndWatch(gpu2_mlnx_ib3): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:d3:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.7,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:d3:00.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}
I0820 02:25:25.953557 1 server.go:170] ListAndWatch(gpu2_mlnx_ib2): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:9d:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:00.6,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:01.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:00.2,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},&Device{ID:0000:9d:00.3,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:1,},},},},},}
After sriov-device-plugin restarts, kubelet reports the wrong number for Capacity and Allocatable: 7, where 8 is expected.
Capacity:
.....
nvidia.com/gpu: 8
openshift.io/gpu2_mlnx_ib2: 7
openshift.io/gpu2_mlnx_ib3: 7
Allocatable:
.....
nvidia.com/gpu: 8
openshift.io/gpu2_mlnx_ib2: 7
openshift.io/gpu2_mlnx_ib3: 7
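To cross-check the numbers kubelet reports against what the plugin actually registered, the plugin log can be tallied per pool. The following is a minimal Python sketch and not part of any of the tools involved. It assumes the two log line formats shown above ("Creating new ResourcePool: ..." and "device added: [identifier: ...") and the function name registered_devices is made up for illustration.

```python
import re

# Patterns assumed from the device plugin log lines quoted in this thread.
POOL_RE = re.compile(r"Creating new ResourcePool: (\S+)")
ADDED_RE = re.compile(r"device added: \[identifier: ([0-9a-fA-F:.]+),")


def registered_devices(log_text):
    """Map each resource pool name to the PCI addresses registered for it."""
    pools = {}
    current = None
    for line in log_text.splitlines():
        pool_match = POOL_RE.search(line)
        if pool_match:
            # Every "device added" line that follows belongs to this pool.
            current = pool_match.group(1)
            pools.setdefault(current, [])
            continue
        added_match = ADDED_RE.search(line)
        if added_match and current is not None:
            pools[current].append(added_match.group(1))
    return pools


if __name__ == "__main__":
    import sys

    # Usage: kubectl logs $PODNAME | python tally.py
    for pool, devices in registered_devices(sys.stdin.read()).items():
        print(pool, len(devices), " ".join(devices))
```

Run against the log above, this should list seven addresses per pool, with 0000:9d:00.7 missing from gpu2_mlnx_ib2 and 0000:d3:01.0 missing from gpu2_mlnx_ib3, which matches the "RDMA resources ... not found" warnings for those two VFs.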
I hope this helps!
@jslouisyou i see the following after kubelet restart in device plugin logs:
W0820 02:25:25.058073 1 pciNetDevice.go:74] RDMA resources for 0000:9d:00.7 not found. Are RDMA modules loaded?
W0820 02:25:25.067208 1 pciNetDevice.go:74] RDMA resources for 0000:d3:01.0 not found. Are RDMA modules loaded?
are these VFs currently assigned to pods ?
what is the SriovIBNetwork you have defined ? can you also provide the matching network-attachment-definition used for the workloads pods ?
on the worker node, can you run the following command as root: rdma system
what is the output ?
Hi @adrianchiris!
When the sriov-device-plugin pod restarts, I can see that the network interfaces are attached in the Pods. Here's the result of ifconfig in a Pod:
net3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 4092
        inet 192.168.224.2 netmask 255.255.240.0 broadcast 192.168.239.255
        inet6 fe80::1281:33fc:ce4f:96e prefixlen 64 scopeid 0x20<link>
        unspec 00-00-01-AF-FE-80-00-00-00-00-00-00-00-00-00-00 txqueuelen 256 (UNSPEC)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 11 bytes 852 (852.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

net4: flags=4099<UP,BROADCAST,MULTICAST> mtu 4092
        inet 192.168.240.2 netmask 255.255.240.0 broadcast 192.168.255.255
        unspec 00-00-01-6F-FE-80-00-00-00-00-00-00-00-00-00-00 txqueuelen 256 (UNSPEC)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
and this is sample Pod manifest:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sriov-test
  labels:
    app: sriov-test
spec:
  selector:
    matchLabels:
      app: sriov-test
  template:
    metadata:
      labels:
        app: sriov-test
      annotations:
        k8s.v1.cni.cncf.io/networks: '[ {"name": "sriov-gpu2-ib2", "interface": "net3"}, {"name": "sriov-gpu2-ib3", "interface": "net4"} ]'
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
      tolerations:
BTW, there aren't any resources for sriovibnetworks.sriovnetwork.openshift.io. Are these essential? It's been working fine without this until now.
$ k get sriovibnetworks.sriovnetwork.openshift.io -A
No resources found
Here are the network-attachment-definition resources for the above 2 IB VFs:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/resourceName: openshift.io/gpu2_mlnx_ib2
    name: sriov-gpu2-ib2
    namespace: default
  spec:
    config: |-
      {
        "cniVersion": "0.3.1",
        "name": "sriov_gpu2_ib2",
        "plugins": [
          {
            "type": "ib-sriov",
            "link_state": "enable",
            "rdmaIsolation": true,
            "ibKubernetesEnabled": false,
            "ipam": {
              "datastore": "kubernetes",
              "kubernetes": {
                "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
              },
              "log_file": "/tmp/whereabouts.log",
              "log_level": "debug",
              "type": "whereabouts",
              "range": "192.168.224.0/20"
            }
          }
        ]
      }
---
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/resourceName: openshift.io/gpu2_mlnx_ib3
    name: sriov-gpu2-ib3
    namespace: default
  spec:
    config: |-
      {
        "cniVersion": "0.3.1",
        "name": "sriov_gpu2_ib3",
        "plugins": [
          {
            "type": "ib-sriov",
            "link_state": "enable",
            "rdmaIsolation": true,
            "ibKubernetesEnabled": false,
            "ipam": {
              "datastore": "kubernetes",
              "kubernetes": {
                "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
              },
              "log_file": "/tmp/whereabouts.log",
              "log_level": "debug",
              "type": "whereabouts",
              "range": "192.168.240.0/20"
            }
          }
        ]
      }
Output of rdma system on the worker nodes (same results on each):
$ rdma system
netns exclusive copy-on-fork on
BTW There aren't any resources for sriovibnetworks.sriovnetwork.openshift.io. Are these essential? It's been working fine without this until now.
if you define the network-attachment-definition separately, it's not required.
i see the system is configured with rdma system exclusive mode so sriov-device-plugin will not find rdma resources that are assigned to container -> it will not register that device (VF) to the pool.
we will need to modify device plugin to handle this case.
devlink dev param: check if the enable_rdma value is true <- my preference, as it could be possible to disable RDMA for specific devices
for now, perhaps use rdma in shared mode if possible.
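The effect described above can be illustrated with a small model. This is a simplified Python sketch, not the plugin's actual code: the VF class, the in_use_by_pod flag and the select_for_pool function are hypothetical names, and the only behaviour modelled is the one stated in this thread (with IsRdma set, a VF whose RDMA resources are not visible from the host is not registered to the pool).

```python
from dataclasses import dataclass


@dataclass
class VF:
    pci_address: str
    in_use_by_pod: bool  # True once the VF has been moved into a pod's network namespace


def rdma_visible_on_host(vf, netns_mode):
    """In 'exclusive' mode an RDMA device is visible only in the netns that owns it;
    in 'shared' mode it stays visible from every netns."""
    if netns_mode == "shared":
        return True
    return not vf.in_use_by_pod


def select_for_pool(vfs, is_rdma, netns_mode):
    """Toy selector: when is_rdma is requested, keep only VFs whose RDMA
    resources can be seen from the host namespace."""
    if not is_rdma:
        return list(vfs)
    return [vf for vf in vfs if rdma_visible_on_host(vf, netns_mode)]


# Eight VFs on one PF, one of them currently attached to a pod.
addresses = [f"0000:9d:00.{i}" for i in range(1, 8)] + ["0000:9d:01.0"]
vfs = [VF(addr, in_use_by_pod=(addr == "0000:9d:00.7")) for addr in addresses]

print(len(select_for_pool(vfs, is_rdma=True, netns_mode="exclusive")))  # 7
print(len(select_for_pool(vfs, is_rdma=True, netns_mode="shared")))     # 8
```

In this model the pool shrinks by exactly the number of VFs in use at the time the plugin restarts, which is consistent with the Capacity of 7 reported earlier.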
Thanks @adrianchiris for your quick response!
The RDMA mode wasn't configured by me, and given that a large number of GPU devices are currently using RDMA, it would be challenging to change the mode without complications.
At this point, can I consider this issue as occurred from sriov-device-plugin, right?
At this point, can I consider this issue as occurred from sriov-device-plugin right?
yes
Thanks!
It might be very early to ask, but are there any future plans to resolve this issue?
yes it will be addressed in the near future. i dont have an ETA atm
I will take a look at it.
Hi @rollandf any update on this one?
Thanks @rollandf for resolving this issue! @SchSeba Is there any plan for next release version?
Hi @jslouisyou, do you only need the sriov-network-device-plugin, or do you use it via the sriov-network-operator?
Hi @SchSeba, I'm using sriov-network-device-plugin along with sriov-network-operator, but I can use the fix by upgrading the sriov-network-device-plugin only.
Hi @jslouisyou, here is the new tag https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/releases/tag/v3.8.0
What happened?
The node's Capacity and Allocatable numbers are wrong after restarting sriov-network-device-plugin if any pods have SR-IOV IB VFs attached.
What did you expect to happen?
openshift.io/gpu_mlnx_ib# should be 8 in all VFs.
What are the minimal steps needed to reproduce the bug?
sriov-network-operator version v1.2.0
sriov-device-plugin daemonset
Capacity and Allocatable shows full capacity or not
Anything else we need to know?
Several related issues have already been raised and commits pushed, but it seems this issue is not fixed yet. See https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/issues/276 and https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/issues/521.
After restarting sriov-device-plugin, kubelet says that sriov-device-plugin pushed its state like below:
Even after I changed the image version of all components to latest, this issue still occurs.
I'm using A100 and H100 nodes.
Component Versions
Please fill in the below table with the version numbers of components used.
sriovCni: v2.6.3 and ibSriovCni: v1.0.2)
Config Files
Config file locations may be config dependent.
Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')
CNI config (Try '/etc/cni/net.d/')
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
Kubeconfig file
SR-IOV Network Custom Resource Definition
Logs
SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)
Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)