k8snetworkplumbingwg / sriov-network-device-plugin

SRIOV network device plugin for Kubernetes
Apache License 2.0

sriov-network-device-plugin can't expose resource in node #586

Open jeffreyyjp opened 2 months ago

jeffreyyjp commented 2 months ago

What happened?

After deploying this plugin, I can't see the SR-IOV resource on my nodes. The plugin also doesn't seem to connect to kubelet: the logs below never contain a line like `Plugin: mellanox.com/mlnx_sriov_rdma gets registered successfully at Kubelet`.

What did you expect to happen?

The SR-IOV resource (mellanox.com/mlnx_sriov_rdma) should show up on the node.

What are the minimal steps needed to reproduce the bug?

  1. Configure SR-IOV on the nodes
  2. Deploy SR-IOV CNI
  3. Deploy sriov-network-device-plugin
  4. Deploy Multus CNI

Anything else we need to know?

Component Versions

Please fill in the below table with the version numbers of components used.

| Component | Version |
| --- | --- |
| SR-IOV Network Device Plugin | v3.7.0 |
| SR-IOV CNI Plugin | v2.8.0 |
| Multus | v4.1.0 |
| Kubernetes | 1.23.6 |
| OS | CentOS 8.2.2004 |

Config Files

Config file locations may be config dependent.

Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')
CNI config (Try '/etc/cni/net.d/')
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
Kubeconfig file
SR-IOV Network Custom Resource Definition

Logs

SR-IOV Network Device Plugin Logs (use kubectl logs $PODNAME)

I0815 07:35:27.139922 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0815 07:35:27.140181 1 main.go:46] resource manager reading configs
I0815 07:35:27.140209 1 manager.go:86] raw ResourceList: { "resourceList": [ { "resourceName": "mlnx_sriov_rdma", "resourcePrefix": "mellanox.com", "selectors": { "vendors": ["15b3"], "devices": ["101c"], "driver": "mlx5_core", "isRdma": true } } ] }
I0815 07:35:27.140303 1 factory.go:211] *types.NetDeviceSelectors for resource mlnx_sriov_rdma is [0xc00023f0e0]
I0815 07:35:27.140315 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox.com ResourceName:mlnx_sriov_rdma DeviceType:netDevice ExcludeTopology:false Selectors:0xc000190b28 AdditionalInfo:map[] SelectorObjs:[0xc00023f0e0]}]
I0815 07:35:27.140347 1 manager.go:217] validating resource name "mellanox.com/mlnx_sriov_rdma"
I0815 07:35:27.140354 1 main.go:62] Discovering host devices
I0815 07:35:28.124721 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.127282 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127429 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127538 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127629 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127721 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.129909 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130018 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130121 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130213 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130335 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:68:00.0 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.130477 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:68:00.1 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.130597 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.132753 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.132855 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.132941 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.133047 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.133140 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135266 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135361 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135464 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135548 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135638 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135656 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135661 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135666 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135673 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135679 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135684 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135690 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135694 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135699 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135703 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:68:00.0 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.135708 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:68:00.1 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.135713 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135720 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135726 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135730 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135735 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135742 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135747 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135752 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135757 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135761 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135765 1 main.go:68] Initializing resource servers
I0815 07:35:28.135772 1 manager.go:117] number of config: 1
I0815 07:35:28.135785 1 manager.go:121] Creating new ResourcePool: mlnx_sriov_rdma
I0815 07:35:28.135789 1 manager.go:122] DeviceType: netDevice
W0815 07:35:28.149419 1 pciNetDevice.go:74] RDMA resources for 0000:68:00.0 not found. Are RDMA modules loaded?
W0815 07:35:28.149783 1 pciNetDevice.go:74] RDMA resources for 0000:68:00.1 not found. Are RDMA modules loaded?
I0815 07:35:28.156081 1 manager.go:138] initServers(): selector index 0 will register 16 devices
I0815 07:35:28.156097 1 factory.go:124] device added: [identifier: 0000:05:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156105 1 factory.go:124] device added: [identifier: 0000:05:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156110 1 factory.go:124] device added: [identifier: 0000:05:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156115 1 factory.go:124] device added: [identifier: 0000:05:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156118 1 factory.go:124] device added: [identifier: 0000:47:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156122 1 factory.go:124] device added: [identifier: 0000:47:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156126 1 factory.go:124] device added: [identifier: 0000:47:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156130 1 factory.go:124] device added: [identifier: 0000:47:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156135 1 factory.go:124] device added: [identifier: 0000:8e:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156139 1 factory.go:124] device added: [identifier: 0000:8e:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156143 1 factory.go:124] device added: [identifier: 0000:8e:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156146 1 factory.go:124] device added: [identifier: 0000:8e:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156150 1 factory.go:124] device added: [identifier: 0000:d2:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156154 1 factory.go:124] device added: [identifier: 0000:d2:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156158 1 factory.go:124] device added: [identifier: 0000:d2:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156162 1 factory.go:124] device added: [identifier: 0000:d2:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156191 1 manager.go:156] New resource server is created for mlnx_sriov_rdma ResourcePool
I0815 07:35:28.156199 1 main.go:74] Starting all servers...
I0815 07:35:28.156803 1 server.go:254] starting mlnx_sriov_rdma device plugin endpoint at: mellanox.com_mlnx_sriov_rdma.sock
I0815 07:35:28.156947 1 main.go:79] All servers started.
I0815 07:35:28.156954 1 main.go:80] Listening for term signals

Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)

rollandf commented 2 months ago

Hey, can you check the kubelet logs?

rollandf commented 2 months ago

Also, is kubelet service defined with a --root-dir param?
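
One way to answer this on the node is to read the kubelet command line. The helper below is a minimal sketch: `kubelet_root_dir` is a hypothetical name, and it assumes the flag is written either as `--root-dir=PATH` or as `--root-dir PATH`. When the flag is absent it prints the kubelet default, `/var/lib/kubelet`.

```shell
# Hypothetical helper: print the kubelet root directory found in the text on stdin.
kubelet_root_dir() {
  # Put every whitespace-separated token on its own line, then look for the flag.
  tr -s '[:space:]' '\n' | awk '
    /^--root-dir=/       { sub(/^--root-dir=/, ""); dir = $0 }
    prev == "--root-dir" { dir = $0 }
    { prev = $0 }
    END { print (dir == "" ? "/var/lib/kubelet" : dir) }
  '
}

# Usage on a node (assumes a systemd unit named kubelet):
#   systemctl cat kubelet | kubelet_root_dir
```

Any output other than `/var/lib/kubelet` suggests the device plugin's default host paths will not line up with the directory kubelet is watching.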

jeffreyyjp commented 2 months ago

@rollandf Below is the kubelet.service

Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes

[Service]
WorkingDirectory=/var/lib/kubelet
ExecStartPre=/etc/kubernetes/kubelet-precheck.sh
ExecStart=/usr/bin/kubelet-1.23.6 \
  --kubeconfig=/etc/kubernetes/admin.kubeconfig \
  --config=/etc/kubernetes/kubelet-config.yaml \
  --hostname-override=10.32.13.1 \
   \
  --container-runtime=remote \
  --runtime-request-timeout=15m \
  --container-runtime-endpoint=unix:///run/containerd/containerd.sock \
   \
  --network-plugin=cni \
  --root-dir=/data/kubelet \
  --v=2 \

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

jeffreyyjp commented 2 months ago

Hey, can you check the kubelet logs?

How do I filter the logs to find the useful messages?
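
A minimal sketch of one way to narrow the kubelet journal down to plugin registration messages. `filter_plugin_lines` is a hypothetical helper, and the search patterns are assumptions based on the resource name and socket directories mentioned in this issue; adjust them for a different resource.

```shell
# Hypothetical helper: keep only kubelet log lines that mention plugin
# registration paths or the resource name used in this issue.
filter_plugin_lines() {
  grep -E 'plugins_registry|device-plugins|mlnx_sriov_rdma'
}

# Usage on a node:
#   journalctl -u kubelet --no-pager | filter_plugin_lines
```

If nothing is printed, kubelet has probably never seen the plugin's socket, which points at the watched directory rather than at the resource config.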

rollandf commented 2 months ago

The issue seems to be the use of root-dir. See a similar discussion here: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/issues/96

BTW, how did you install the cluster? Did you configure the root-dir yourself, or is it a default?

jeffreyyjp commented 2 months ago

@rollandf Another question, I want to know if my configmap.yaml is fine for this plugin.

rollandf commented 2 months ago

At first glance, it seems OK. Do you know where the root-dir definition comes from?

jeffreyyjp commented 2 months ago

@rollandf root-dir is defined by the parameters passed when kubelet is installed. I guess kubelet only watches the plugins_registry directory under its root-dir, so I don't get any kubelet log line like "Registering plugin at endpoint" plugin="mellanox.com/mlnx_sriov_rdma" endpoint="/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock". By the way, my root-dir is /data/kubelet.

rollandf commented 2 months ago

What is parms? Any links?

rollandf commented 2 months ago

For now, try to mount to the new root /data/kubelet in the deployment yaml:

/data/kubelet/device-plugins here: https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/master/deployments/sriovdp-daemonset.yaml#L62

/data/kubelet/plugins_registry here: https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/master/deployments/sriovdp-daemonset.yaml#L65
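
A minimal sketch of that change as a stream edit over the manifest. It assumes the host directories are declared as `path: /var/lib/kubelet/...` lines under `hostPath`, and it leaves `mountPath` entries untouched on the assumption that the plugin keeps using the default path inside the container. `rewrite_host_paths` is a hypothetical helper, not part of the project.

```shell
# Hypothetical helper: point hostPath entries at a non-default kubelet root.
# Usage: rewrite_host_paths NEW_ROOT < manifest.yaml
rewrite_host_paths() {
  new_root=$1
  # Only lines of the form "path: /var/lib/kubelet/..." are rewritten;
  # "mountPath: ..." lines do not match because of the anchored prefix.
  sed -E "s#^([[:space:]]*path:[[:space:]]*)/var/lib/kubelet/#\1${new_root}/#"
}

# Usage (review the output before applying it):
#   rewrite_host_paths /data/kubelet < sriovdp-daemonset.yaml
```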

jeffreyyjp commented 2 months ago

@rollandf Sure, but I still need to run ln -s /data/kubelet/plugins_registry /var/lib/kubelet/plugins_registry, and I can't find the reason. My guess is that the device plugin tells kubelet that the plugins_registry is in `/var/lib/kubelet`, which is hardcoded in the container, while the real device plugin socket is under the root-dir (/data/kubelet).

Aug 16 14:45:27 10.32.13.1 kubelet-1.23.6[587969]: I0816 14:45:27.456142  587969 manager.go:325] "Registering plugin at endpoint" plugin="mellanox.com/mlnx_sriov_rdma" endpoint="/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock"
Aug 16 14:45:27 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:27.456307  587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:28 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:28.456559  587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:30 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:30.045068  587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:32 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:32.244564  587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:36 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:36.863409  587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:37 10.32.13.1 kubelet-1.23.6[587969]: E0816 14:45:37.456842  587969 endpoint.go:63] "Can't create new endpoint with socket path" err="failed to dial device plugin: context deadline exceeded" path="/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock"
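
The log above shows kubelet dialing a socket path that does not exist on the host. A minimal sketch for confirming that from the journal: both helpers are hypothetical, and the pattern assumes the `"Registering plugin at endpoint"` message format shown in these logs.

```shell
# Hypothetical helper: list the distinct endpoint paths kubelet tried to register.
registered_endpoints() {
  sed -n 's/.*"Registering plugin at endpoint".* endpoint="\([^"]*\)".*/\1/p' | sort -u
}

# Hypothetical helper: report whether each path read from stdin is a socket on this host.
check_sockets() {
  while IFS= read -r path; do
    if [ -S "$path" ]; then
      echo "ok      $path"
    else
      echo "missing $path"
    fi
  done
}

# Usage on a node:
#   journalctl -u kubelet --no-pager | registered_endpoints | check_sockets
```

A `missing` line for a path under `/var/lib/kubelet` while the socket exists under the custom root-dir would match the mismatch described in this comment.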
SchSeba commented 2 months ago

Hi @jeffreyyjp can't you update the volume mount for the device plugin container?

adrianchiris commented 2 months ago

I have set the "enhancement" label on this issue, since the device plugin has never supported an alternative kubelet root dir.

After seeing @jeffreyyjp's latest comment, I believe it's not enough to update the mounts.

I believe this is because of how we do plugin registration. We set the endpoint to the path within the container in the PluginInfo message, which is part of the GetInfo call. See [1] and [2].

[1] https://github.com/kubernetes/kubernetes/blob/cb7b4ea648a97bdbf8f4f1b8655a7a110c9f78d0/staging/src/k8s.io/kubelet/pkg/apis/pluginregistration/v1/api.proto#L31
[2] https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/7dedc64cd1b89275059f33d7a2ecae9e03388e79/pkg/resources/server.go#L107

I think that if we leave the Endpoint field unset, Kubernetes will use the same socket path as it did for registration.

jeffreyyjp commented 2 months ago

@SchSeba I already updated the host path in my volume mounts, but I also need to run ln -s /data/kubelet/plugins_registry /var/lib/kubelet/plugins_registry on my host (not in the container). After that, everything works.
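
The symlink workaround can be sketched as a small guarded helper. `link_plugins_registry` is a hypothetical name; the sketch assumes it runs as root on the host and refuses to touch an existing file, directory, or link at the default location.

```shell
# Hypothetical helper: make DEFAULT_DIR/plugins_registry resolve to
# ROOT_DIR/plugins_registry via a symlink.
# Usage: link_plugins_registry /data/kubelet /var/lib/kubelet
link_plugins_registry() {
  root_dir=$1
  default_dir=$2
  target="$root_dir/plugins_registry"
  link="$default_dir/plugins_registry"

  # Do not overwrite anything that is already there.
  if [ -e "$link" ] || [ -L "$link" ]; then
    echo "refusing to replace existing $link" >&2
    return 1
  fi
  mkdir -p "$default_dir" && ln -s "$target" "$link"
}
```

This only papers over the path mismatch on one node; it has to be repeated on every node that runs kubelet with a custom root-dir.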