Open jeffreyyjp opened 2 months ago
Hey, can you check the kubelet logs?
Also, is the kubelet service defined with a `--root-dir` param?
@rollandf Below is the kubelet.service
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
[Service]
WorkingDirectory=/var/lib/kubelet
ExecStartPre=/etc/kubernetes/kubelet-precheck.sh
ExecStart=/usr/bin/kubelet-1.23.6 \
--kubeconfig=/etc/kubernetes/admin.kubeconfig \
--config=/etc/kubernetes/kubelet-config.yaml \
--hostname-override=10.32.13.1 \
\
--container-runtime=remote \
--runtime-request-timeout=15m \
--container-runtime-endpoint=unix:///run/containerd/containerd.sock \
\
--network-plugin=cni \
--root-dir=/data/kubelet \
--v=2 \
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
Hey, can you check the kubelet logs?
How do I filter logs to find some useful message?
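One possible way to narrow the kubelet journal down is to keep only lines that mention the plugin registry or the resource name. The sketch below is a hypothetical helper, not part of kubelet or the device plugin; the keyword list is simply taken from messages quoted in this thread, and the resource name is the one used in this issue.

```python
# Substrings taken from kubelet messages quoted in this thread.
# The list is illustrative, not exhaustive.
KEYWORDS = ("Registering plugin", "plugins_registry", "device plugin")


def filter_kubelet_lines(lines, resource="mellanox.com/mlnx_sriov_rdma"):
    """Return the lines that mention the resource or one of KEYWORDS."""
    # The socket file name uses '_' where the resource name has '/',
    # so both spellings are matched.
    needles = KEYWORDS + (resource, resource.replace("/", "_"))
    return [line for line in lines if any(n in line for n in needles)]


# Small demonstration on made-up sample lines.
sample = [
    '"Registering plugin at endpoint" plugin="mellanox.com/mlnx_sriov_rdma"',
    "some unrelated kubelet message",
    "dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock",
]
for line in filter_kubelet_lines(sample):
    print(line)
```

In practice the input lines could come from the output of `journalctl -u kubelet --no-pager` saved to a file and read line by line.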
The issue seems to be with the use of root-dir
See similar discussion here:
https://github.com/Mellanox/k8s-rdma-shared-dev-plugin/issues/96
BTW, how did you install the cluster? Did you configure the `root-dir` yourself, or is it a default?
@rollandf Another question, I want to know if my configmap.yaml is fine for this plugin.
At first glance, it seems OK.
Do you know where the `root-dir` definition comes from?
@rollandf `root-dir` is defined when kubelet is installed with parms. I guess kubelet only watches the plugins_registry under its root-dir, so I don't get any log like "Registering plugin at endpoint" plugin="mellanox.com/mlnx_sriov_rdma" endpoint="/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock" in kubelet.
By the way, my root-dir is at /data/kubelet
What is `parms`? Any links?
For now, try pointing the host paths in the deployment yaml at the new root `/data/kubelet`:
/data/kubelet/device-plugins
here:
https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/master/deployments/sriovdp-daemonset.yaml#L62
/data/kubelet/plugins_registry
here:
https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/master/deployments/sriovdp-daemonset.yaml#L65
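The path change suggested above can be sketched as a small mapping from the default kubelet root to a custom one. This is a hypothetical helper for illustration only; it assumes the stock daemonset uses host paths under `/var/lib/kubelet`, and the function name is not from the plugin.

```python
import posixpath

DEFAULT_ROOT = "/var/lib/kubelet"  # kubelet's default root directory


def rebase_kubelet_path(path, root_dir, default_root=DEFAULT_ROOT):
    """Map a host path under the default kubelet root onto a custom --root-dir."""
    path = posixpath.normpath(path)
    default_root = posixpath.normpath(default_root)
    root_dir = posixpath.normpath(root_dir)
    if path == default_root:
        return root_dir
    prefix = default_root + "/"
    if path.startswith(prefix):
        return posixpath.join(root_dir, path[len(prefix):])
    # Not under the kubelet root: leave the path untouched.
    return path


for p in (
    "/var/lib/kubelet/device-plugins",
    "/var/lib/kubelet/plugins_registry",
    "/etc/pcidp",
):
    print(p, "->", rebase_kubelet_path(p, "/data/kubelet"))
```

Only the host side of each volume changes here; the mount path inside the container is left as it is in the daemonset.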
@rollandf Sure, but I still need to run `ln -s /data/kubelet/plugins_registry /var/lib/kubelet/plugins_registry`
and I can't find the reason. I guess the device plugin tells kubelet that the plugins_registry is in `/var/lib/kubelet`, which is hardcoded in the container, but the real device plugin socket is in the root-dir (`/data/kubelet`).
Aug 16 14:45:27 10.32.13.1 kubelet-1.23.6[587969]: I0816 14:45:27.456142 587969 manager.go:325] "Registering plugin at endpoint" plugin="mellanox.com/mlnx_sriov_rdma" endpoint="/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock"
Aug 16 14:45:27 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:27.456307 587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:28 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:28.456559 587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:30 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:30.045068 587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:32 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:32.244564 587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:36 10.32.13.1 kubelet-1.23.6[587969]: W0816 14:45:36.863409 587969 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock: connect: no such file or directory". Reconnecting...
Aug 16 14:45:37 10.32.13.1 kubelet-1.23.6[587969]: E0816 14:45:37.456842 587969 endpoint.go:63] "Can't create new endpoint with socket path" err="failed to dial device plugin: context deadline exceeded" path="/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock"
Hi @jeffreyyjp, can't you update the volume mount for the device plugin container?
I have set the "enhancement" label on this, since the device plugin never supported an alternative kubelet root dir.
After seeing @jeffreyyjp's latest comment, I believe it's not enough to update the mounts.
I believe it's because of how we do plugin registration: we set the endpoint to the path within the container in the PluginInfo message, which is part of the GetInfo call. See [1][2].
[1] https://github.com/kubernetes/kubernetes/blob/cb7b4ea648a97bdbf8f4f1b8655a7a110c9f78d0/staging/src/k8s.io/kubelet/pkg/apis/pluginregistration/v1/api.proto#L31
[2] https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin/blob/7dedc64cd1b89275059f33d7a2ecae9e03388e79/pkg/resources/server.go#L107
I think that if we leave the `Endpoint` field unset, Kubernetes will use the same path for the socket as it did for registration.
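The mismatch described above can be illustrated with a toy model. This is not the plugin's or kubelet's code; it assumes the container keeps its mount path `/var/lib/kubelet/plugins_registry` while the host path behind it is moved to `/data/kubelet/plugins_registry`, and the function names are hypothetical.

```python
def host_view(container_path, mounts):
    """Translate a path seen inside the container to the host path behind it.

    `mounts` maps container mount path -> host path (a simplified model
    of the daemonset's hostPath volumes).
    """
    for mount_path, host_path in mounts.items():
        if container_path == mount_path:
            return host_path
        if container_path.startswith(mount_path + "/"):
            return host_path + "/" + container_path[len(mount_path) + 1:]
    return None  # path is not backed by any host mount


def kubelet_can_dial(advertised_endpoint, mounts):
    """True if the advertised string is also where the socket lives on the host."""
    return host_view(advertised_endpoint, mounts) == advertised_endpoint


# Endpoint as advertised by the plugin, i.e. the path inside the container.
sock = "/var/lib/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock"
default_mounts = {"/var/lib/kubelet/plugins_registry": "/var/lib/kubelet/plugins_registry"}
custom_mounts = {"/var/lib/kubelet/plugins_registry": "/data/kubelet/plugins_registry"}

print(kubelet_can_dial(sock, default_mounts))  # True
print(kubelet_can_dial(sock, custom_mounts))   # False
print(host_view(sock, custom_mounts))
# /data/kubelet/plugins_registry/mellanox.com_mlnx_sriov_rdma.sock
```

In this model the symlink from `/var/lib/kubelet/plugins_registry` to `/data/kubelet/plugins_registry` on the host is what makes the advertised path resolve again, which matches the "no such file or directory" errors in the kubelet log above.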
@SchSeba I already updated the host path in my volume mount, but I still need to run `ln -s /data/kubelet/plugins_registry /var/lib/kubelet/plugins_registry`
on my host (not in the container). After that, everything is fine.
What happened?
After deploying this plugin, I can't get the SR-IOV resource on my nodes. It seems the plugin doesn't connect to kubelet: I can't find a line like
`Plugin: mellanox.com/mlnx_sriov_rdma gets registered successfully at Kubelet`
in the logs below.
What did you expect to happen?
To get the SR-IOV resource on my nodes.
What are the minimal steps needed to reproduce the bug?
Anything else we need to know?
Component Versions
Please fill in the below table with the version numbers of components used.
Config Files
Config file locations may be config dependent.
Device pool config file location (Try '/etc/pcidp/config.json')
Multus config (Try '/etc/cni/multus/net.d')
CNI config (Try '/etc/cni/net.d/')
Kubernetes deployment type ( Bare Metal, Kubeadm etc.)
Kubeconfig file
SR-IOV Network Custom Resource Definition
Logs
SR-IOV Network Device Plugin Logs (use `kubectl logs $PODNAME`)
I0815 07:35:27.139922 1 manager.go:57] Using Kubelet Plugin Registry Mode
I0815 07:35:27.140181 1 main.go:46] resource manager reading configs
I0815 07:35:27.140209 1 manager.go:86] raw ResourceList: { "resourceList": [ { "resourceName": "mlnx_sriov_rdma", "resourcePrefix": "mellanox.com", "selectors": { "vendors": ["15b3"], "devices": ["101c"], "driver": "mlx5_core", "isRdma": true } } ] }
I0815 07:35:27.140303 1 factory.go:211] *types.NetDeviceSelectors for resource mlnx_sriov_rdma is [0xc00023f0e0]
I0815 07:35:27.140315 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix:mellanox.com ResourceName:mlnx_sriov_rdma DeviceType:netDevice ExcludeTopology:false Selectors:0xc000190b28 AdditionalInfo:map[] SelectorObjs:[0xc00023f0e0]}]
I0815 07:35:27.140347 1 manager.go:217] validating resource name "mellanox.com/mlnx_sriov_rdma"
I0815 07:35:27.140354 1 main.go:62] Discovering host devices
I0815 07:35:28.124721 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.127282 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127429 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127538 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127629 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:05:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.127721 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.129909 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130018 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130121 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130213 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:47:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.130335 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:68:00.0 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.130477 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:68:00.1 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.130597 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.132753 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.132855 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.132941 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.133047 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:8e:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.133140 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135266 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135361 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135464 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135548 1 netDeviceProvider.go:67] netdevice AddTargetDevices(): device found: 0000:d2:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135638 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135656 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135661 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135666 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135673 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:05:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135679 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135684 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135690 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135694 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135699 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:47:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135703 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:68:00.0 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.135708 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:68:00.1 02 Intel Corporation Ethernet Controller X710 for 10GbE SFP+
I0815 07:35:28.135713 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135720 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135726 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135730 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135735 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:8e:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135742 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.0 02 Mellanox Technolo... MT28908 Family [ConnectX-6]
I0815 07:35:28.135747 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.1 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135752 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.2 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135757 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.3 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135761 1 auxNetDeviceProvider.go:84] auxnetdevice AddTargetDevices(): device found: 0000:d2:00.4 02 Mellanox Technolo... MT28908 Family [ConnectX-6 Virtual Fu...
I0815 07:35:28.135765 1 main.go:68] Initializing resource servers
I0815 07:35:28.135772 1 manager.go:117] number of config: 1
I0815 07:35:28.135785 1 manager.go:121] Creating new ResourcePool: mlnx_sriov_rdma
I0815 07:35:28.135789 1 manager.go:122] DeviceType: netDevice
W0815 07:35:28.149419 1 pciNetDevice.go:74] RDMA resources for 0000:68:00.0 not found. Are RDMA modules loaded?
W0815 07:35:28.149783 1 pciNetDevice.go:74] RDMA resources for 0000:68:00.1 not found. Are RDMA modules loaded?
I0815 07:35:28.156081 1 manager.go:138] initServers(): selector index 0 will register 16 devices
I0815 07:35:28.156097 1 factory.go:124] device added: [identifier: 0000:05:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156105 1 factory.go:124] device added: [identifier: 0000:05:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156110 1 factory.go:124] device added: [identifier: 0000:05:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156115 1 factory.go:124] device added: [identifier: 0000:05:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156118 1 factory.go:124] device added: [identifier: 0000:47:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156122 1 factory.go:124] device added: [identifier: 0000:47:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156126 1 factory.go:124] device added: [identifier: 0000:47:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156130 1 factory.go:124] device added: [identifier: 0000:47:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156135 1 factory.go:124] device added: [identifier: 0000:8e:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156139 1 factory.go:124] device added: [identifier: 0000:8e:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156143 1 factory.go:124] device added: [identifier: 0000:8e:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156146 1 factory.go:124] device added: [identifier: 0000:8e:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156150 1 factory.go:124] device added: [identifier: 0000:d2:00.1, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156154 1 factory.go:124] device added: [identifier: 0000:d2:00.2, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156158 1 factory.go:124] device added: [identifier: 0000:d2:00.3, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156162 1 factory.go:124] device added: [identifier: 0000:d2:00.4, vendor: 15b3, device: 101c, driver: mlx5_core]
I0815 07:35:28.156191 1 manager.go:156] New resource server is created for mlnx_sriov_rdma ResourcePool
I0815 07:35:28.156199 1 main.go:74] Starting all servers...
I0815 07:35:28.156803 1 server.go:254] starting mlnx_sriov_rdma device plugin endpoint at: mellanox.com_mlnx_sriov_rdma.sock
I0815 07:35:28.156947 1 main.go:79] All servers started.
I0815 07:35:28.156954 1 main.go:80] Listening for term signals
Multus logs (If enabled. Try '/var/log/multus.log' )
Kubelet logs (journalctl -u kubelet)