Open nvcastet opened 6 years ago
@nimbixler @johnathanhegge Any idea for this issue?
@nvcastet I'm just back and looking into this. I expect that the container isn't mapping in the devices, looking at the code.
Those errors look to be bubbling up from a mismatch in libraries. I made a new branch, deploy-bionic, which you can try. This changes the xenial image out for bionic. I'm hoping this addresses the issue with newer mlx drivers.
The image on Docker Hub is still xenial:
# docker run -td nimbix/k8s-rdma-device-plugin:1.10-bionic
9c24160b6fde1b6350574db07433b719ee1f6274613fa918e179a480ab8eee41
# docker exec -it 9c24160b6fde1b6350574db07433b719ee1f6274613fa918e179a480ab8eee41 bash
root@9c24160b6fde:/# cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Apologies, I missed a change on the ppc64le arch but now recreated the manifest and pushed it. I checked it, looks correct:
/# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
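For anyone repeating this check, a small helper can reduce the os-release dump to the one field that matters here. This is only a sketch: the function name `os_codename` is made up for illustration, and it assumes the file defines `VERSION_CODENAME`, as both outputs in this thread do.

```shell
# os_codename FILE: print VERSION_CODENAME from an os-release style file.
# FILE defaults to /etc/os-release. The file is sourced in a subshell so
# its variables do not leak into the calling shell.
os_codename() {
    file="${1:-/etc/os-release}"
    (
        unset VERSION_CODENAME
        . "$file" && printf '%s\n' "${VERSION_CODENAME:-unknown}"
    )
}

# Example (run on the host; <container-id> is a placeholder):
#   docker exec <container-id> cat /etc/os-release > /tmp/os-release
#   os_codename /tmp/os-release
```

If the helper prints xenial for the 1.10-bionic tag, the image being run is probably not the rebuilt one.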
@johnathanhegge Thanks, so I think now the devices are correctly detected:
[root@mycluster k8s-rdma-device-plugin]# kubectl logs rdma-device-plugin-daemonset-2vltw -n kube-system
time="2018-08-28T16:30:12Z" level=info msg="Fetching devices."
time="2018-08-28T16:30:12Z" level=debug msg="RDMA device list: [{{mlx5_2 uverbs2 /sys/class/infiniband_verbs/uverbs2 /sys/class/infiniband/mlx5_2} ib2} {{mlx5_0 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/mlx5_0} ib0} {{mlx5_3 uverbs3 /sys/class/infiniband_verbs/uverbs3 /sys/class/infiniband/mlx5_3} ib3} {{mlx5_1 uverbs1 /sys/class/infiniband_verbs/uverbs1 /sys/class/infiniband/mlx5_1} ib1}]"
time="2018-08-28T16:30:12Z" level=info msg="Starting FS watcher."
time="2018-08-28T16:30:12Z" level=info msg="Failed to created FS watcher."
But it is failing with "Failed to created FS watcher". Have you encountered this issue in the past?
Well, that's progress but I've not seen this issue with the notifier. Looks like permissions.
I found a similar bug with the nvidia plugin: https://github.com/NVIDIA/k8s-device-plugin/issues/65.
I'd say start with a describe on one of your rdma pods and see if it looks right:
kubectl -n kube-system describe pod rdma-device-plugin-daemonset-<UNIQUESTRING>
Are you using Kubernetes 1.10 or 1.11?
@johnathanhegge I am using 1.10.
So it is related to RHEL SELinux permissions. If I turn off SELinux enforcement (setenforce 0), it works well.
If I set the correct SELinux type on the device plugin folder:
sudo chcon -R -t container_file_t /var/lib/kubelet/device-plugins
It gets further but still has a permission issue for some reason:
time="2018-08-29T19:58:45Z" level=error msg="Could not register device plugin: context deadline exceeded"
time="2018-08-29T19:58:45Z" level=info msg="Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?"
time="2018-08-29T19:58:45Z" level=info msg="Starting to serve on /var/lib/kubelet/device-plugins/rdma.sock"
2018/08/29 19:58:45 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"; Reconnecting to {/var/lib/kubelet/device-plugins/kubelet.sock <nil>}
2018/08/29 19:58:46 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"; Reconnecting to {/var/lib/kubelet/device-plugins/kubelet.sock <nil>}
2018/08/29 19:58:47 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"; Reconnecting to {/var/lib/kubelet/device-plugins/kubelet.sock <nil>}
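Since the remaining failure is a "permission denied" on kubelet.sock, it may help to confirm which SELinux type the directory and the socket actually carry after the chcon. The sketch below is an assumption-laden helper, not part of the plugin: the function names are hypothetical, and it relies on GNU stat's `%C` format, which prints the SELinux context of a path.

```shell
# context_type CONTEXT: print the type field of an SELinux context string.
# Contexts have the form user:role:type:level, so the type is field 3.
context_type() {
    printf '%s\n' "$1" | cut -d: -f3
}

# path_selinux_type PATH: print the SELinux type currently set on PATH.
# Assumes GNU stat, where %C is the SELinux security context string.
path_selinux_type() {
    context_type "$(stat -c %C "$1")"
}

# Example, run on the worker node:
#   path_selinux_type /var/lib/kubelet/device-plugins
#   path_selinux_type /var/lib/kubelet/device-plugins/kubelet.sock
```

If both paths report container_file_t and the connect is still denied, the file label is probably not the only check involved, and the AVC denials in the audit log (for example via `ausearch -m avc`) would be the next place to look.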
@johnathanhegge Have you updated SELinux permissions on the worker node to make it work?
@nvcastet We are running Ubuntu on the host side and in the container, so we have no SELinux permission issues.
I tried using the daemonset (pulling the image from Docker Hub) and I got the errors/warnings below:
I am running on Power9 boxes with one Mellanox ConnectX-5 card on each system. If I run k8s-rdma-device-plugin outside of a container on the worker node, it works well. Did I miss a setting to run the plugin in a daemonset? @johnathanhegge
Thank you,
Nicolas Castet