nimbix / k8s-rdma-device-plugin

RDMA device plugin for Kubernetes
Apache License 2.0

DaemonSet deployment does not work #1

Open nvcastet opened 6 years ago

nvcastet commented 6 years ago

I tried using the daemonset (pulling the image from Docker Hub) and got the errors/warnings below:

[root@mycluster k8s-rdma-device-plugin]# kubectl -n kube-system apply -f rdma-device-plugin.yml
daemonset.extensions "rdma-device-plugin-daemonset" created
[root@mycluster k8s-rdma-device-plugin]# kubectl logs rdma-device-plugin-daemonset-j7cf2 -n kube-system
time="2018-08-15T19:30:31Z" level=info msg="Fetching devices."
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs3
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs2
time="2018-08-15T19:30:31Z" level=info msg="No devices found."
[root@mycluster k8s-rdma-device-plugin]# kubectl logs rdma-device-plugin-daemonset-kt6lg  -n kube-system
time="2018-08-15T19:30:35Z" level=info msg="Fetching devices."
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs3
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs2
time="2018-08-15T19:30:35Z" level=info msg="No devices found."

I am running on Power9 boxes with one Mellanox ConnectX-5 card in each system. If I run k8s-rdma-device-plugin outside of a container on the worker node, it works well. Did I miss a setting needed to run the plugin in a daemonset? @johnathanhegge

Thank you,

Nicolas Castet

nvcastet commented 6 years ago

@nimbixler @johnathanhegge Any ideas on this issue?

johnathanhegge commented 6 years ago

@nvcastet I'm just back and looking into this. Looking at the code, I expect that the container isn't mapping in the devices.

johnathanhegge commented 6 years ago

Those errors look to be bubbling up from a mismatch in libraries. I made a new branch, deploy-bionic, which you can try. It swaps the xenial base image for bionic. I'm hoping this addresses the issue with newer mlx drivers.

nvcastet commented 6 years ago

The image on Docker Hub is still xenial:

# docker run -td nimbix/k8s-rdma-device-plugin:1.10-bionic
9c24160b6fde1b6350574db07433b719ee1f6274613fa918e179a480ab8eee41
# docker exec -it 9c24160b6fde1b6350574db07433b719ee1f6274613fa918e179a480ab8eee41 bash
root@9c24160b6fde:/# cat /etc/os-release 
NAME="Ubuntu"
VERSION="16.04.5 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.5 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

johnathanhegge commented 6 years ago

Apologies, I missed a change for the ppc64le arch. I have now recreated the manifest and pushed it. I checked it and it looks correct:

/# cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
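For anyone repeating this check on another architecture, the base release of a tag can be confirmed without opening an interactive shell. The following is only a minimal sketch: the helper name codename_from_os_release is hypothetical, and the docker invocation in the comment assumes the image ships cat, which an Ubuntu base does.

```shell
# Read os-release content on stdin and print the VERSION_CODENAME value.
# Surrounding double quotes, if any, are stripped.
codename_from_os_release() {
  sed -n 's/^VERSION_CODENAME=//p' | tr -d '"'
}

# Hypothetical usage against the tag discussed above (requires Docker):
#   docker run --rm --entrypoint cat nimbix/k8s-rdma-device-plugin:1.10-bionic /etc/os-release \
#     | codename_from_os_release
```

Running it on each node architecture shows whether the multi-arch manifest resolves to the intended base image there.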

nvcastet commented 6 years ago

@johnathanhegge Thanks, so I think now the devices are correctly detected:

[root@mycluster k8s-rdma-device-plugin]# kubectl logs rdma-device-plugin-daemonset-2vltw        -n kube-system
time="2018-08-28T16:30:12Z" level=info msg="Fetching devices."
time="2018-08-28T16:30:12Z" level=debug msg="RDMA device list: [{{mlx5_2 uverbs2 /sys/class/infiniband_verbs/uverbs2 /sys/class/infiniband/mlx5_2} ib2} {{mlx5_0 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/mlx5_0} ib0} {{mlx5_3 uverbs3 /sys/class/infiniband_verbs/uverbs3 /sys/class/infiniband/mlx5_3} ib3} {{mlx5_1 uverbs1 /sys/class/infiniband_verbs/uverbs1 /sys/class/infiniband/mlx5_1} ib1}]"
time="2018-08-28T16:30:12Z" level=info msg="Starting FS watcher."
time="2018-08-28T16:30:12Z" level=info msg="Failed to created FS watcher."

But it is failing with "Failed to created FS watcher." Have you encountered this issue in the past?

johnathanhegge commented 6 years ago

Well, that's progress but I've not seen this issue with the notifier. Looks like permissions.

I found a similar bug with the nvidia plugin: https://github.com/NVIDIA/k8s-device-plugin/issues/65.

I'd say start with a describe on one of your rdma pods and see if it looks right:

kubectl -n kube-system describe pod rdma-device-plugin-daemonset-<UNIQUESTRING>
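If the watcher failure really is a permissions problem, one quick check is whether the process inside the container can list and write the kubelet device plugin directory at all. This is a rough sketch, not something taken from the plugin: check_plugin_dir is a hypothetical helper, and it assumes the directory is mounted in the container at the usual /var/lib/kubelet/device-plugins path.

```shell
# Report whether a directory exists and can be listed and written by the
# current process. Defaults to the usual kubelet device plugin directory.
check_plugin_dir() {
  dir="${1:-/var/lib/kubelet/device-plugins}"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  if [ -r "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]; then
    echo "accessible: $dir"
  else
    echo "denied: $dir"
    return 1
  fi
}
```

A "missing" result would point at the volume mount in the DaemonSet spec, while "denied" would point at host-side permissions.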

Are you using Kubernetes 1.10 or 1.11?

nvcastet commented 6 years ago

@johnathanhegge I am using 1.10. It turns out to be related to SELinux permissions on RHEL. If I switch SELinux to permissive mode (setenforce 0), it works well. If I set the correct SELinux type on the device plugin folder:

sudo chcon -R -t container_file_t /var/lib/kubelet/device-plugins

it gets further but still hits a permission issue for some reason:

time="2018-08-29T19:58:45Z" level=error msg="Could not register device plugin: context deadline exceeded"
time="2018-08-29T19:58:45Z" level=info msg="Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?"
time="2018-08-29T19:58:45Z" level=info msg="Starting to serve on /var/lib/kubelet/device-plugins/rdma.sock"
2018/08/29 19:58:45 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"; Reconnecting to {/var/lib/kubelet/device-plugins/kubelet.sock <nil>}
2018/08/29 19:58:46 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"; Reconnecting to {/var/lib/kubelet/device-plugins/kubelet.sock <nil>}
2018/08/29 19:58:47 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"; Reconnecting to {/var/lib/kubelet/device-plugins/kubelet.sock <nil>}

@johnathanhegge Have you updated SELinux permissions on the worker node to make it work?
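For reference, a minimal sketch of how the labels involved can be inspected on the worker node. The helper name selinux_type is hypothetical, and the commands in the comments assume GNU stat and a running auditd. The sketch only reports labels and does not by itself explain the denial.

```shell
# Extract the type field (third colon-separated field) from an SELinux
# context such as "system_u:object_r:container_file_t:s0".
selinux_type() {
  printf '%s\n' "$1" | cut -d: -f3
}

# Hypothetical usage on the worker node (GNU stat prints the context with %C):
#   selinux_type "$(stat -c %C /var/lib/kubelet/device-plugins)"
#   selinux_type "$(stat -c %C /var/lib/kubelet/device-plugins/kubelet.sock)"
# Recent denials, if auditd is running, can be listed with:
#   ausearch -m avc -ts recent
```

Comparing the type of the directory with the type of kubelet.sock shows whether the relabel reached the socket, and the AVC records name the exact permission being denied.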

johnathanhegge commented 6 years ago

@nvcastet We are running Ubuntu both on the host side and in the container, so we have not hit any SELinux permission issues.