intel / intel-device-plugins-for-kubernetes

Collection of Intel device plugins for Kubernetes
Apache License 2.0

GPU crashing on 1 node. #1628

Open ryanm101 opened 11 months ago

ryanm101 commented 11 months ago
NAME   STATUS   ROLES                                       AGE     VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                           KERNEL-VERSION          CONTAINER-RUNTIME
nuc1   Ready    control-plane,etcd,master,worker            2y40d   v1.26.9+k3s1   x.x.x.x   <none>        Fedora Linux 38 (Server Edition)   6.5.6-200.fc38.x86_64   containerd://1.7.6-k3s1.26
nuc2   Ready    control-plane,coral.ai,etcd,master,worker   127m    v1.26.9+k3s1   x.x.x.x   <none>        Fedora Linux 39 (Server Edition)   6.6.2-201.fc39.x86_64   containerd://1.7.6-k3s1.26
nuc3   Ready    control-plane,etcd,master,worker            42d     v1.26.9+k3s1   x.x.x.x   <none>        Fedora Linux 38 (Server Edition)   6.5.8-200.fc38.x86_64   containerd://1.7.6-k3s1.26

I'm running three master nodes on k3s. The GPU plugin deploys fine on NUC1 and NUC3, but on NUC2 the container crashes with:

E1216 11:45:32.208374       1 manager.go:146] Failed to serve gpu.intel.com/i915: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"
Cannot register to kubelet service
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).registerWithKubelet
    /go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:352
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).setupAndServe
    /go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:280
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).Serve
    /go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:207
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*Manager).handleUpdate.func1
    /go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/manager.go:144
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1598

command used to provision NUC2:

curl -sfL https://get.k3s.io | K3S_URL=https://cluster.domain:6443 K3S_TOKEN=1:server:1 INSTALL_K3S_VERSION=v1.26.9+k3s1 sh -s - server --flannel-backend=none --disable-network-policy --cluster-cidr=x.x.x.x/x --service-cidr=x.x.x.x/x --cluster-init --disable=servicelb --disable traefik --selinux

The only differences between NUC2 and NUC1/3 are:

  1. NUC2 is FC39 and the others are FC38
  2. When starting k3s on NUC2, it complained about SELinux and said to add '--selinux' to the startup command (the other two nodes don't have this flag)

Any advice appreciated. I will test re-adding the node without --selinux and, if all else fails, move it to FC38.

tkatila commented 11 months ago

Hi @ryanm101

I found a somewhat similar error here: https://github.com/intel/intel-technology-enabling-for-openshift/issues/113. That issue lists a couple of workarounds that might apply. Could you try them out?

tkatila commented 11 months ago

I reproduced the issue on a VM. The device plugin seems to work with SELinux disabled but fails with it enabled. The SELinux audit log contains this entry:

type=AVC msg=audit(1702889339.432:3913): avc:  denied  { connectto } for  pid=16332 comm="intel_gpu_devic" path="/var/lib/kubelet/device-plugins/kubelet.sock" scontext=system_u:system_r:container_device_plugin_t:s0:c620,c968 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=0
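
For readers unfamiliar with AVC records, the denial above can be broken into the four pieces an SELinux allow rule needs: source type, target type, object class, and permission. The following is a minimal sketch that only parses the log line quoted above with standard text tools; the helper name avc_field is hypothetical, and the printed rule is an approximation of what audit2allow would propose, not its verified output.

```shell
# AVC record copied verbatim from the audit log entry above.
avc='type=AVC msg=audit(1702889339.432:3913): avc:  denied  { connectto } for  pid=16332 comm="intel_gpu_devic" path="/var/lib/kubelet/device-plugins/kubelet.sock" scontext=system_u:system_r:container_device_plugin_t:s0:c620,c968 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=0'

# Hypothetical helper: print the value of a key=value field in a record.
avc_field() {
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p"
}

# The SELinux type is the third colon-separated part of a context.
source_type=$(avc_field scontext "$avc" | cut -d: -f3)
target_type=$(avc_field tcontext "$avc" | cut -d: -f3)
class=$(avc_field tclass "$avc")

# The denied permission sits between the braces.
perm=$(printf '%s\n' "$avc" | sed -n 's/.*{ \(.*\) }.*/\1/p')

echo "allow $source_type $target_type:$class $perm;"
```

Read this way, the record says that a process labelled container_device_plugin_t was refused a connectto on a unix_stream_socket owned by a process labelled container_runtime_t, which matches the "permission denied" on kubelet.sock in the plugin log.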

I'll need to check whether this is the same as, or similar to, the issue linked above.

EDIT: using setenforce 0 is a workaround, though not a viable one if SELinux enforcement is required.

ryanm101 commented 11 months ago

setenforce 0 fixes it, but NUC1 and NUC3 are both enforcing and working fine.

tkatila commented 11 months ago

I followed instructions from the audit entry:

sudo ausearch -c 'intel_gpu_devic' --raw | audit2allow -M intelgpudevice
sudo semodule -X 300 -i intelgpudevice.pp

That seems to allow the device plugin to access kubelet. I'm not sure where we should file a bug: FC, k3s or somewhere else.
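
Since audit2allow turns every matching denial into an allow rule, it may be worth reviewing the generated module and confirming that it loaded. A minimal sketch, assuming the module name intelgpudevice from the commands above and that it is run from the directory where audit2allow -M wrote its .te and .pp files:

```shell
module=intelgpudevice

# audit2allow -M writes a human-readable .te file next to the compiled .pp;
# print it to see exactly which rules the module adds.
if [ -f "$module.te" ]; then
    cat "$module.te"
fi

if command -v semodule >/dev/null 2>&1; then
    # List installed policy modules and look for ours.
    sudo semodule -l | grep -w "$module" || echo "$module is not loaded"
    # To undo the workaround later, remove it at the same priority:
    #   sudo semodule -X 300 -r "$module"
else
    echo "semodule not found; SELinux policy tools do not appear to be installed"
fi
```

Keeping the removal command at hand makes it easy to drop the local module once a fixed policy package is available.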

mregmi commented 11 months ago

The plugins already run with the proper label to have access to kubelet. That policy went into the container-selinux package. Is that package installed on your node?

ryanm101 commented 11 months ago

Those get installed alongside k3s, and they are installed.

ryanm101 commented 11 months ago

> I followed instructions from the audit entry:
>
> sudo ausearch -c 'intel_gpu_devic' --raw | audit2allow -M intelgpudevice
> sudo semodule -X 300 -i intelgpudevice.pp
>
> That seems to allow the device plugin to access kubelet. I'm not sure where we should file a bug: FC, k3s or somewhere else.

Yes, this seems to solve it.

tkatila commented 11 months ago

@mregmi do you happen to know the container-selinux version?
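
To compare a working node with NUC2, one option is to collect the policy package versions and ask the loaded policy whether it already contains the rule that was denied. This is a sketch under assumptions: the package name k3s-selinux and the availability of sesearch (from setools-console on Fedora) are not confirmed anywhere in this thread, and the type names are taken from the AVC entry quoted earlier.

```shell
# Types and class taken from the AVC denial quoted earlier in the thread.
src_type=container_device_plugin_t
tgt_type=container_runtime_t
obj_class=unix_stream_socket

# Versions of the packages that are assumed to carry the container policy.
if command -v rpm >/dev/null 2>&1; then
    rpm -q container-selinux k3s-selinux || true
fi

# Ask the loaded policy whether the denied access is already allowed.
# No matching rule in the output suggests the policy lacks it.
if command -v sesearch >/dev/null 2>&1; then
    sesearch -A -s "$src_type" -t "$tgt_type" -c "$obj_class" -p connectto || true
fi
```

Running the same commands on NUC1 or NUC3 and on NUC2 would show whether the difference lies in the package versions or in the rules the policy actually contains.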

eero-t commented 2 months ago

@tkatila Was this SELinux issue already handled?