tkestack / gpu-manager

Error: Unable to set Type=notify in systemd service file? #151

Open Fvoiretryzig opened 2 years ago

Fvoiretryzig commented 2 years ago

I compiled gpu-manager for arm64 and ran it on a Jetson Nano. However, when I run kubectl create -f gpu-manager.yaml, I get the following output:

copy /usr/local/host/lib/aarch64-linux-gnu/libcuda.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libcuda.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libcuda.so.1 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.440.18 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libnvidia-ptxjitcompiler.so.1 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libnvidia-fatbinaryloader.so.440.18 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGL.so.1.0.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGL.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGL.so.1 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLX.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLX.so.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLX.so.0.0.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libOpenGL.so.0.0.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libOpenGL.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libOpenGL.so.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLESv1_CM.so.1 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLESv1_CM.so.1.0.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLESv1_CM.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLESv2.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLESv2.so.2.0.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLESv2.so.2 to /usr/local/nvidia/lib
copy /usr/local/host/lib/chromium-browser/swiftshader/libGLESv2.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/chromium-browser/libGLESv2.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libEGL.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libEGL.so.1 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libEGL.so.1.0.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/chromium-browser/libEGL.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/chromium-browser/swiftshader/libEGL.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLdispatch.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLdispatch.so.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/libGLdispatch.so.0.0.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libGLX_nvidia.so.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra-egl/libEGL_nvidia.so.0 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra-egl/libGLESv2_nvidia.so.2 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra-egl/libGLESv1_CM_nvidia.so.1 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libnvidia-eglcore.so.32.5.1 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libnvidia-egl-wayland.so to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libnvidia-egl-wayland.so.1 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libnvidia-glcore.so.32.5.1 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libnvidia-tls.so.32.5.1 to /usr/local/nvidia/lib
copy /usr/local/host/lib/aarch64-linux-gnu/tegra/libnvidia-glsi.so.32.5.1 to /usr/local/nvidia/lib
rebuild ldcache
launch gpu manager
E0412 01:51:13.374667   32218 server.go:133] Unable to set Type=notify in systemd service file?

Following issues #7 and #40, I modified the YAML file and made sure the Docker runtime is runc, not nvidia-container-runtime. This is my YAML file:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run on nodes that have a GPU device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: myimage/gpu-manager:latest
          imagePullPolicy: Always
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: checkpoint
              mountPath: /etc/gpu-manager/checkpoint
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
            - name: kube-root
              mountPath: /root/.kube
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "5"
            - name: EXTRA_FLAGS
              value: "--logtostderr=false --cgroup-driver=systemd"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: checkpoint
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/checkpoint
        # We have to mount the whole /var/run directory into container, because of bind mount docker.sock
        # inode change after host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount /usr directory instead of specified library path, because of non-existing
        # problem for different distro
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
        - name: kube-root
          hostPath:
            type: Directory
            path: /root/.kube

I copied the .kube directory from the master node to each worker node. How can I deal with this error?
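
The two host settings mentioned above (Docker's default runtime, and the cgroup driver passed through --cgroup-driver) can be checked on each node before redeploying. This is a rough sketch, assuming the Docker CLI is on the node's PATH and that `docker info --format` exposes the `DefaultRuntime` and `CgroupDriver` fields. `checkDockerInfo` is a hypothetical helper, and the idea that Docker's cgroup driver should agree with the flag given to gpu-manager is an assumption rather than something confirmed in this thread.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// checkDockerInfo is a hypothetical helper. It takes the output of
// `docker info --format '{{.DefaultRuntime}} {{.CgroupDriver}}'` and the
// cgroup driver passed to gpu-manager, and returns a list of mismatches.
func checkDockerInfo(info, wantCgroupDriver string) []string {
	fields := strings.Fields(info)
	if len(fields) != 2 {
		return []string{fmt.Sprintf("unexpected docker info output: %q", info)}
	}
	var problems []string
	if fields[0] != "runc" {
		problems = append(problems, "default runtime is "+fields[0]+", expected runc")
	}
	if fields[1] != wantCgroupDriver {
		problems = append(problems,
			"docker cgroup driver is "+fields[1]+", gpu-manager was given "+wantCgroupDriver)
	}
	return problems
}

func main() {
	out, err := exec.Command("docker", "info", "--format",
		"{{.DefaultRuntime}} {{.CgroupDriver}}").Output()
	if err != nil {
		fmt.Println("could not run docker info:", err)
		return
	}
	// "systemd" mirrors --cgroup-driver=systemd in EXTRA_FLAGS above.
	problems := checkDockerInfo(string(out), "systemd")
	if len(problems) == 0 {
		fmt.Println("docker settings match the manifest")
	}
	for _, p := range problems {
		fmt.Println("mismatch:", p)
	}
}
```

Keeping the comparison in a pure function makes it easy to feed in output collected from several nodes without running Docker locally.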

phoenixwu0229 commented 2 years ago

Hey, did you get this problem solved?

Fvoiretryzig commented 2 years ago

Hey, did you get this problem solved?

@phoenixwu0229 Not yet.

fu7100 commented 2 years ago

I hit the same problem on OpenShift 4. Following the FAQ, I changed container-runtime-endpoint and set the cgroup driver to systemd.

Then the container failed at startup with this error:

rebuild ldcache
launch gpu manager
E0516 02:59:32.771447 1270729 server.go:131] Unable to set Type=notify in systemd service file?
F0516 02:59:33.872799 1270729 tree.go:102] Can not initialize nvidia tree, err no input
goroutine 10 [running]:
k8s.io/klog.stacks(0xc000109c00, 0xc000016000, 0x58, 0x193)
    /go/pkg/mod/k8s.io/klog@v1.0.0/klog.go:875 +0xb8
k8s.io/klog.(loggingT).output(0x27ae5a0, 0xc000000003, 0xc0001c0230, 0x250db7f, 0x7, 0x66, 0x0)
    /go/pkg/mod/k8s.io/klog@v1.0.0/klog.go:826 +0x330
k8s.io/klog.(loggingT).printf(0x27ae5a0, 0x3, 0x17d4c8c, 0x26, 0xc0003ebe30, 0x1, 0x1)
    /go/pkg/mod/k8s.io/klog@v1.0.0/klog.go:707 +0x14b
k8s.io/klog.Fatalf(...)
    /go/pkg/mod/k8s.io/klog@v1.0.0/klog.go:1276
tkestack.io/gpu-manager/pkg/device/nvidia.(NvidiaTree).Init(0xc0001c6140, 0x0, 0x0)
    /root/rpmbuild/BUILD/gpu-manager-1.1.5/pkg/device/nvidia/tree.go:102 +0x128
tkestack.io/gpu-manager/pkg/server.(managerImpl).Run(0xc00004a7c0, 0xc000136dc0, 0x0)
    /root/rpmbuild/BUILD/gpu-manager-1.1.5/pkg/server/server.go:171 +0x66b
created by tkestack.io/gpu-manager/cmd/manager/app.Run
    /root/rpmbuild/BUILD/gpu-manager-1.1.5/cmd/manager/app/app.go:83 +0x3da

kitt1987 commented 2 years ago

I hit the same problem on OpenShift 4. Following the FAQ, I changed container-runtime-endpoint and set the cgroup driver to systemd.

            - name: EXTRA_FLAGS
              # before
              value: "--logtostderr=false"
              # after
              value: "--logtostderr=false --container-runtime-endpoint=/var/run/crio/crio.sock --cgroup-driver=systemd"

Then the container failed at startup with this error:

rebuild ldcache
launch gpu manager
E0516 02:59:32.771447 1270729 server.go:131] Unable to set Type=notify in systemd service file?
F0516 02:59:33.872799 1270729 tree.go:102] Can not initialize nvidia tree, err no input
goroutine 10 [running]:
k8s.io/klog.stacks(0xc000109c00, 0xc000016000, 0x58, 0x193)
    /go/pkg/mod/k8s.io/klog@v1.0.0/klog.go:875 +0xb8
k8s.io/klog.(loggingT).output(0x27ae5a0, 0xc000000003, 0xc0001c0230, 0x250db7f, 0x7, 0x66, 0x0)
    /go/pkg/mod/k8s.io/klog@v1.0.0/klog.go:826 +0x330
k8s.io/klog.(loggingT).printf(0x27ae5a0, 0x3, 0x17d4c8c, 0x26, 0xc0003ebe30, 0x1, 0x1)
    /go/pkg/mod/k8s.io/klog@v1.0.0/klog.go:707 +0x14b
k8s.io/klog.Fatalf(...)
    /go/pkg/mod/k8s.io/klog@v1.0.0/klog.go:1276
tkestack.io/gpu-manager/pkg/device/nvidia.(NvidiaTree).Init(0xc0001c6140, 0x0, 0x0)
    /root/rpmbuild/BUILD/gpu-manager-1.1.5/pkg/device/nvidia/tree.go:102 +0x128
tkestack.io/gpu-manager/pkg/server.(managerImpl).Run(0xc00004a7c0, 0xc000136dc0, 0x0)
    /root/rpmbuild/BUILD/gpu-manager-1.1.5/pkg/server/server.go:171 +0x66b
created by tkestack.io/gpu-manager/cmd/manager/app.Run
    /root/rpmbuild/BUILD/gpu-manager-1.1.5/cmd/manager/app/app.go:83 +0x3da

Try to install the NVIDIA GPU driver first.

kooqi commented 2 years ago

Same problem, v1.9.0
gpu-manager 1.0.9 & v1.1.5

fans-of-shuey commented 1 year ago

Did you get it running on the Jetson? I've run into this problem too.