tkestack / gpu-manager


Use a fraction gpu resource, fail to get response from manager #164

Open · weixiujuan opened this issue 2 years ago

weixiujuan commented 2 years ago

Please help me solve this problem. The information is as follows. Thank you.

The pod's GPU resource limits are configured as follows:

    resources:
      limits:
        tencent.com/vcuda-core: "20"
        tencent.com/vcuda-memory: "20"
      requests:
        tencent.com/vcuda-core: "20"
        tencent.com/vcuda-memory: "20"
    env:
      - name: LOGGER_LEVEL
        value: "5"

The running algorithm program reports the following error:

/tmp/cuda-control/src/loader.c:1056 config file: /etc/vcuda/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/vcuda.config
/tmp/cuda-control/src/loader.c:1057 pid file: /etc/vcuda/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/pids.config
/tmp/cuda-control/src/loader.c:1061 register to remote: pod uid: tainerd.service, cont id: kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice
F0607 15:56:33.572429     158 client.go:78] fail to get response from manager, error rpc error: code = Unknown desc = can't find kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice from docker
/tmp/cuda-control/src/register.c:87 rpc client exit with 255

gpu-manager.INFO log contents are as follows:

I0607 15:56:33.571262  626706 manager.go:369] UID: tainerd.service, cont: kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice want to registration
I0607 15:56:33.571439  626706 manager.go:455] Write /etc/gpu-manager/vm/tainerd.service/kubepods-besteffort-podbf73f491_6382_4c25_8c15_08362365ecf6.slice/pids.config
I0607 15:56:33.573392  626706 logs.go:79] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

gpu-manager.WARNING log contents are as follows:

W0607 15:56:44.887813 626706 manager.go:290] Find orphaned pod tainerd.service

gpu-manager.ERROR and gpu-manager.FATAL contain no error logs.

My gpu-manager.yaml is as follows:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: gpu-manager
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gpu-manager-role
subjects:
- kind: ServiceAccount
  name: gpu-manager
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-manager-daemonset
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: gpu-manager-ds
    spec:
      serviceAccount: gpu-manager
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: tencent.com/vcuda-core
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      # only run on nodes that have a GPU device
      nodeSelector:
        nvidia-device-enable: enable
      hostPID: true
      containers:
        - image: thomassong/gpu-manager:1.1.5
          imagePullPolicy: IfNotPresent
          name: gpu-manager
          securityContext:
            privileged: true
          ports:
            - containerPort: 5678
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: vdriver
              mountPath: /etc/gpu-manager/vdriver
            - name: vmdata
              mountPath: /etc/gpu-manager/vm
            - name: log
              mountPath: /var/log/gpu-manager
            - name: checkpoint
              mountPath: /etc/gpu-manager/checkpoint
            - name: run-dir
              mountPath: /var/run
            - name: cgroup
              mountPath: /sys/fs/cgroup
              readOnly: true
            - name: usr-directory
              mountPath: /usr/local/host
              readOnly: true
            - name: kube-root
              mountPath: /root/.kube
              readOnly: true
          env:
            - name: LOG_LEVEL
              value: "5"
            - name: EXTRA_FLAGS
              value: "--logtostderr=false --container-runtime-endpoint=/var/run/containerd/containerd.sock --cgroup-driver=systemd"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: device-plugin
          hostPath:
            type: Directory
            path: /var/lib/kubelet/device-plugins
        - name: vmdata
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vm
        - name: vdriver
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/vdriver
        - name: log
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/log
        - name: checkpoint
          hostPath:
            type: DirectoryOrCreate
            path: /etc/gpu-manager/checkpoint
        # We have to mount the whole /var/run directory into the container, because the bind-mounted docker.sock
        # inode changes after the host docker is restarted
        - name: run-dir
          hostPath:
            type: Directory
            path: /var/run
        - name: cgroup
          hostPath:
            type: Directory
            path: /sys/fs/cgroup
        # We have to mount the whole /usr directory instead of a specific library path, because the
        # required library path may not exist or may differ across distros
        - name: usr-directory
          hostPath:
            type: Directory
            path: /usr
        - name: kube-root
          hostPath:
            type: Directory
            path: /root/.kube
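
Note on the manifest above: the rest of this thread hinges on the EXTRA_FLAGS entry. Judging by the flag names and the behaviour reported here, it points gpu-manager at the containerd socket and tells it to resolve containers through systemd-style cgroup paths (the *.slice names in the logs). Here is an annotated copy of that entry; the comments are an interpretation based on the flag names and this thread, not project documentation:

- name: EXTRA_FLAGS
  value: "--logtostderr=false --container-runtime-endpoint=/var/run/containerd/containerd.sock --cgroup-driver=systemd"
  # --container-runtime-endpoint : presumably where gpu-manager looks containers up (containerd here)
  # --cgroup-driver=systemd      : presumably makes gpu-manager expect systemd-style cgroup paths;
  #                                it has to match the cgroup driver the kubelet and runtime actually use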
DennisYoung96 commented 2 years ago

Same with me. Did you solve it?

weixiujuan commented 2 years ago

Hi, I lowered the Kubernetes version to v1.20 and it works fine. Did you solve it?

zhichenghe commented 2 years ago

We have the same issue on Kubernetes v1.18.6.

lynnfi commented 1 year ago

The same error here. I think the reason is "--container-runtime-endpoint=/var/run/containerd/containerd.sock --cgroup-driver=systemd": using containerd as the container runtime causes this problem. I will try to solve it.

lynnfi commented 1 year ago

In reply to "Same with me. Did you solve it?":

I changed the k8s cgroup driver from systemd to cgroupfs, and it works well. Do not use --cgroup-driver=systemd. The config looks like this:

env:
  - name: LOG_LEVEL
    value: "5"
  - name: EXTRA_FLAGS
    value: "--logtostderr=false --container-runtime-endpoint=/var/run/containerd/containerd.sock"
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
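
For completeness: switching away from the systemd cgroup driver involves more than the gpu-manager flag. As the comment above implies, the kubelet (and the container runtime) on that node should use the cgroupfs driver as well, otherwise the cgroup paths will again fail to match. Below is a rough sketch of the kubelet side only, assuming the cluster uses the standard KubeletConfiguration file; the path and surrounding fields depend on how the cluster was set up, and containerd's own SystemdCgroup setting in its config should agree as well:

# /var/lib/kubelet/config.yaml (typical kubeadm location; the path may differ)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: cgroupfs    # must agree with the container runtime's cgroup setting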