volcano-sh / devices

Device plugins for Volcano, e.g. GPU
Apache License 2.0
97 stars 41 forks

device-plugin panic #42

Closed lengrongfu closed 1 year ago

lengrongfu commented 1 year ago
I0615 08:06:10.942719       1 plugin.go:382] Allocate Response [&ContainerAllocateResponse{Envs:map[string]string{CUDA_DEVICE_MEMORY_LIMIT_0: 1024m,CUDA_DEVICE_MEMORY_SHARED_CACHE: /tmp/vgpu/6b5a834e-6fec-47d6-b629-0468cd18ba69.cache,NVIDIA_VISIBLE_DEVICES: GPU-d8794152-5506-fe60-be38-c6ff3d35dbf4,},Mounts:[]*Mount{&Mount{ContainerPath:/usr/local/vgpu/libvgpu.so,HostPath:/usr/local/vgpu/libvgpu.so,ReadOnly:true,},&Mount{ContainerPath:/etc/ld.so.preload,HostPath:/usr/local/vgpu/ld.so.preload,ReadOnly:true,},&Mount{ContainerPath:/tmp/vgpu,HostPath:/tmp/vgpu/containers/1a7defed-9fff-4feb-8921-45cc7ea253f7_vgpu2,ReadOnly:false,},&Mount{ContainerPath:/tmp/vgpulock,HostPath:/tmp/vgpulock,ReadOnly:false,},},Devices:[]*DeviceSpec{},Annotations:map[string]string{},}]
I0615 08:06:11.002096       1 util.go:229] TrySuccess:
I0615 08:06:11.002123       1 util.go:235] AllDevicesAllocateSuccess releasing lock
I0615 08:06:11.338349       1 plugin.go:309] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-d8794152-5506-fe60-be38-c6ff3d35dbf4-5],}]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1092953]

goroutine 68 [running]:
volcano.sh/k8s-device-plugin/pkg/plugin/vgpu4pd.(*NvidiaDevicePlugin).Allocate(0xc00038ec80, {0x14cfee0, 0xc0003ee510}, 0xc0005eca00)
    /go/src/volcano.sh/devices/pkg/plugin/vgpu4pd/plugin.go:326 +0x353
k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1._DevicePlugin_Allocate_Handler({0x12920a0?, 0xc00038ec80}, {0x14cfee0, 0xc0003ee510}, 0xc00062e060, 0x0)
    /go/pkg/mod/k8s.io/kubelet@v0.18.2/pkg/apis/deviceplugin/v1beta1/api.pb.go:1192 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003a41a0, {0x14d4df8, 0xc0004cc000}, 0xc0000b2100, 0xc000367aa0, 0x1ce57f8, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:1082 +0xcab
google.golang.org/grpc.(*Server).handleStream(0xc0003a41a0, {0x14d4df8, 0xc0004cc000}, 0xc0000b2100, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:1405 +0xa13
google.golang.org/grpc.(*Server).serveStreams.func1.1()
    /go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:746 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:744 +0xea
lengrongfu commented 1 year ago

@archlitchi Can you help me? After locating the code, I found that the pod patch call reports success, but the data is never written to the pod.

https://github.com/volcano-sh/volcano/blob/2721275aebfc26ac2c0e8ff4afeda1dbd852061c/pkg/scheduler/api/devices/nvidia/vgpu/device_info.go#L219
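
(For context, the panic above comes from vgpu4pd's Allocate at plugin.go:326 when the scheduler-written vGPU assignment cannot be found on the pod. Below is a minimal, hypothetical Go sketch of the defensive pattern that would avoid the crash; the type, function names, and annotation key are placeholders, not the actual plugin code.)

package main

import (
	"errors"
	"fmt"
)

// ContainerDevices and the annotation key used below are illustrative
// placeholders, not the real types or keys from pkg/plugin/vgpu4pd.
type ContainerDevices struct {
	UUIDs []string
}

// decodeAssignment reads a scheduler-written vGPU assignment from the pod
// annotations. It returns nil when the scheduler's patch never landed on the
// pod, which is the situation described in this issue.
func decodeAssignment(annotations map[string]string, key string) *ContainerDevices {
	raw, ok := annotations[key]
	if !ok || raw == "" {
		return nil
	}
	return &ContainerDevices{UUIDs: []string{raw}}
}

// allocate shows the defensive pattern: return an error to the kubelet
// instead of dereferencing a missing assignment and panicking.
func allocate(annotations map[string]string) (*ContainerDevices, error) {
	devs := decodeAssignment(annotations, "example.com/vgpu-assignment")
	if devs == nil {
		return nil, errors.New("no vGPU assignment found in pod annotations; scheduler patch missing")
	}
	return devs, nil
}

func main() {
	// Simulate the failing case: a pod whose annotations were never patched.
	if _, err := allocate(map[string]string{}); err != nil {
		fmt.Println("Allocate would return:", err)
	}
}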

$ kubectl get pods vgpu-deploy2-758579bd97-7cwbv -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.k8s.io/group-name: podgroup-4731d0a9-c9c8-4f95-b1ac-89918d660385
    volcano.sh/resource-group: cpu
  creationTimestamp: "2023-06-16T09:10:46Z"
  generateName: vgpu-deploy2-758579bd97-
  labels:
    app: vgpu2
    pod-template-hash: 758579bd97
  name: vgpu-deploy2-758579bd97-7cwbv
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: vgpu-deploy2-758579bd97
    uid: 4731d0a9-c9c8-4f95-b1ac-89918d660385
  resourceVersion: "535286"
  uid: 35197671-900b-45d0-8fed-a7732308130e
spec:
  containers:
  - args:
    - "6000"
    image: chrstnhntschl/gpu_burn
    imagePullPolicy: Always
    name: vgpu2
    resources:
      limits:
        volcano.sh/vgpu-memory: "1024"
        volcano.sh/vgpu-number: "1"
      requests:
        volcano.sh/vgpu-memory: "1024"
        volcano.sh/vgpu-number: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-dfc6z
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: 10-29-4-48
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: volcano
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-dfc6z
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  message: Pod Allocate failed due to no containers return in allocation response
    &AllocateResponse{ContainerResponses:[]*ContainerAllocateResponse{},}, which is
    unexpected
  phase: Failed
  reason: UnexpectedAdmissionError
  startTime: "2023-06-16T09:12:13Z"

The deployment content I applied:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vgpu-deploy2
  annotations:
    volcano.sh/resource-group: cpu
  labels:
    app: vgpu2
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vgpu2
  template:
    metadata:
      annotations:
        volcano.sh/resource-group: cpu
      labels:
        app: vgpu2
    spec:
      #schedulerName: volcano
      containers:
      - name: vgpu2
        image: chrstnhntschl/gpu_burn
        args:
          - "6000"
        resources:
          limits:
            volcano.sh/vgpu-number: 1
            volcano.sh/vgpu-memory: 1024
archlitchi commented 1 year ago

Can you run the example correctly in vGPU mode?

lengrongfu commented 1 year ago

I applied the following YAML:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod12
spec:
  schedulerName: volcano
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1 # requesting 1 vGPU
          volcano.sh/vgpu-memory: 2000
          #volcano.sh/vgpu-memory-percentage: 50 # each vGPU gets 50% of that GPU's device memory; cannot be used with nvidia.com/gpumem
    - name: ubuntu-container0
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
    - name: ubuntu-container1
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          volcano.sh/vgpu-number: 1 # requesting 1 vGPU
          volcano.sh/vgpu-memory: 3000
$ kubectl get pods
NAME                          READY   STATUS                     RESTARTS   AGE
gpu-pod12                     0/3     UnexpectedAdmissionError   0          7s
archlitchi commented 1 year ago

That's weird. If the annotation patch step is unsuccessful, it should return an error and fail the scheduling.

lengrongfu commented 1 year ago

Is it possible that the patch request submitted to the kube-apiserver failed to execute?

archlitchi commented 1 year ago

Is it possible that the patch request submitted to the kube-apiserver failed to execute?

can you show me the log of "vgpu-device-plugin" regarding the example you just submitted?

xjxtree commented 1 year ago

Same error for me while using the vGPU plugin. Are there any actions that can help us?

archlitchi commented 1 year ago

Same error for me while using the vGPU plugin. Are there any actions that can help us?

Check whether you have a "nodeName" field in your task spec. If you do, please use a "nodeSelector" instead (a minimal sketch follows below). If you don't, please show me the device-plugin log.
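
(A minimal, hypothetical pod sketch of that substitution; the hostname label value is just an example and must match your node's kubernetes.io/hostname label.)

apiVersion: v1
kind: Pod
metadata:
  name: vgpu-nodeselector-example
spec:
  schedulerName: volcano
  # nodeName: my-gpu-node            # avoid: spec.nodeName bypasses the scheduler, so annotations never get patched
  nodeSelector:
    kubernetes.io/hostname: my-gpu-node   # example label value
  containers:
  - name: cuda
    image: nvidia/cuda:10.1-base-ubuntu18.04
    command: ["sleep", "100000"]
    resources:
      limits:
        volcano.sh/vgpu-number: 1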

xjxtree commented 1 year ago

This is the vcjob YAML:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-dist-mnist
spec:
  minAvailable: 3
  schedulerName: volcano
  plugins:
    env: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  queue: ai
  tasks:
    - replicas: 1
      name: ps
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"ps\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
          restartPolicy: Never
    - replicas: 2
      name: worker
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - sh
                - -c
                - |
                  PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | sed 's/^/"/;s/$/"/' | tr "\n" ","`;
                  export TF_CONFIG={\"cluster\":{\"ps\":[${PS_HOST}],\"worker\":[${WORKER_HOST}]},\"task\":{\"type\":\"worker\",\"index\":${VK_TASK_INDEX}},\"environment\":\"cloud\"};
                  python /var/tf_dist_mnist/dist_mnist.py
              image: volcanosh/dist-mnist-tf-example:0.0.1
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources:
                limits:
                  volcano.sh/vgpu-number: 2
          tolerations:
          - effect: NoSchedule
            key: volcano.sh/gpu
            operator: Exists
          restartPolicy: Never
xjxtree commented 1 year ago

And I don't find GPU memory info in the node describe output:

Capacity:
  cpu:                           8
  ephemeral-storage:             51539404Ki
  hugepages-1Gi:                 0
  hugepages-2Mi:                 0
  memory:                        40823904Ki
  nvidia.com/gpu:                1
  pods:                          95
  tke.cloud.tencent.com/eip:     2
  tke.cloud.tencent.com/eni-ip:  95
  volcano.sh/vgpu-number:        10
Allocatable:
  cpu:                           7800m
  ephemeral-storage:             47498714648
  hugepages-1Gi:                 0
  hugepages-2Mi:                 0
  memory:                        36486240Ki
  nvidia.com/gpu:                1
  pods:                          95
  tke.cloud.tencent.com/eip:     2
  tke.cloud.tencent.com/eni-ip:  6
  volcano.sh/vgpu-number:        10

Here is the stdout log of the node's volcano-device-plugin pod:

I0706 08:05:20.344025       1 main.go:77] Loading NVML
I0706 08:05:20.347260       1 main.go:91] Starting FS watcher.
I0706 08:05:20.347441       1 main.go:98] Starting OS watcher.
I0706 08:05:20.354087       1 main.go:116] Retreiving plugins.
I0706 08:05:20.354152       1 register.go:101] into WatchAndRegister
2023/07/06 08:05:20 Starting GRPC server for 'volcano.sh/vgpu-number'
2023/07/06 08:05:20 Starting to serve 'volcano.sh/vgpu-number' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/07/06 08:05:20 Registered device plugin for 'volcano.sh/vgpu-number' with Kubelet
I0706 08:05:20.371750       1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:05:20.371740679 +0000 UTC m=+0.033841141
I0706 08:05:50.404241       1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:05:50.404232127 +0000 UTC m=+30.066332579
I0706 08:06:20.438079       1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:06:20.438070059 +0000 UTC m=+60.100170526
I0706 08:06:50.469563       1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:06:50.469554195 +0000 UTC m=+90.131654661

When I apply a vcjob that requests vgpu-number, the volcano-device-plugin pod restarts and logs this error:

I0706 08:03:26.139587       1 main.go:77] Loading NVML
I0706 08:03:26.143063       1 main.go:91] Starting FS watcher.
I0706 08:03:26.143258       1 main.go:98] Starting OS watcher.
I0706 08:03:26.149805       1 main.go:116] Retreiving plugins.
I0706 08:03:26.150118       1 register.go:101] into WatchAndRegister
2023/07/06 08:03:26 Starting GRPC server for 'volcano.sh/vgpu-number'
2023/07/06 08:03:26 Starting to serve 'volcano.sh/vgpu-number' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/07/06 08:03:26 Registered device plugin for 'volcano.sh/vgpu-number' with Kubelet
I0706 08:03:26.178483       1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:03:26.178475867 +0000 UTC m=+0.046181881
I0706 08:03:56.214284       1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:03:56.214274187 +0000 UTC m=+30.081980204
I0706 08:04:26.261105       1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:04:26.261095335 +0000 UTC m=+60.128801381
I0706 08:04:56.293621       1 register.go:89] Reporting devices GPU-f0803cfd-7b91-b063-5b9d-50c146197a89,10,32510,NVIDIA-Tesla V100-SXM2-32GB,false: in 2023-07-06 08:04:56.293612407 +0000 UTC m=+90.161318431
I0706 08:05:19.926804       1 plugin.go:309] Allocate [&ContainerAllocateRequest{DevicesIDs:[GPU-f0803cfd-7b91-b063-5b9d-50c146197a89-4 GPU-f0803cfd-7b91-b063-5b9d-50c146197a89-0],}]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1092953]

goroutine 15 [running]:
volcano.sh/k8s-device-plugin/pkg/plugin/vgpu4pd.(*NvidiaDevicePlugin).Allocate(0xc0001420a0, {0x14cfee0, 0xc0003311a0}, 0xc000040be0)
    /go/src/volcano.sh/devices/pkg/plugin/vgpu4pd/plugin.go:326 +0x353
k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1._DevicePlugin_Allocate_Handler({0x12920a0?, 0xc0001420a0}, {0x14cfee0, 0xc0003311a0}, 0xc0002eac60, 0x0)
    /go/pkg/mod/k8s.io/kubelet@v0.18.2/pkg/apis/deviceplugin/v1beta1/api.pb.go:1192 +0x170
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000103040, {0x14d4df8, 0xc00053c000}, 0xc000443200, 0xc000128180, 0x1ce57f8, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:1082 +0xcab
google.golang.org/grpc.(*Server).handleStream(0xc000103040, {0x14d4df8, 0xc00053c000}, 0xc000443200, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:1405 +0xa13
google.golang.org/grpc.(*Server).serveStreams.func1.1()
    /go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:746 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/pkg/mod/google.golang.org/grpc@v1.29.0/server.go:744 +0xea
archlitchi commented 1 year ago

@nabanbaba can you restart the device-plugin and try example.yaml? See if it works.

xjxtree commented 1 year ago

After restarting the plugin DaemonSet and running a pod from the example, I get the same error. Pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  tolerations:
  - effect: NoSchedule
    key: volcano.sh/gpu
    operator: Exists
  containers:
  - image: nvidia/cuda:10.1-base-ubuntu18.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/vgpu-number: 1
xjxtree commented 1 year ago

And ... the device plugin does not report a volcano.sh/vgpu-memory resource; the node info does not have this resource.

archlitchi commented 1 year ago

That is normal: gpumem is more of a parameter for "vgpu" than a standalone device resource, and it should be ignored by the scheduler as a node-level resource.
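
(To confirm this on a node, only volcano.sh/vgpu-number is expected to appear under Capacity/Allocatable, as in the listing above. Generic kubectl, with a placeholder node name:)

kubectl describe node <node-name> | grep volcano.sh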

archlitchi commented 1 year ago

After restarting the plugin DaemonSet and running a pod from the example, I get the same error. Pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: pod1
spec:
  restartPolicy: OnFailure
  schedulerName: volcano
  tolerations:
  - effect: NoSchedule
    key: volcano.sh/gpu
    operator: Exists
  containers:
  - image: nvidia/cuda:10.1-base-ubuntu18.04
    name: pod1-ctr
    command: ["sleep"]
    args: ["100000"]
    resources:
      limits:
        volcano.sh/vgpu-number: 1

Please describe this pod and check whether the annotations were properly modified by the scheduler (for example, with the command below).
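
(For example, printing only the annotations with standard kubectl; adjust the namespace to wherever the pod runs:)

kubectl get pod pod1 -n volcano-system -o jsonpath='{.metadata.annotations}'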

xjxtree commented 1 year ago

describe info:

+ kubectl describe pods pod1
Name:             pod1
Namespace:        volcano-system
Priority:         0
Service Account:  default
Node:             10.122.2.3/
Start Time:       Thu, 06 Jul 2023 16:59:05 +0800
Labels:           <none>
Annotations:      scheduling.k8s.io/group-name: podgroup-623f0e8a-364a-4dd1-8da3-0c0e9ef55239
Status:           Failed
Reason:           UnexpectedAdmissionError
Message:          Pod was rejected: Allocate failed due to rpc error: code = Unavailable desc = error reading from server: EOF, which is unexpected
IP:
IPs:              <none>
Containers:
  pod1-ctr:
    Image:      nvidia/cuda:10.1-base-ubuntu18.04
    Port:       <none>
    Host Port:  <none>
    Command:
      sleep
    Args:
      100000
    Limits:
      tke.cloud.tencent.com/eni-ip:  1
      volcano.sh/vgpu-number:        2
    Requests:
      tke.cloud.tencent.com/eni-ip:  1
      volcano.sh/vgpu-number:        2
    Environment:                     <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r89v6 (ro)
Volumes:
  kube-api-access-r89v6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             volcano.sh/gpu:NoSchedule op=Exists
Events:
  Type     Reason                    Age   From     Message
  ----     ------                    ----  ----     -------
  Normal   Scheduled                 6s    volcano  Successfully assigned volcano-system/pod1 to 10.122.2.3
  Warning  UnexpectedAdmissionError  6s    kubelet  Allocate failed due to rpc error: code = Unavailable desc = error reading from server: EOF, which is unexpected
xjxtree commented 1 year ago

So, can you advise me: should I go back to the old version of the GPU-sharing plugin?

archlitchi commented 1 year ago

So, can you advise me: should I go back to the old version of the GPU-sharing plugin?

Can you add my wechat "xuanzong4493" for further inspection?

xjxtree commented 1 year ago

Invitation sent.

archlitchi commented 1 year ago

Solved. You need to use the vc-scheduler:latest image instead of vc-scheduler:1.7.0, because vGPU is a new feature in v1.8 and is not included in v1.7.
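
(One way to switch the image, assuming the default volcano-system deployment and container names from a standard install; verify against your own deployment or Helm values:)

kubectl -n volcano-system set image deployment/volcano-scheduler \
  volcano-scheduler=volcanosh/vc-scheduler:latest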