tkestack / vcuda-controller

nvprof hangs in vcuda #8

Closed: openqt closed this issue 4 years ago

openqt commented 4 years ago

nvprof hangs when run under vcuda; the log loops as follows:

/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit
/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex
/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses
/tmp/cuda-control/src/hijack_call.c:380 pid: 14777
/tmp/cuda-control/src/hijack_call.c:380 pid: 14801
/tmp/cuda-control/src/hijack_call.c:380 pid: 14817
/tmp/cuda-control/src/hijack_call.c:385 read 3 items from /etc/vcuda/5d522d78e5429b9d305a4fbab92e203a6dd777f1dd985f30309ca907c031be5c/pids.config
/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization
/tmp/cuda-control/src/hijack_call.c:348 try to find 1920151404 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 1601205536 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 980579683 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 0 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 536865808 from pid tables
/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown
/tmp/cuda-control/src/hijack_call.c:151 delta: 2228224, curr: 2228224
/tmp/cuda-control/src/hijack_call.c:277 util: 0, up_limit: 90,  share: 2228224, cur: 2228224
/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit
/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex
/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses
/tmp/cuda-control/src/hijack_call.c:380 pid: 14777
/tmp/cuda-control/src/hijack_call.c:380 pid: 14801
/tmp/cuda-control/src/hijack_call.c:380 pid: 14817

The Dockerfile is shown below; I build the image as tensorflow/tf-mul:nvprof.

FROM tensorflow/tensorflow:latest-gpu

ADD mul.py /mul.py
ENTRYPOINT ["nvprof", "python", "/mul.py"]
CMD []
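
mul.py itself is not included in the issue. A minimal, hypothetical stand-in (only to make the reproduction self-contained) would be a loop of GPU matrix multiplications, for example:

# Hypothetical stand-in for mul.py (the original script is not shown in the issue).
# It repeatedly multiplies two matrices so the container generates sustained GPU load.
import tensorflow as tf

a = tf.random.normal([4096, 4096])
b = tf.random.normal([4096, 4096])
result = None
for _ in range(1000):
    result = tf.matmul(a, b)
print(float(tf.reduce_sum(result)))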

The pod YAML is:

apiVersion: v1
kind: Pod
metadata:
  name: vcudal-prof
  labels:
    gpu-model: 2080t
spec:
  restartPolicy: Never
  enableServiceLinks: false
  containers:
  - name: test1
    image: tensorflow/tf-mul:nvprof
    securityContext:
      privileged: true
    env:
    - name: LOGGER_LEVEL
      value: "10"
    resources:
      limits:
        tencent.com/vcuda-core: 90
        tencent.com/vcuda-memory: 8

If I change to tencent.com/vcuda-core: 100, the mode is exclusive and everything is OK. Any help is welcome!
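
For reference, the working variant only changes the resources block of the same manifest (assuming vcuda-core is expressed in hundredths of a card, so 100 requests a whole GPU):

    resources:
      limits:
        tencent.com/vcuda-core: 100   # whole card; per the report above this runs exclusively and nvprof works
        tencent.com/vcuda-memory: 8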

mYmNeo commented 4 years ago

nvprof is not supported in this mode.