nvprof on vcuda hangs; the log loops as follows:
```
/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit
/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex
/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses
/tmp/cuda-control/src/hijack_call.c:380 pid: 14777
/tmp/cuda-control/src/hijack_call.c:380 pid: 14801
/tmp/cuda-control/src/hijack_call.c:380 pid: 14817
/tmp/cuda-control/src/hijack_call.c:385 read 3 items from /etc/vcuda/5d522d78e5429b9d305a4fbab92e203a6dd777f1dd985f30309ca907c031be5c/pids.config
/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization
/tmp/cuda-control/src/hijack_call.c:348 try to find 1920151404 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 1601205536 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 980579683 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 0 from pid tables
/tmp/cuda-control/src/hijack_call.c:348 try to find 536865808 from pid tables
/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown
/tmp/cuda-control/src/hijack_call.c:151 delta: 2228224, curr: 2228224
/tmp/cuda-control/src/hijack_call.c:277 util: 0, up_limit: 90, share: 2228224, cur: 2228224
/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit
/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex
/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses
/tmp/cuda-control/src/hijack_call.c:380 pid: 14777
/tmp/cuda-control/src/hijack_call.c:380 pid: 14801
/tmp/cuda-control/src/hijack_call.c:380 pid: 14817
```
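For context on these lines: the vcuda-controller library interposes on NVML so it can attribute and throttle per-container GPU usage, which is what the `Hijacking ...` messages record. Below is a minimal sketch of that interposition pattern, not the actual hijack_call.c code (the real code also filters the returned samples against the PIDs read from pids.config, and the project replaces the library on disk rather than using LD_PRELOAD); the NVML types and the wrapped function come from nvml.h, while the `load_nvml_fn` helper and the log text are illustrative only:

```c
// Sketch of an NVML interposer, assuming the wrapper library is loaded
// ahead of the real libnvidia-ml (e.g. via LD_PRELOAD). Illustrative only.
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <nvml.h>

// Hypothetical helper: resolve the real NVML symbol hidden behind ours.
static void *load_nvml_fn(const char *name) {
    return dlsym(RTLD_NEXT, name);
}

nvmlReturn_t nvmlDeviceGetProcessUtilization(
        nvmlDevice_t device, nvmlProcessUtilizationSample_t *samples,
        unsigned int *count, unsigned long long last_seen) {
    typedef nvmlReturn_t (*fn_t)(nvmlDevice_t,
                                 nvmlProcessUtilizationSample_t *,
                                 unsigned int *, unsigned long long);
    static fn_t real = NULL;
    if (!real)
        real = (fn_t)load_nvml_fn("nvmlDeviceGetProcessUtilization");
    fprintf(stderr, "Hijacking nvmlDeviceGetProcessUtilization\n");
    // Forward to the driver; the real hijack then sums utilization only
    // for PIDs listed in /etc/vcuda/<id>/pids.config, which is where the
    // "try to find ... from pid tables" lines above come from.
    return real(device, samples, count, last_seen);
}
```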
The command is in the Dockerfile, and I build the image as `tensorflow/tf-mul:nvprof`:
```dockerfile
FROM tensorflow/tensorflow:latest-gpu
ADD mul.py /mul.py
ENTRYPOINT ["nvprof", "python", "/mul.py"]
CMD []
```
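For reference, a build invocation consistent with that tag would look like this (assuming the Dockerfile and mul.py sit in the current directory):

```sh
# Build the image referenced by the pod spec below.
docker build -t tensorflow/tf-mul:nvprof .
```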
The YAML is:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vcudal-prof
  labels:
    gpu-model: 2080t
spec:
  restartPolicy: Never
  enableServiceLinks: false
  containers:
    - name: test1
      image: tensorflow/tf-mul:nvprof
      securityContext:
        privileged: true
      env:
        - name: LOGGER_LEVEL
          value: "10"
      resources:
        limits:
          tencent.com/vcuda-core: 90
          tencent.com/vcuda-memory: 8
```
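To reproduce, the manifest can be applied and the container log followed (the filename is assumed; the pod and container names come from the spec above):

```sh
kubectl apply -f vcuda-prof.yaml
kubectl logs -f vcudal-prof -c test1
```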
If I change to `tencent.com/vcuda-core: 100`, the mode is exclusive and everything works. Any help is welcome!
`nvprof` is not supported in this mode.
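For anyone landing here: per the report above, requesting the card exclusively avoids the hang. Assuming the same pod spec, only the `resources` block changes:

```yaml
resources:
  limits:
    tencent.com/vcuda-core: 100   # per the report, 100 switches to exclusive mode
    tencent.com/vcuda-memory: 8
```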