tkestack / gpu-manager


TensorFlow program hangs when I use a fractional GPU resource #35

Closed pokerfaceSad closed 4 years ago

pokerfaceSad commented 4 years ago

pod.yaml

Here is my .yaml file for creating the pod:

apiVersion: v1
kind: Pod
metadata:
  name: tf-vcuda-pod
spec:
  restartPolicy: Never
  hostNetwork: true
  containers:
  - image: tensorflow/tensorflow:1.13.1-gpu-py3
    name: tensorflow-vcuda-test
    command: ["/bin/bash", "-ce", "tail -f /dev/null"]
    volumeMounts:
          - mountPath: /home/gpu
            name: tf-code
    resources:
      requests:
        # per the gpu-manager README, vcuda-core is in hundredths of a card and
        # vcuda-memory in 256MiB units, so this requests ~90% of one GPU and
        # 30 x 256MiB (about 7.5GiB) of device memory
        tencent.com/vcuda-core: 90
        tencent.com/vcuda-memory: 30
      limits:
        tencent.com/vcuda-core: 90
        tencent.com/vcuda-memory: 30

training code

Here is my TensorFlow training code (CNN_TensorFlow.py), just a simple two-layer network:

import tensorflow as tf
from numpy.random import RandomState

batch_size = 8

# two-layer network: 2 inputs -> 3 hidden units -> 1 output, fixed seed for reproducibility
w1 = tf.Variable(tf.random_normal([2,3],stddev=1,seed=1))
w2 = tf.Variable(tf.random_normal([3,1],stddev=1,seed=1))

x = tf.placeholder(tf.float32,shape=(None,2),name='x-input')
y_ = tf.placeholder(tf.float32,shape=(None,1),name='y-input')

a = tf.matmul(x,w1)
y = tf.matmul(a,w2)

# cross-entropy loss, clipped to avoid log(0), minimized with Adam
cross_entropy = -tf.reduce_mean(y_ * tf.log(tf.clip_by_value(y,1e-10,1.0)))
train_step = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)

# synthetic dataset: the label is 1 when x1 + x2 < 1
rdm = RandomState(1)
dataset_size = 128000
X = rdm.rand(dataset_size,2)
Y = [[int(x1+x2 < 1)] for (x1,x2) in X]

with tf.Session() as sess:
    init_op = tf.global_variables_initializer()
    sess.run(init_op)

    print(sess.run(w1))
    print(sess.run(w2))

    STEPS = 900000
    for i in range(STEPS):
        start = (i * batch_size) % dataset_size
        end = min(start+batch_size,dataset_size)

        sess.run(train_step,feed_dict={x:X[start:end],y_:Y[start:end]})

        if i%1000 == 0:
            total_cross_entropy = sess.run(cross_entropy,feed_dict={x:X,y_:Y})
            print("After %d training step(s),cross entropy on all data is %g" % (i,total_cross_entropy))

    print(sess.run(w1))
    print(sess.run(w2))

problem

The program hangs after printing Created TensorFlow device.

$ python CNN_TensorFlow.py 
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-08 02:28:04.238654: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-08 02:28:04.416921: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x41c9670 executing computations on platform CUDA. Devices:
2020-07-08 02:28:04.416985: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-07-08 02:28:04.422450: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599995000 Hz
2020-07-08 02:28:04.427351: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x431e600 executing computations on platform Host. Devices:
2020-07-08 02:28:04.427406: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-08 02:28:04.438859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:05:00.0
totalMemory: 11.75GiB freeMemory: 11.69GiB
2020-07-08 02:28:04.438910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-08 02:28:04.443735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-08 02:28:04.443779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-07-08 02:28:04.443794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-07-08 02:28:04.452388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11376 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)

log

Here is the log; it repeatedly prints Hijacking nvml... messages.

python CNN_TensorFlow.py 
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-08 02:31:11.106661: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
/tmp/cuda-control/src/loader.c:941 config file: /etc/vcuda/a2371bd1e50fd3d2fe175bdeeb21df2727149cf29d4ed12f4fd3fc737fb7f163/vcuda.config
/tmp/cuda-control/src/loader.c:942 pid file: /etc/vcuda/a2371bd1e50fd3d2fe175bdeeb21df2727149cf29d4ed12f4fd3fc737fb7f163/pids.config
/tmp/cuda-control/src/loader.c:946 register to remote: pod uid: 24993e70-c0c2-11ea-97bf-40167e346bb0, cont id: a2371bd1e50fd3d2fe175bdeeb21df2727149cf29d4ed12f4fd3fc737fb7f163
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
/tmp/cuda-control/src/loader.c:767 Start hijacking
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuEGLInit
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuDeviceGetNvSciSyncAttributes
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecHostNodeSetParams
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecMemcpyNodeSetParams
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecMemsetNodeSetParams
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecUpdate
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemAddressFree
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemAddressReserve
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemCreate
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemExportToShareableHandle
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemGetAccess
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemGetAllocationGranularity
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemGetAllocationPropertiesFromHandle
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemImportFromShareableHandle
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemMap
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemRelease
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemSetAccess
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemUnmap
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlDeviceGetGridLicensableFeatures_v3
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlDeviceGetHostVgpuMode
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlDeviceGetPgpuMetadataString
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlVgpuInstanceGetEccMode
/tmp/cuda-control/src/hijack_call.c:500 total cuda cores: 851968
/tmp/cuda-control/src/hijack_call.c:217 start utilization_watcher
/tmp/cuda-control/src/hijack_call.c:218 sm: 13, thread per sm: 2048
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
2020-07-08 02:31:11.291318: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4f81f30 executing computations on platform CUDA. Devices:
2020-07-08 02:31:11.291408: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-07-08 02:31:11.296666: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599995000 Hz
2020-07-08 02:31:11.302402: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x50d6ed0 executing computations on platform Host. Devices:
2020-07-08 02:31:11.302447: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
/tmp/cuda-control/src/hijack_call.c:399 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:402 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:412 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:425 summary: 27275 used 58982400
/tmp/cuda-control/src/hijack_call.c:432 27275 use memory: 58982400
/tmp/cuda-control/src/hijack_call.c:437 total used memory: 58982400
/tmp/cuda-control/src/hijack_call.c:440 Hijacking nvmlShutdown

2020-07-08 02:31:11.313704: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:05:00.0
totalMemory: 11.75GiB freeMemory: 11.69GiB
2020-07-08 02:31:11.313754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
2020-07-08 02:31:11.315833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-08 02:31:11.315868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-07-08 02:31:11.315882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
/tmp/cuda-control/src/hijack_call.c:399 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:402 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:412 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:425 summary: 27275 used 58982400
/tmp/cuda-control/src/hijack_call.c:432 27275 use memory: 58982400
/tmp/cuda-control/src/hijack_call.c:437 total used memory: 58982400
/tmp/cuda-control/src/hijack_call.c:440 Hijacking nvmlShutdown

/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization

/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown

2020-07-08 02:31:11.324331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11376 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization

/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown

/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization

/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown
pokerfaceSad commented 4 years ago

I find it hangs as soon as tf.Session() is called.
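
In other words, a stripped-down script like the sketch below (illustrative only; none of the training code is needed) should already hit the hang:

import tensorflow as tf

sess = tf.Session()       # blocks here on the affected node
print("session created")  # never reached when the hang occurs
sess.close()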

mYmNeo commented 4 years ago

Can you provide the process stack, /proc/<pid>/stack?

pokerfaceSad commented 4 years ago

/proc/<pid>/stack

Here it is

$ sudo cat /proc/19384/stack
[<ffffffff81104de2>] futex_wait_queue_me+0xc2/0x120
[<ffffffff81105936>] futex_wait+0x116/0x280
[<ffffffff81107dd6>] do_futex+0x126/0x540
[<ffffffff81108271>] SyS_futex+0x81/0x180
[<ffffffff8185328e>] entry_SYSCALL_64_fastpath+0x22/0xc1
[<ffffffffffffffff>] 0xffffffffffffffff
mYmNeo commented 4 years ago

What's the version of your gpu-manager? Did you run your program in non-root mode?

pokerfaceSad commented 4 years ago

What's the version of your gpu-manager? Did you run your program in non-root mode?

  1. The gpu-manager-daemonset image is tkestack/gpu-manager:1.1.0.
  2. I use the tensorflow/tensorflow:1.13.1-gpu-py3 image to run the TF code; it runs as root by default. Does that matter?
mYmNeo commented 4 years ago

What's the version of your gpu-manager? Did you run your program in non-root mode?

  1. The gpu-manager-daemonset image is tkestack/gpu-manager:1.1.0.
  2. I use the tensorflow/tensorflow:1.13.1-gpu-py3 image to run the TF code; it runs as root by default. Does that matter?

It's weird. I've run your example Python script and it works.

Can you provide this information?

  • Container runtime
  • ps aux|grep "gpu-"
  • gpu-manager.yaml

pokerfaceSad commented 4 years ago

What's the version of your gpu-manager? Did you run your program in non-root mode?

  1. The gpu-manager-daemonset image is tkestack/gpu-manager:1.1.0.
  2. I use the tensorflow/tensorflow:1.13.1-gpu-py3 image to run the TF code; it runs as root by default. Does that matter?

It's weird. I've run your example Python script and it works.

Can you provide this information?

  • Container runtime
  • ps aux|grep "gpu-"
  • gpu-manager.yaml
pokerfaceSad commented 4 years ago

@mYmNeo I tried the same config on another host and it works well. The difference is the GPU type and GPU driver version: the Tesla K80 host (Driver Version 418.39, CUDA Version 10.1) doesn't work (though it works fine without vcuda-controller), while the Tesla V100 host (Driver Version 418.87.01, CUDA Version 10.1) works well.

I'm not sure whether the GPU type or the driver version causes this problem.

pokerfaceSad commented 4 years ago

@mYmNeo I think it is stuck in this loop because it can't get the correct GPU utilization for some reason.
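
To make that concrete, here is a rough Python sketch of the kind of feedback loop I mean. This is not the actual hijack_call.c code, just an illustration of how a utilization query that always fails can leave every kernel launch waiting forever:

import threading
import time

tokens = 0
cond = threading.Condition()

def query_process_utilization():
    # stand-in for the per-process utilization query, assumed to always fail
    # on this card
    raise RuntimeError("utilization query not supported")

def utilization_watcher():
    global tokens
    while True:
        try:
            util = query_process_utilization()
        except RuntimeError:
            time.sleep(0.1)               # failed sample: no budget is granted
            continue
        with cond:
            tokens += max(0, 90 - util)   # top up toward the 90% quota
            cond.notify_all()

def launch_kernel():
    global tokens
    with cond:
        while tokens <= 0:                # never satisfied if sampling keeps failing
            cond.wait()
        tokens -= 1

threading.Thread(target=utilization_watcher, daemon=True).start()
launch_kernel()                           # blocks forever, like the futex_wait stack above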

mYmNeo commented 4 years ago

@mYmNeo I tried the same config on another host and it works well. The difference is the GPU type and GPU driver version: the Tesla K80 host (Driver Version 418.39, CUDA Version 10.1) doesn't work (though it works fine without vcuda-controller), while the Tesla V100 host (Driver Version 418.87.01, CUDA Version 10.1) works well.

I'm not sure whether the GPU type or the driver version causes this problem.

Sorry, I didn't notice that your card is a K-series card. The driver API that NVIDIA provides for utilization doesn't work on K-series cards. I'll add a note to the README to mention this scenario.
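
If anyone wants to verify this on a given node, a quick check with pynvml (assuming an nvidia-ml-py build recent enough to expose nvmlDeviceGetProcessUtilization) looks like the following; on cards where the driver doesn't support the query, the call raises an NVML error instead of returning samples:

import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        # 0 = lastSeenTimeStamp, i.e. return all buffered samples
        samples = pynvml.nvmlDeviceGetProcessUtilization(handle, 0)
        print("per-process utilization samples:", samples)
    except pynvml.NVMLError as err:
        print("per-process utilization not available:", err)
finally:
    pynvml.nvmlShutdown()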

pokerfaceSad commented 4 years ago

@mYmNeo I tried the same config on another host and it works well. The difference is the GPU type and GPU driver version: the Tesla K80 host (Driver Version 418.39, CUDA Version 10.1) doesn't work (though it works fine without vcuda-controller), while the Tesla V100 host (Driver Version 418.87.01, CUDA Version 10.1) works well. I'm not sure whether the GPU type or the driver version causes this problem.

Sorry, I didn't notice that your card is a K-series card. The driver API that NVIDIA provides for utilization doesn't work on K-series cards. I'll add a note to the README to mention this scenario.

Great, thanks for your replies over these two weeks; we finally found the cause of the problem 🤣🤣. By the way, do gpu-manager and vcuda-controller accept code contributions from third-party developers? I think they are good projects; maybe I can do something.

mYmNeo commented 4 years ago

@mYmNeo I tried the same config on another host and it works well. The difference is the GPU type and GPU driver version: the Tesla K80 host (Driver Version 418.39, CUDA Version 10.1) doesn't work (though it works fine without vcuda-controller), while the Tesla V100 host (Driver Version 418.87.01, CUDA Version 10.1) works well. I'm not sure whether the GPU type or the driver version causes this problem.

Sorry, I didn't notice that your card is a K-series card. The driver API that NVIDIA provides for utilization doesn't work on K-series cards. I'll add a note to the README to mention this scenario.

Great, thanks for your replies over these two weeks; we finally found the cause of the problem 🤣🤣. By the way, do gpu-manager and vcuda-controller accept code contributions from third-party developers? I think they are good projects; maybe I can do something.

Definitely yes. Any interesting ideas and PRs are welcome.

pokerfaceSad commented 4 years ago

@mYmNeo I tried the same config on another host and it works well. The difference is the GPU type and GPU driver version: the Tesla K80 host (Driver Version 418.39, CUDA Version 10.1) doesn't work (though it works fine without vcuda-controller), while the Tesla V100 host (Driver Version 418.87.01, CUDA Version 10.1) works well. I'm not sure whether the GPU type or the driver version causes this problem.

Sorry, I didn't notice that your card is a K-series card. The driver API that NVIDIA provides for utilization doesn't work on K-series cards. I'll add a note to the README to mention this scenario.

Great, thanks for your replies over these two weeks; we finally found the cause of the problem 🤣🤣. By the way, do gpu-manager and vcuda-controller accept code contributions from third-party developers? I think they are good projects; maybe I can do something.

Definitely yes. Any interesting ideas and PRs are welcome.

Great, I will close this issue.