tkestack / gpu-manager


TensorFlow program hangs when I use a fractional GPU resource #35

Closed pokerfaceSad closed 4 years ago

pokerfaceSad commented 4 years ago

pod.yaml

Here is my .yaml file for creating the pod:

apiVersion: v1
kind: Pod
metadata:
  name: tf-vcuda-pod
spec:
  restartPolicy: Never
  hostNetwork: true
  containers:
  - image: tensorflow/tensorflow:1.13.1-gpu-py3
    name: tensorflow-vcuda-test
    command: ["/bin/bash", "-ce", "tail -f /dev/null"]
    volumeMounts:
          - mountPath: /home/gpu
            name: tf-code
    resources:
      requests:
        # per the gpu-manager README, vcuda-core is in hundredths of a card and
        # vcuda-memory in 256MiB units, so this requests ~90% of one GPU and
        # 30 x 256MiB (about 7.5GiB) of device memory
        tencent.com/vcuda-core: 90
        tencent.com/vcuda-memory: 30
      limits:
        tencent.com/vcuda-core: 90
        tencent.com/vcuda-memory: 30

training code

Here is my TensorFlow training code (CNN_TensorFlow.py), just a simple two-layer network:

import tensorflow as tf
from numpy.random import RandomState

batch_size = 8

# two-layer network: 2 inputs -> 3 hidden units -> 1 output, fixed seed for reproducibility
w1 = tf.Variable(tf.random_normal([2,3],stddev=1,seed=1))
w2 = tf.Variable(tf.random_normal([3,1],stddev=1,seed=1))

x = tf.placeholder(tf.float32,shape=(None,2),name='x-input')
y_ = tf.placeholder(tf.float32,shape=(None,1),name='y-input')

a = tf.matmul(x,w1)
y = tf.matmul(a,w2)

# cross-entropy loss, clipped to avoid log(0), minimized with Adam
cross_entropy = -tf.reduce_mean(y_ * tf.log(tf.clip_by_value(y,1e-10,1.0)))
train_step = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)

# synthetic dataset: the label is 1 when x1 + x2 < 1
rdm = RandomState(1)
dataset_size = 128000
X = rdm.rand(dataset_size,2)
Y = [[int(x1+x2 < 1)] for (x1,x2) in X]

with tf.Session() as sess:
    init_op = tf.global_variables_initializer()
    sess.run(init_op)

    print(sess.run(w1))
    print(sess.run(w2))

    STEPS = 900000
    for i in range(STEPS):
        start = (i * batch_size) % dataset_size
        end = min(start+batch_size,dataset_size)

        sess.run(train_step,feed_dict={x:X[start:end],y_:Y[start:end]})

        if i%1000 == 0:
            total_cross_entropy = sess.run(cross_entropy,feed_dict={x:X,y_:Y})
            print("After %d training step(s),cross entropy on all data is %g" % (i,total_cross_entropy))

    print(sess.run(w1))
    print(sess.run(w2))

problem

The program hangs after printing Created TensorFlow device.

$ python CNN_TensorFlow.py 
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-08 02:28:04.238654: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-08 02:28:04.416921: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x41c9670 executing computations on platform CUDA. Devices:
2020-07-08 02:28:04.416985: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-07-08 02:28:04.422450: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599995000 Hz
2020-07-08 02:28:04.427351: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x431e600 executing computations on platform Host. Devices:
2020-07-08 02:28:04.427406: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-08 02:28:04.438859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:05:00.0
totalMemory: 11.75GiB freeMemory: 11.69GiB
2020-07-08 02:28:04.438910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-08 02:28:04.443735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-08 02:28:04.443779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-07-08 02:28:04.443794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2020-07-08 02:28:04.452388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11376 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)

log

Here is the log; it repeatedly prints Hijacking nvml... messages.

python CNN_TensorFlow.py 
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-08 02:31:11.106661: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
/tmp/cuda-control/src/loader.c:941 config file: /etc/vcuda/a2371bd1e50fd3d2fe175bdeeb21df2727149cf29d4ed12f4fd3fc737fb7f163/vcuda.config
/tmp/cuda-control/src/loader.c:942 pid file: /etc/vcuda/a2371bd1e50fd3d2fe175bdeeb21df2727149cf29d4ed12f4fd3fc737fb7f163/pids.config
/tmp/cuda-control/src/loader.c:946 register to remote: pod uid: 24993e70-c0c2-11ea-97bf-40167e346bb0, cont id: a2371bd1e50fd3d2fe175bdeeb21df2727149cf29d4ed12f4fd3fc737fb7f163
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
/tmp/cuda-control/src/loader.c:767 Start hijacking
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuEGLInit
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuDeviceGetNvSciSyncAttributes
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecHostNodeSetParams
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecMemcpyNodeSetParams
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecMemsetNodeSetParams
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuGraphExecUpdate
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemAddressFree
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemAddressReserve
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemCreate
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemExportToShareableHandle
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemGetAccess
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemGetAllocationGranularity
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemGetAllocationPropertiesFromHandle
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemImportFromShareableHandle
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemMap
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemRelease
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemSetAccess
/tmp/cuda-control/src/loader.c:783 can't find function libcuda.so.418.39 in cuMemUnmap
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlDeviceGetGridLicensableFeatures_v3
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlDeviceGetHostVgpuMode
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlDeviceGetPgpuMetadataString
/tmp/cuda-control/src/loader.c:733 can't find function libnvidia-ml.so.418.39 in nvmlVgpuInstanceGetEccMode
/tmp/cuda-control/src/hijack_call.c:500 total cuda cores: 851968
/tmp/cuda-control/src/hijack_call.c:217 start utilization_watcher
/tmp/cuda-control/src/hijack_call.c:218 sm: 13, thread per sm: 2048
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
2020-07-08 02:31:11.291318: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4f81f30 executing computations on platform CUDA. Devices:
2020-07-08 02:31:11.291408: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-07-08 02:31:11.296666: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2599995000 Hz
2020-07-08 02:31:11.302402: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x50d6ed0 executing computations on platform Host. Devices:
2020-07-08 02:31:11.302447: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
/tmp/cuda-control/src/hijack_call.c:399 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:402 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:412 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:425 summary: 27275 used 58982400
/tmp/cuda-control/src/hijack_call.c:432 27275 use memory: 58982400
/tmp/cuda-control/src/hijack_call.c:437 total used memory: 58982400
/tmp/cuda-control/src/hijack_call.c:440 Hijacking nvmlShutdown

2020-07-08 02:31:11.313704: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:05:00.0
totalMemory: 11.75GiB freeMemory: 11.69GiB
2020-07-08 02:31:11.313754: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
/tmp/cuda-control/src/loader.c:1044 pod uid          : 24993e70-c0c2-11ea-97bf-40167e346bb0
/tmp/cuda-control/src/loader.c:1045 limit            : 0
/tmp/cuda-control/src/loader.c:1046 container name   : tensorflow-vcuda-test
/tmp/cuda-control/src/loader.c:1047 total utilization: 90
/tmp/cuda-control/src/loader.c:1048 total gpu memory : 12616466432
/tmp/cuda-control/src/loader.c:1049 driver version   : 418.39
/tmp/cuda-control/src/loader.c:1050 hard limit mode  : 1
/tmp/cuda-control/src/loader.c:1051 enable mode      : 1
2020-07-08 02:31:11.315833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-08 02:31:11.315868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2020-07-08 02:31:11.315882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
/tmp/cuda-control/src/hijack_call.c:399 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:402 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:412 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:425 summary: 27275 used 58982400
/tmp/cuda-control/src/hijack_call.c:432 27275 use memory: 58982400
/tmp/cuda-control/src/hijack_call.c:437 total used memory: 58982400
/tmp/cuda-control/src/hijack_call.c:440 Hijacking nvmlShutdown

/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization

/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown

2020-07-08 02:31:11.324331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11376 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:05:00.0, compute capability: 3.7)
/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization

/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown

/tmp/cuda-control/src/hijack_call.c:307 Hijacking nvmlInit

/tmp/cuda-control/src/hijack_call.c:310 Hijacking nvmlDeviceGetHandleByIndex

/tmp/cuda-control/src/hijack_call.c:318 Hijacking nvmlDeviceGetComputeRunningProcesses

/tmp/cuda-control/src/hijack_call.c:331 Hijacking nvmlDeviceGetProcessUtilization

/tmp/cuda-control/src/hijack_call.c:360 sys utilization: 0
/tmp/cuda-control/src/hijack_call.c:361 used utilization: 0
/tmp/cuda-control/src/hijack_call.c:364 Hijacking nvmlShutdown
pokerfaceSad commented 4 years ago

I find it hangs as soon as tf.Session() is called.
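
In other words, a stripped-down script like the sketch below (illustrative only; none of the training code is needed) should already hit the hang:

import tensorflow as tf

sess = tf.Session()       # blocks here on the affected node
print("session created")  # never reached when the hang occurs
sess.close()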

mYmNeo commented 4 years ago

Can you provide the process stack, /proc/<pid>/stack?

pokerfaceSad commented 4 years ago

/proc/<pid>/stack

Here it is

$ sudo cat /proc/19384/stack
[<ffffffff81104de2>] futex_wait_queue_me+0xc2/0x120
[<ffffffff81105936>] futex_wait+0x116/0x280
[<ffffffff81107dd6>] do_futex+0x126/0x540
[<ffffffff81108271>] SyS_futex+0x81/0x180
[<ffffffff8185328e>] entry_SYSCALL_64_fastpath+0x22/0xc1
[<ffffffffffffffff>] 0xffffffffffffffff
mYmNeo commented 4 years ago

What's the version of your gpu-manager? Did you run your program in non-root mode?

pokerfaceSad commented 4 years ago

What's the version of your gpu-manager? Did you run your program in non-root mode?

  1. The gpu-manager-daemonset image is tkestack/gpu-manager:1.1.0.
  2. I use the tensorflow/tensorflow:1.13.1-gpu-py3 image to run the TF code; it runs as root by default. Does that matter?
mYmNeo commented 4 years ago

What's the version of your gpu-manager? Did you run your program in non-root mode?

  1. The gpu-manager-daemonset image is tkestack/gpu-manager:1.1.0.
  2. I use the tensorflow/tensorflow:1.13.1-gpu-py3 image to run the TF code; it runs as root by default. Does that matter?

It's weird. I've run your example Python script and it works.

Can you provide this information?

  • Container runtime
  • ps aux|grep "gpu-"
  • gpu-manager.yaml

pokerfaceSad commented 4 years ago

What's the version of your gpu-manager? Did you run your program in non-root mode?

  1. The gpu-manager-daemonset image is tkestack/gpu-manager:1.1.0.
  2. I use the tensorflow/tensorflow:1.13.1-gpu-py3 image to run the TF code; it runs as root by default. Does that matter?

It's weird. I've run your example Python script and it works.

Can you provide this information?

  • Container runtime
  • ps aux|grep "gpu-"
  • gpu-manager.yaml
pokerfaceSad commented 4 years ago

@mYmNeo I tried the same config on another host and it works well. The difference is the GPU type and GPU driver version: the Tesla K80 host (Driver Version 418.39, CUDA Version 10.1) doesn't work (though it works fine without vcuda-controller), while the Tesla V100 host (Driver Version 418.87.01, CUDA Version 10.1) works well.

I'm not sure whether the GPU type or the driver version causes this problem.

pokerfaceSad commented 4 years ago

@mYmNeo I think it is stuck in this loop because it can't get the correct GPU utilization for some reason.
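
To make that concrete, here is a rough Python sketch of the kind of feedback loop I mean. This is not the actual hijack_call.c code, just an illustration of how a utilization query that always fails can leave every kernel launch waiting forever:

import threading
import time

tokens = 0
cond = threading.Condition()

def query_process_utilization():
    # stand-in for the per-process utilization query, assumed to always fail
    # on this card
    raise RuntimeError("utilization query not supported")

def utilization_watcher():
    global tokens
    while True:
        try:
            util = query_process_utilization()
        except RuntimeError:
            time.sleep(0.1)               # failed sample: no budget is granted
            continue
        with cond:
            tokens += max(0, 90 - util)   # top up toward the 90% quota
            cond.notify_all()

def launch_kernel():
    global tokens
    with cond:
        while tokens <= 0:                # never satisfied if sampling keeps failing
            cond.wait()
        tokens -= 1

threading.Thread(target=utilization_watcher, daemon=True).start()
launch_kernel()                           # blocks forever, like the futex_wait stack above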

mYmNeo commented 4 years ago

@mYmNeo I tried the same config on another host and it works well. The difference is the GPU type and GPU driver version: the Tesla K80 host (Driver Version 418.39, CUDA Version 10.1) doesn't work (though it works fine without vcuda-controller), while the Tesla V100 host (Driver Version 418.87.01, CUDA Version 10.1) works well.

I'm not sure whether the GPU type or the driver version causes this problem.

Sorry, I didn't notice that your card is a K-series card. The driver API that NVIDIA provides for utilization doesn't work on K-series cards. I'll add a note to the README to mention this scenario.
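
If anyone wants to verify this on a given node, a quick check with pynvml (assuming an nvidia-ml-py build recent enough to expose nvmlDeviceGetProcessUtilization) looks like the following; on cards where the driver doesn't support the query, the call raises an NVML error instead of returning samples:

import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    try:
        # 0 = lastSeenTimeStamp, i.e. return all buffered samples
        samples = pynvml.nvmlDeviceGetProcessUtilization(handle, 0)
        print("per-process utilization samples:", samples)
    except pynvml.NVMLError as err:
        print("per-process utilization not available:", err)
finally:
    pynvml.nvmlShutdown()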

pokerfaceSad commented 4 years ago

@mYmNeo I tried the same config on another host and it works well. The difference is the GPU type and GPU driver version: the Tesla K80 host (Driver Version 418.39, CUDA Version 10.1) doesn't work (though it works fine without vcuda-controller), while the Tesla V100 host (Driver Version 418.87.01, CUDA Version 10.1) works well. I'm not sure whether the GPU type or the driver version causes this problem.

Sorry, I didn't notice that your card is a K-series card. The driver API that NVIDIA provides for utilization doesn't work on K-series cards. I'll add a note to the README to mention this scenario.

Great, thanks for your replies over these two weeks; we finally found the cause of the problem 🤣🤣. By the way, do gpu-manager and vcuda-controller accept code contributions from third-party developers? I think they are good projects; maybe I can do something.

mYmNeo commented 4 years ago

@mYmNeo I tried the same config on another host and it works well. The difference is the GPU type and GPU driver version: the Tesla K80 host (Driver Version 418.39, CUDA Version 10.1) doesn't work (though it works fine without vcuda-controller), while the Tesla V100 host (Driver Version 418.87.01, CUDA Version 10.1) works well. I'm not sure whether the GPU type or the driver version causes this problem.

Sorry, I didn't notice that your card is a K-series card. The driver API that NVIDIA provides for utilization doesn't work on K-series cards. I'll add a note to the README to mention this scenario.

Great, thanks for your replies over these two weeks; we finally found the cause of the problem 🤣🤣. By the way, do gpu-manager and vcuda-controller accept code contributions from third-party developers? I think they are good projects; maybe I can do something.

Definitely yes. Any interesting ideas and PRs are welcome.

pokerfaceSad commented 4 years ago

@mYmNeo I tried the same config on another host and it works well. The difference is the GPU type and GPU driver version: the Tesla K80 host (Driver Version 418.39, CUDA Version 10.1) doesn't work (though it works fine without vcuda-controller), while the Tesla V100 host (Driver Version 418.87.01, CUDA Version 10.1) works well. I'm not sure whether the GPU type or the driver version causes this problem.

Sorry, I didn't notice that your card is a K-series card. The driver API that NVIDIA provides for utilization doesn't work on K-series cards. I'll add a note to the README to mention this scenario.

Great, thanks for your replies over these two weeks; we finally found the cause of the problem 🤣🤣. By the way, do gpu-manager and vcuda-controller accept code contributions from third-party developers? I think they are good projects; maybe I can do something.

Definitely yes. Any interesting ideas and PRs are welcome.

Great, I will close this issue.