hongzimao / pensieve

Neural Adaptive Video Streaming with Pensieve (SIGCOMM '17)
http://web.mit.edu/pensieve/
MIT License

failed to allocate 71.81M (75300864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY &&& failed call to cuInit: CUDA_ERROR_NO_DEVICE #82

Open · shankyle opened this issue 5 years ago

shankyle commented 5 years ago

Thanks for your valuable work! I set up the environment as you described (Ubuntu 16.04 LTS, Python 2.7, tensorflow-gpu 1.1.0, TFLearn 0.3.1, Selenium v2.39.0, TITAN Xp) inside an official NVIDIA Docker image. To make sure the GPUs were usable, I ran a quick test with demo code in TensorFlow and PyTorch; the results are shown below:

import tensorflow as tf
sess = tf.Session()

and got the expected response:

name: TITAN Xp
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:0f:00.0
Total memory: 11.90GiB
Free memory: 11.73GiB
2019-08-03 04:40:46.558896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2019-08-03 04:40:46.558906: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y
2019-08-03 04:40:46.558925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:0f:00.0)
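(For reference, a more explicit way to list what TensorFlow can see — a minimal sketch using the device_lib helper that ships with TF 1.x:)

    from tensorflow.python.client import device_lib

    # Prints every device TensorFlow can use, e.g. /cpu:0 and /gpu:0.
    print(device_lib.list_local_devices())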

When I ran multi_agent.py (with os.environ['CUDA_VISIBLE_DEVICES'] = '7'), I got:

name: TITAN Xp
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:0f:00.0
Total memory: 11.90GiB
Free memory: 71.81MiB
2019-08-03 03:58:18.296606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2019-08-03 03:58:18.296617: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y
2019-08-03 03:58:18.296638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:0f:00.0)
2019-08-03 03:58:18.305218: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to allocate 71.81M (75300864 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-08-03 03:58:18.321300: E tensorflow/stream_executor/cuda/cuda_driver.cc:893] failed to allocate 64.63M (67770880 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-08-03 03:58:19.090678: E tensorflow/core/common_runtime/direct_session.cc:137] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 12782075904
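(For context, CUDA_VISIBLE_DEVICES only takes effect if it is set before TensorFlow initializes CUDA, and the indices refer to the devices the process can actually see, so on a machine that exposes a single card '0' is the only valid value. A minimal sketch:)

    import os

    # Must be set before TensorFlow (or anything else) initializes CUDA;
    # with a single visible card, '0' is the only valid index.
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'

    import tensorflow as tf
    sess = tf.Session()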

I used just one TITAN Xp GPU, and I think the errors came from TensorFlow's default behavior of grabbing all GPU memory up front, so I added the following code:

    # Cap this process at 70% of GPU memory instead of grabbing it all,
    # and let the allocation grow on demand.
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)
    config = tf.ConfigProto(allow_soft_placement=True, gpu_options=gpu_options)
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        ...

Then I ran multi_agent.py again; the problem above was gone, but it was replaced by another issue:

2019-08-02 18:07:24.522783: E tensorflow/stream_executor/cuda/cuda_driver.cc:405] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2019-08-02 18:07:24.522840: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: 27f8d8a01a9b
2019-08-02 18:07:24.522857: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: 27f8d8a01a9b
2019-08-02 18:07:24.522925: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.130.0
2019-08-02 18:07:24.522964: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.130  Wed Mar 21 03:37:26 PDT 2018
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10)
"""
2019-08-02 18:07:24.522990: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 384.130.0
2019-08-02 18:07:24.523004: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 384.130.0
Testing model restored.

The code still runs and GPU memory is consumed, but training is no faster than CPU-only training; I guess this is because of the CUDA_ERROR_NO_DEVICE. In your paper you said 50,000 iterations took 4 hours, but with the settings above I need 10 hours for the same number of iterations. Which GPUs did you use? And is there another way to train the original code on a GPU? I even ported the code to Python 3 with a newer TensorFlow version on another, new machine, but got the same result. I really hope to get your help!
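(A minimal sketch of one way to confirm where ops actually run — log_device_placement is a standard tf.ConfigProto flag in TF 1.x, and TensorFlow then logs the device each op is assigned to:)

    import tensorflow as tf

    # Logs the placement of every op (GPU vs. CPU) when the graph runs.
    config = tf.ConfigProto(log_device_placement=True)
    sess = tf.Session(config=config)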

hongzimao commented 5 years ago

Note that the model isn't particularly large, so it doesn't require a GPU to speed up the training. You can just use some fast CPUs for training.
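(For completeness, a minimal sketch of CPU-only training along these lines: setting CUDA_VISIBLE_DEVICES to an empty string before TensorFlow is imported hides all GPUs, so every op falls back to CPU kernels.)

    import os

    # Hide every GPU so TensorFlow uses its CPU kernels only.
    os.environ['CUDA_VISIBLE_DEVICES'] = ''

    import tensorflow as tf  # must be imported after the variable is set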