paninski-lab / deepgraphpose

DeepGraphPose

CUDA_ERROR_OUT_OF_MEMORY with 8GB GPU #7

Closed: obarnstedt closed this issue 3 years ago

obarnstedt commented 3 years ago

Hi, I'm very curious how DGP performs on our existing DLC data, so I installed DGP following the instructions on Ubuntu 20.04 with a GeForce RTX 2080 (8 GB), CUDA Toolkit 10.0.130, and driver version 450.102.04. On this machine, DLC (2.0.8) works without problems, but I'm running into memory problems when trying the test run 'python demo/run_dgp_demo.py --dlcpath data/Reaching-Mackenzie-2018-08-30 --test'. Memory monitoring shows about 5.5 GB of GPU memory in use when it tries to allocate an additional 2.53 GB. Is there a way to circumvent this error? With DLC, I used to solve this by allowing GPU growth, but I could see in the code that this has already been included... Maybe this part of the error message is key:

UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory

Below is the full output. Thanks a lot! Oliver


            =====================
            |                   |
            |                   |
            |    Running DGP    |
            |                   |
            |                   |
            =====================

config_path /home/oliver/Git/deepgraphpose/data/Reaching-Mackenzie-2018-08-30/config.yaml
/home/oliver/Git/deepgraphpose/data/Reaching-Mackenzie-2018-08-30/dlc-models/iteration-0/ReachingAug30-trainset95shuffle1/train/pose_cfg.yaml
Warning. Check the number of frames
Warning. Check the number of frames
Initializing ResNet
reachingvideo1
[  5  20  23  28  31  33  36  37  38  40  42  46  48  52  60  68  71  75
  77  80  87  90 100 103 108 118 119 126 141 142 145 151 152 157 167 168
 177 179 180 194 211 213 214 225 227 228 230 231 234 237 240 245]

Creating training datasets
--------------------------
loading hidden indices from /home/oliver/Git/deepgraphpose/data/Reaching-Mackenzie-2018-08-30/dlc-models/iteration-0/ReachingAug30-trainset95shuffle1/train/batched_data/snapshot-0/reachingvideo1__nsjump=None_step=1_ns=10_nc=2048_max=2000_idxs.npy
Starting with standard pose-dataset loader.

n_hidden_frames_total 204
n_visible_frames_total 52
n_frames_total 256
WARNING:py.warnings:/home/oliver/anaconda3/envs/dgp/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:110: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

restoring resnet weights from /home/oliver/Git/deepgraphpose/data/Reaching-Mackenzie-2018-08-30/dlc-models/iteration-0/ReachingAug30-trainset95shuffle1/train/snapshot-step1-final--0
Begin Training for 5 iterations
2021-03-04 10:29:53.912679: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:29:53.913370: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:29:53.997739: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:29:53.998346: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:04.001006: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:04.003373: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:04.051038: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:04.051662: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:04.058106: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:04.058745: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:14.061350: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:14.063780: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:14.085544: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:14.086443: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.089084: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.091384: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.113320: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.114464: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.115391: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.116299: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.122787: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.123699: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.124598: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.125238: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.125967: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:24.126614: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:34.129304: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:34.131674: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:34.136555: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:34.138764: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:44.141494: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-03-04 10:30:44.143829: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.53G (2715310336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Traceback (most recent call last):
  File "/home/oliver/anaconda3/envs/dgp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/oliver/anaconda3/envs/dgp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/oliver/anaconda3/envs/dgp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,187,208,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node resnet_v1_50/block1/unit_2/bottleneck_v1/conv3/Conv2D-0-1-TransposeNCHWToNHWC-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node ConstantFoldingCtrl/absolute_difference/weighted_loss/assert_broadcastable/AssertGuard/Switch_0}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "demo/run_dgp_demo.py", line 243, in <module>
    gm3=gm3)
  File "/home/oliver/Git/deepgraphpose/src/deepgraphpose/models/fitdgp.py", line 816, in fit_dgp
    [loss_eval, _] = sess.run([loss, train_op], feed_dict)
  File "/home/oliver/anaconda3/envs/dgp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/oliver/anaconda3/envs/dgp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/oliver/anaconda3/envs/dgp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/oliver/anaconda3/envs/dgp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[10,187,208,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node resnet_v1_50/block1/unit_2/bottleneck_v1/conv3/Conv2D-0-1-TransposeNCHWToNHWC-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[{{node ConstantFoldingCtrl/absolute_difference/weighted_loss/assert_broadcastable/AssertGuard/Switch_0}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
obarnstedt commented 3 years ago

I got the demo working now. I had to include

    import tensorflow as tf

    # Allocate GPU memory incrementally instead of reserving it all up front.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)

    # Silence TensorFlow's verbose warning output.
    tf.logging.set_verbosity(tf.logging.ERROR)

in run_dgp_demo.py (line 158) to avoid another error message, and then, importantly, decrease --batch_size to 4 (default 10). Interestingly, after it has run through once, I can increase the batch size back to 10 or even 40 without any error messages.
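
For reference, the test run from my first post with the smaller batch is just the same command plus the flag, i.e. something like:

    python demo/run_dgp_demo.py --dlcpath data/Reaching-Mackenzie-2018-08-30 --test --batch_size 4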

ZiyiZhang0912 commented 3 years ago

Hi, I encountered the same error, but even after adding the code above, it still isn't resolved. What could be the reason? Could you help me? Thanks!

waq1129 commented 3 years ago

Hi Ziyi,

How large is your frame size? If the input frames are large, it's possible that no matter how small the batch size is, it won't fit in memory.
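
For a rough sense of scale, here is a back-of-envelope sketch (assuming float32 activations, i.e. 4 bytes per element) of how a single activation tensor like the one in the OOM traceback above scales with batch size and frame area:

    # Back-of-envelope estimate of a single activation tensor's size in MB,
    # assuming float32 activations (4 bytes per element).
    def activation_mb(batch, height, width, channels, bytes_per_elem=4):
        return batch * height * width * channels * bytes_per_elem / 1024 ** 2

    # Shape from the traceback above: [10, 187, 208, 256]
    print(activation_mb(10, 187, 208, 256))  # ~380 MB for this one tensor
    print(activation_mb(4, 187, 208, 256))   # ~152 MB with --batch_size 4

Activation memory grows linearly with both batch size and frame area, which is why lowering --batch_size helps and why very large frames can exhaust 8 GB even at batch size 1.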