rlworkgroup / garage

A toolkit for reproducible reinforcement learning research.
MIT License

TensorFlow launches twice, crashing with CUDA_ERROR_OUT_OF_MEMORY #72

Closed ryanjulian closed 6 years ago

ryanjulian commented 6 years ago
(garage) rjulian@nyquist:~/code/garage$ python contrib/bichengcao/examples/trpo_gym_Pendulum-v0.py
2018-06-13 12:32:00.578288: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-06-13 12:32:00.962006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:17:00.0
totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-06-13 12:32:01.269663: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 11.91GiB freeMemory: 11.74GiB
2018-06-13 12:32:01.574830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties: 
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:b3:00.0
totalMemory: 5.93GiB freeMemory: 4.93GiB
2018-06-13 12:32:01.576350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2
2018-06-13 12:32:02.114822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-13 12:32:02.114860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 
2018-06-13 12:32:02.114868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y N 
2018-06-13 12:32:02.114875: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N N 
2018-06-13 12:32:02.114881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   N N N 
2018-06-13 12:32:02.115360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11368 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:17:00.0, compute capability: 6.1)
2018-06-13 12:32:02.211400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11368 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:65:00.0, compute capability: 6.1)
2018-06-13 12:32:02.306914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 4692 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1060 6GB, pci bus id: 0000:b3:00.0, compute capability: 6.1)
python /home/rjulian/code/garage/scripts/run_experiment_lite.py  --log_dir '/home/rjulian/code/garage/data/local/experiment/experiment_2018_06_13_12_32_05_0001'  --plot 'True'  --exp_name 'experiment_2018_06_13_12_32_05_0001'  --variant_data 'gAN9cQBYCAAAAGV4cF9uYW1lcQFYIwAAAGV4cGVyaW1lbnRfMjAxOF8wNl8xM18xMl8zMl8wNV8wMDAxcQJzLg=='  --use_cloudpickle 'True'  --args_data 'gASVngMAAAAAAACMF2Nsb3VkcGlja2xlLmNsb3VkcGlja2xllIwOX2ZpbGxfZnVuY3Rpb26Uk5QoaACMD19tYWtlX3NrZWxfZnVuY5STlGgAjA1fYnVpbHRpbl90eXBllJOUjAhDb2RlVHlwZZSFlFKUKEsASwBLBUsTS0dDknQAAGoBAGQBAIMBAH0BAHQCAGQCAHQDAHwBAIMBAGQDAGQTAIMAAn0CAHQEAGQCAHQDAHwBAIMBAIMAAX0DAHQFAGQFAHwBAGQGAHwCAGQHAHwDAGQIAGQJAGQKAHQGAHwBAIMBAGQLAGQMAGQNAGQOAGQPAGQQAGQRAGQSAIMACX0EAHwEAGoHAIMAAAFkAABTlChOjAtQZW5kdWx1bS12MJSMCGVudl9zcGVjlIwMaGlkZGVuX3NpemVzlEsgjANlbnaUjAZwb2xpY3mUjAhiYXNlbGluZZSMCmJhdGNoX3NpemWUTaAPjA9tYXhfcGF0aF9sZW5ndGiUjAVuX2l0cpRLMowIZGlzY291bnSURz/vrhR64UeujAlzdGVwX3NpemWURz+EeuFHrhR7jARwbG90lIhLIEsghpR0lCiMA2d5bZSMBG1ha2WUjBFHYXVzc2lhbk1MUFBvbGljeZSMBHNwZWOUjBVMaW5lYXJGZWF0dXJlQmFzZWxpbmWUjARUUlBPlIwHaG9yaXpvbpSMBXRyYWlulHSUKIwBX5RoDmgPaBCMBGFsZ2+UdJSMM2NvbnRyaWIvYmljaGVuZ2Nhby9leGFtcGxlcy90cnBvX2d5bV9QZW5kdWx1bS12MC5weZSMCHJ1bl90YXNrlEsLQxwAAQ8CGwIVAgYBBgEGAQYBBgEMAQYBBgEGAQkClCkpdJRSlEr/////fZSHlFKUfZQojAhkZWZhdWx0c5ROjA5jbG9zdXJlX3ZhbHVlc5ROjARkaWN0lH2UjAhxdWFsbmFtZZRoJowHZ2xvYmFsc5R9lChoHIwPcmxsYWIuZW52cy51dGlslGgck5RoHYwncmxsYWIuYmFzZWxpbmVzLmxpbmVhcl9mZWF0dXJlX2Jhc2VsaW5llGgdk5RoH2g1aB+TlGgbjCJybGxhYi5wb2xpY2llcy5nYXVzc2lhbl9tbHBfcG9saWN5lGgbk5RoHowQcmxsYWIuYWxnb3MudHJwb5RoHpOUaBloAIwJc3ViaW1wb3J0lJOUaBmFlFKUdYwGbW9kdWxllIwIX19tYWluX1+UdXRSLg=='  --n_parallel '1'  --snapshot_mode 'last'
2018-06-13 12:32:07.078696: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-06-13 12:32:07.776116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:17:00.0
totalMemory: 11.91GiB freeMemory: 428.06MiB
2018-06-13 12:32:08.067162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 1 with properties: 
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 11.91GiB freeMemory: 428.06MiB
2018-06-13 12:32:08.349047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 2 with properties: 
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:b3:00.0
totalMemory: 5.93GiB freeMemory: 225.56MiB
2018-06-13 12:32:08.349483: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0, 1, 2
2018-06-13 12:32:08.893459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-13 12:32:08.893498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 1 2 
2018-06-13 12:32:08.893506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N Y N 
2018-06-13 12:32:08.893512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 1:   Y N N 
2018-06-13 12:32:08.893518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 2:   N N N 
2018-06-13 12:32:08.893887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 147 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:17:00.0, compute capability: 6.1)
2018-06-13 12:32:08.896111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 147 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:65:00.0, compute capability: 6.1)
2018-06-13 12:32:08.898329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 169 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1060 6GB, pci bus id: 0000:b3:00.0, compute capability: 6.1)
2018-06-13 12:32:08.898867: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 169.56M (177799168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-06-13 12:32:09.939292 PDT | tensorboard data will be logged into:/home/rjulian/code/garage/data/local/experiment/experiment_2018_06_13_12_32_05_0001/progress
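
For anyone reading the log above: the clearest symptom is the `freeMemory` field, which drops from about 11.74GiB on the first device enumeration to 428.06MiB on the second, once the first process has claimed the GPUs. The following is a minimal sketch for pulling those readings out of a captured log. The helper name `free_memory_mib` is hypothetical and not part of garage or TensorFlow; the regular expression assumes only the `freeMemory: <number><unit>` format shown in this log.

```python
import re

# Matches the "freeMemory: <number><unit>" field that TensorFlow prints
# while enumerating GPUs (format taken from the log in this issue).
_FREE_MEMORY = re.compile(r"freeMemory:\s*([0-9.]+)(GiB|MiB)")


def free_memory_mib(log_text):
    """Return every freeMemory reading found in the log, converted to MiB."""
    readings = []
    for value, unit in _FREE_MEMORY.findall(log_text):
        scale = 1024.0 if unit == "GiB" else 1.0
        readings.append(float(value) * scale)
    return readings


# Two lines copied from the log above: GPU 0 as seen by the first
# process, then by the second process launched a few seconds later.
sample = (
    "totalMemory: 11.91GiB freeMemory: 11.74GiB\n"
    "totalMemory: 11.91GiB freeMemory: 428.06MiB\n"
)
first, second = free_memory_mib(sample)
print("free at first launch: %.2f MiB, at second launch: %.2f MiB"
      % (first, second))
```

A large drop between consecutive enumerations of the same device within one run suggests that an earlier process is still holding the memory.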
ryanjulian commented 6 years ago

This was probably introduced by the TensorBoard logger.

ryanjulian commented 6 years ago

This also means we can't launch TensorBoard in parallel with training:

(garage) rjulian@nyquist:~/.../garage/sandbox/embed2learn$ tensorboard --logdir ../../data/local/trpo-point-embed/
2018-06-14 16:50:11.081257: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2018-06-14 16:50:12.551595: E tensorflow/core/common_runtime/direct_session.cc:154] Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 12788498432
Traceback (most recent call last):
  File "/home/rjulian/miniconda2/envs/garage/bin/tensorboard", line 11, in <module>
    sys.exit(run_main())
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorboard/main.py", line 36, in run_main
    tf.app.run(main)
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorboard/main.py", line 45, in main
    default.get_assets_zip_provider())
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorboard/program.py", line 166, in main
    tb = create_tb_app(plugins, assets_zip_provider)
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorboard/program.py", line 201, in create_tb_app
    flags=FLAGS)
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorboard/backend/application.py", line 126, in standard_tensorboard_wsgi
    plugin_instances = [constructor(context) for constructor in plugins]
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorboard/backend/application.py", line 126, in <listcomp>
    plugin_instances = [constructor(context) for constructor in plugins]
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorboard/plugins/beholder/beholder_plugin.py", line 47, in __init__
    self.most_recent_frame = im_util.get_image_relative_to_script('no-data.png')
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorboard/plugins/beholder/im_util.py", line 254, in get_image_relative_to_script
    return read_image(filename)
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorboard/plugins/beholder/im_util.py", line 242, in read_image
    return np.array(decode_png(image_file.read()))
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorboard/plugins/beholder/im_util.py", line 159, in __call__
    self._lazily_initialize()
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorboard/plugins/beholder/im_util.py", line 137, in _lazily_initialize
    self._session = tf.Session(graph=graph, config=config)
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1560, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/rjulian/miniconda2/envs/garage/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 633, in __init__
    self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
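
One possible workaround for the traceback above, sketched under assumptions and not necessarily the fix that was eventually merged: start TensorBoard with `CUDA_VISIBLE_DEVICES` set to an empty string, so the CUDA runtime reports no GPUs to that process and the session it creates runs on the CPU. `CUDA_VISIBLE_DEVICES` is a standard CUDA environment variable; the helper names `cpu_only_env`, `tensorboard_command`, and `launch_tensorboard` are hypothetical.

```python
import os
import subprocess


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # An empty CUDA_VISIBLE_DEVICES hides every GPU from the child
    # process, so it cannot compete with training for GPU memory.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def tensorboard_command(logdir):
    """Build the same command line used in this issue."""
    return ["tensorboard", "--logdir", logdir]


def launch_tensorboard(logdir):
    """Start TensorBoard as a CPU-only child process (not called here)."""
    return subprocess.Popen(tensorboard_command(logdir), env=cpu_only_env())
```

The shell equivalent is to prefix the command with `CUDA_VISIBLE_DEVICES=` when starting `tensorboard`. This only keeps TensorBoard off the GPUs; it does not change how much memory the training process claims.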
ryanjulian commented 6 years ago

Confirmed that this error occurs only with tensorflow_gpu.
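
To check which build an environment has before trying to reproduce this, a small sketch along these lines may help. It assumes Python 3.8 or newer for `importlib.metadata` (the environment in this issue used Python 3.5, where this module is unavailable) and assumes the PyPI distribution names `tensorflow` and `tensorflow-gpu`. The function name `installed_tensorflow_variants` is hypothetical.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_tensorflow_variants():
    """Map each TensorFlow distribution name to its version, or None."""
    variants = {}
    for name in ("tensorflow", "tensorflow-gpu"):
        try:
            variants[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # This distribution is not installed in the environment.
            variants[name] = None
    return variants


print(installed_tensorflow_variants())
```

If only the CPU build is reported, the out-of-memory error described here should not be expected to appear.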

CatherineSue commented 6 years ago

Fixed in #88.