ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] a3c with pysc2, OOM #1090

Closed linshiyx closed 7 years ago

linshiyx commented 7 years ago

Environment: Ubuntu 16.04, Python 3.5, GTX 1070. I modified the A3C implementation in RLlib to run with PySC2, but it crashes on startup and prints the following error message.

    2017-10-09 10:23:48.923784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
    2017-10-09 10:23:48.923795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
    2017-10-09 10:23:48.923809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0)
    Game has started.
    2017-10-09 10:23:56.186773: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 2.00G (2147483648 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    2017-10-09 10:23:56.188051: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 1.80G (1932735232 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    2017-10-09 10:23:56.189160: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 1.62G (1739461632 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    2017-10-09 10:23:56.190288: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 1.46G (1565515520 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    2017-10-09 10:23:56.191386: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 1.31G (1408964096 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    2017-10-09 10:23:56.192496: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 1.18G (1268067840 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    2017-10-09 10:23:56.193613: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 1.06G (1141261056 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    2017-10-09 10:23:56.194712: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 979.55M (1027134976 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    2017-10-09 10:23:56.195824: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 881.60M (924421632 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
    2017-10-09 10:23:56.218034: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 2.27G (2441510912 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

ericl commented 7 years ago

Is it crashing in the workers? One thing to try is exporting the environment variable CUDA_VISIBLE_DEVICES="" to disable GPU usage. Alternatively, you could modify the Runner class to use specific GPUs, as documented at http://ray.readthedocs.io/en/latest/using-ray-with-gpus.html.
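For reference, a minimal sketch of the env-var approach (this is not the actual RLlib Runner code; it just hides the GPUs from TensorFlow in whichever process sets it, so the session falls back to CPU):

    import os

    # Hide all GPUs from CUDA/TensorFlow in this process. This must be set
    # before TensorFlow initializes its devices to have any effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    import tensorflow as tf

    # Any session created now will not see, or allocate memory on, the GPU.
    sess = tf.Session()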

The underlying issue seems to be that the A3C implementation does not set any GPU resource restrictions by default. In Ray 0.2 this ends up allowing all workers to use the GPUs, which can cause memory conflicts since TensorFlow eagerly allocates all GPU memory by default. cc @richardliaw
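As a side note, if multiple workers do need to share one GPU, TensorFlow's eager allocation can be tamed per process. A sketch using the TF 1.x session config (matching the TF version in the logs above; not something RLlib does for you here):

    import tensorflow as tf

    # Allocate GPU memory on demand instead of grabbing the whole device up front.
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    # Alternatively, cap each process at a fixed fraction of device memory:
    # config.gpu_options.per_process_gpu_memory_fraction = 0.2
    sess = tf.Session(config=config)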

linshiyx commented 7 years ago

Thanks, disabling GPU usage worked. I hope to see options in Ray to control GPU usage.

robertnishihara commented 7 years ago

Great! Closing for now.

As of #1044, we should automatically set CUDA_VISIBLE_DEVICES to the empty string unless a task/actor explicitly requests some GPUs.
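A minimal sketch of what an explicit GPU request looks like, following the Ray GPU docs linked above (the task body is hypothetical; it only reports which GPUs Ray assigned):

    import ray

    ray.init(num_gpus=1)

    @ray.remote(num_gpus=1)
    def gpu_task():
        # Ray assigns this task a GPU because of num_gpus=1;
        # tasks that request no GPUs will not see any.
        return ray.get_gpu_ids()

    print(ray.get(gpu_task.remote()))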