seungjaeryanlee / agents

TF-Agents is a library for Reinforcement Learning in TensorFlow
Apache License 2.0

RND stuck at initialization for Montezuma's Revenge #9

Closed: seungjaeryanlee closed this issue 5 years ago

seungjaeryanlee commented 5 years ago

I created a VM instance and set up TF-Agents to evaluate my RND implementation. However, training has been stuck for over an hour with only the following output:

Command:

python tf_agents/agents/ppo/examples/v2/train_eval_rnd.py  \
  --root_dir=$HOME/tmp/rndppo/gym/MontezumaRevengeNoFrameskip-v4/  \
  --logtostderr --env_name=MontezumaRevengeNoFrameskip-v4

End of Output:

2019-07-10 11:20:57.525518: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-07-10 11:20:57.525886: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-07-10 11:20:57.527944: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-07-10 11:20:57.529476: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-07-10 11:20:57.533817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-07-10 11:20:57.533963: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-10 11:20:57.534318: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-10 11:20:57.534617: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-07-10 11:20:57.535085: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-07-10 11:20:57.654203: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-10 11:20:57.654721: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55efc5579bd0 executing computations on platform CUDA. Devices:
2019-07-10 11:20:57.654771: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla P4, Compute Capability 6.1
2019-07-10 11:20:57.658236: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000165000 Hz
2019-07-10 11:20:57.659601: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55efc55f0840 executing computations on platform Host. Devices:
2019-07-10 11:20:57.659633: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-10 11:20:57.659856: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-10 11:20:57.660225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:00:04.0
2019-07-10 11:20:57.660281: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-07-10 11:20:57.660316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-07-10 11:20:57.660326: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-07-10 11:20:57.660335: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-07-10 11:20:57.660356: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-07-10 11:20:57.660365: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-07-10 11:20:57.660381: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-07-10 11:20:57.660440: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-10 11:20:57.660760: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-10 11:20:57.661044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-07-10 11:20:57.661134: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-07-10 11:20:57.661798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-10 11:20:57.661823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2019-07-10 11:20:57.661831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2019-07-10 11:20:57.662069: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-10 11:20:57.662421: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-10 11:20:57.662729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7116 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:04.0, compute capability: 6.1)
I0710 11:20:57.880251 139973894788480 parallel_py_environment.py:81] Spawning all processes.
I0710 11:21:03.859619 139973894788480 parallel_py_environment.py:88] All processes started.
2019-07-10 11:21:04.604089: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 12108096000 exceeds 10% of system memory.
2019-07-10 11:21:06.498569: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

Ctrl+C also doesn't seem to kill it, so I had to use the kill command. Are you aware of any particular causes for this issue?
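
For reference, the last warning before the hang reports a single ~12.1 GB host allocation. A back-of-the-envelope reading suggests this is the observation replay buffer stored as float32; the 30 parallel environments and the 1001-step capacity below are assumed defaults of the PPO example, not values confirmed from the script:

num_envs = 30                      # assumed number of parallel environments
buffer_capacity = 1001             # assumed per-environment replay buffer capacity
frame_bytes = 210 * 160 * 3 * 4    # one full-resolution Atari frame as float32
print(num_envs * buffer_capacity * frame_bytes)  # 12108096000, matching the logged allocation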

seungjaeryanlee commented 5 years ago

More information: Running RND with LunarLander-v2 or CartPole-v0 works as intended.

python tf_agents/agents/ppo/examples/v2/train_eval_rnd.py \
  --root_dir=$HOME/tmp/rndppo/gym/LunarLander-v2/   \
  --logtostderr
python tf_agents/agents/ppo/examples/v2/train_eval_rnd.py \
  --root_dir=$HOME/tmp/rndppo/gym/CartPole-v0/ \
  --logtostderr --env_name=CartPole-v0
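
The difference in observation size between these environments and Montezuma's Revenge is large. A quick check, assuming a gym install with the Box2D and Atari extras available, shows why the image-based environment is so much heavier:

# Compare observation sizes: the working envs emit small vectors, the Atari env emits images.
import numpy as np
import gym

for env_id in ["CartPole-v0", "LunarLander-v2", "MontezumaRevengeNoFrameskip-v4"]:
    env = gym.make(env_id)
    shape = env.observation_space.shape
    print(env_id, shape, int(np.prod(shape)), "values per step")
    env.close()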
seungjaeryanlee commented 5 years ago

Reducing the hyperparameters does not seem to have any effect :(

seungjaeryanlee commented 5 years ago

The MontezumaRevenge-v0 variant gets past that step! However, it then fails with an OOM error.

A small excerpt of the OOM error log is below:

2019-07-10 11:52:21.030124: I tensorflow/core/common_runtime/bfc_allocator.cc:816] Sum Total of in-use chunks: 5.57GiB
2019-07-10 11:52:21.030129: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 7462076416 memory_limit_: 7462076416 available bytes: 0 curr_region_allocation_bytes_: 14924152832
2019-07-10 11:52:21.030199: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats: 
Limit:                  7462076416
InUse:                  5985544960
MaxInUse:               7395830528
NumAllocs:                  602436
MaxAllocSize:           1651557888

2019-07-10 11:52:21.030329: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ***********************************************************************************_________________
2019-07-10 11:52:21.030396: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at gpu_swapping_kernels.cc:72 : Resource exhausted: OOM when allocating tensor with shape[4000,100800] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "tf_agents/agents/ppo/examples/v2/train_eval_rnd.py", line 283, in <module>
    app.run(main)
  File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "tf_agents/agents/ppo/examples/v2/train_eval_rnd.py", line 278, in main
    num_eval_episodes=FLAGS.num_eval_episodes)
  File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/gin/config.py", line 1032, in wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
    six.raise_from(proxy.with_traceback(exception.__traceback__), None)
  File "<string>", line 3, in raise_from
  File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/gin/config.py", line 1009, in wrapper
    return fn(*new_args, **new_kwargs)
  File "tf_agents/agents/ppo/examples/v2/train_eval_rnd.py", line 231, in train_eval
    total_loss, _ = tf_agent.train(experience=trajectories)
  File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 429, in __call__
    return self._stateless_fn(*args, **kwds)
  File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1662, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 635, in _filtered_call
    self.captured_inputs)
  File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 733, in _call_flat
    outputs = self._inference_function.call(ctx, args)
  File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 459, in call
    ctx=ctx)
  File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[4,1000,210,160,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node epoch_0/normalize_observations/normalize_1/normalized_tensor/ArithmeticOptimizer/HoistCommonFactor_Add_add_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[Func/Losses/total_abs_loss/write_summary/summary_cond/then/_107/input/_222/_120]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[4,1000,210,160,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node epoch_0/normalize_observations/normalize_1/normalized_tensor/ArithmeticOptimizer/HoistCommonFactor_Add_add_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_504298]

Function call stack:
train -> train

  In call to configurable 'train_eval' (<function train_eval at 0x7effb3c2ae18>)
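
The tensor shape in the error message gives a rough sense of the scale; a minimal sketch of the arithmetic (how many copies the graph keeps alive at once is an assumption, not something measured here):

batch_bytes = 4 * 1000 * 210 * 160 * 3 * 4   # float32 tensor of shape [4, 1000, 210, 160, 3]
print(round(batch_bytes / 1e9, 2), "GB per observation batch")   # ~1.61 GB
# The observation-normalization and loss graph needs several such tensors alive at the
# same time (inputs, intermediates, gradients), so a handful of copies already exceeds
# the ~7.46 GB memory limit the BFC allocator reports for the Tesla P4.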

I will try installing the CPU-only version of TensorFlow and see if it still causes an OOM error.

FYI, my VM spec is:

16 vCPUs
60 GB RAM
1 GPU (Tesla P4)

Side note: running without RND (train_eval.py instead of train_eval_rnd.py) with MontezumaRevenge-v0 also gets stuck. This is unexpected behavior.

seungjaeryanlee commented 5 years ago

v4 envs are registered with max_episode_steps = 100000, which is 10x the v0 value, so it will take a long time to finish the episodes. The required replay buffer capacity also goes up by 10x.

Did not know about this! Thank you
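
For reference, the registered step limits can be checked directly from the gym registry; a minimal sketch, assuming a gym version with the Atari environments registered:

# Compare the per-episode step limits of the two environment variants.
import gym

for env_id in ["MontezumaRevenge-v0", "MontezumaRevengeNoFrameskip-v4"]:
    print(env_id, gym.spec(env_id).max_episode_steps)
# The NoFrameskip-v4 variant allows roughly 10x more steps per episode (and skips no
# frames), so each collected episode is far longer and the buffers grow accordingly.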