More information: Running RND with LunarLander-v2 or CartPole-v0 works as intended.
python tf_agents/agents/ppo/examples/v2/train_eval_rnd.py \
--root_dir=$HOME/tmp/rndppo/gym/LunarLander-v2/ \
--logtostderr
python tf_agents/agents/ppo/examples/v2/train_eval_rnd.py \
--root_dir=$HOME/tmp/rndppo/gym/CartPole-v0/ \
--logtostderr --env_name=CartPole-v0
Reducing hyperparameters does not seem to have any effect :(
The MontezumaRevenge-v0 variant passes that step! However, it then fails with an OOM error.
A short excerpt of the OOM error log is below:
2019-07-10 11:52:21.030124: I tensorflow/core/common_runtime/bfc_allocator.cc:816] Sum Total of in-use chunks: 5.57GiB
2019-07-10 11:52:21.030129: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 7462076416 memory_limit_: 7462076416 available bytes: 0 curr_region_allocation_bytes_: 14924152832
2019-07-10 11:52:21.030199: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit: 7462076416
InUse: 5985544960
MaxInUse: 7395830528
NumAllocs: 602436
MaxAllocSize: 1651557888
2019-07-10 11:52:21.030329: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ***********************************************************************************_________________
2019-07-10 11:52:21.030396: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at gpu_swapping_kernels.cc:72 : Resource exhausted: OOM when allocating tensor with shape[4000,100800] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "tf_agents/agents/ppo/examples/v2/train_eval_rnd.py", line 283, in <module>
app.run(main)
File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "tf_agents/agents/ppo/examples/v2/train_eval_rnd.py", line 278, in main
num_eval_episodes=FLAGS.num_eval_episodes)
File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/gin/config.py", line 1032, in wrapper
utils.augment_exception_message_and_reraise(e, err_str)
File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/gin/utils.py", line 49, in augment_exception_message_and_reraise
six.raise_from(proxy.with_traceback(exception.__traceback__), None)
File "<string>", line 3, in raise_from
File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/gin/config.py", line 1009, in wrapper
return fn(*new_args, **new_kwargs)
File "tf_agents/agents/ppo/examples/v2/train_eval_rnd.py", line 231, in train_eval
total_loss, _ = tf_agent.train(experience=trajectories)
File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 429, in __call__
return self._stateless_fn(*args, **kwds)
File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1662, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 635, in _filtered_call
self.captured_inputs)
File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 733, in _call_flat
outputs = self._inference_function.call(ctx, args)
File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 459, in call
ctx=ctx)
File "/home/seungjaeryanlee/anaconda3/envs/gsoc/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[4,1000,210,160,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node epoch_0/normalize_observations/normalize_1/normalized_tensor/ArithmeticOptimizer/HoistCommonFactor_Add_add_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[Func/Losses/total_abs_loss/write_summary/summary_cond/then/_107/input/_222/_120]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[4,1000,210,160,3] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node epoch_0/normalize_observations/normalize_1/normalized_tensor/ArithmeticOptimizer/HoistCommonFactor_Add_add_1}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored. [Op:__inference_train_504298]
Function call stack:
train -> train
In call to configurable 'train_eval' (<function train_eval at 0x7effb3c2ae18>)
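For context, the failing tensor shape already explains most of the pressure on the ~7 GiB the allocator reports as its limit. A back-of-the-envelope check (my own sketch, assuming float32 as stated in the error message):

```python
# Shape reported in the OOM message: [4, 1000, 210, 160, 3],
# i.e. 4 episodes x 1000 steps of raw 210x160 RGB Atari frames.
elements = 4 * 1000 * 210 * 160 * 3      # 403,200,000 values
size_gib = elements * 4 / 2**30          # 4 bytes per float32
print(f"{size_gib:.2f} GiB")             # ~1.50 GiB for a single copy of this tensor
```

Several copies of tensors this size, plus activations, can easily exhaust the reported limit.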
I will try installing the CPU-only version of TensorFlow and see if it still causes an OOM error.
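A possibly quicker check than reinstalling: TensorFlow can be told to ignore the GPU at runtime, which should have the same effect for this experiment (a sketch, using the experimental config API available in TF 2.0):

```python
# Hide the GPU from TensorFlow so everything runs on CPU.
# Must be called before any op places state on the GPU.
import tensorflow as tf
tf.config.experimental.set_visible_devices([], 'GPU')
```

Setting CUDA_VISIBLE_DEVICES="" in the shell before launching the script should work as well.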
FYI, my VM spec is:
16 vCPUs
60 GB RAM
1 GPU (Tesla P4)
Sidenote: Running without RND (running train_eval.py instead of train_eval_rnd.py) with MontezumaRevenge-v0 gets stuck. This is unexpected behavior.
v4 envs are registered with max_episode_steps = 100000, which is 10x the v0 value. This means it will take a long time to finish the episodes, and the required replay buffer capacity also goes up by 10x.
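For anyone following along, the registered limits can be read directly from the env specs (a quick sketch; the exact numbers depend on the installed gym version):

```python
import gym

# Compare the registered time limits of the two variants.
for env_id in ['MontezumaRevenge-v0', 'MontezumaRevenge-v4']:
    print(env_id, gym.spec(env_id).max_episode_steps)
```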
I did not know about this! Thank you.
I created a VM instance and set up TF-Agents to evaluate my RND. However, it has been stuck for over an hour with just the following output:
Command:
End of Output:
Ctrl+C also doesn't seem to kill it, so I had to use the kill command. Are you aware of any particular causes for this issue?
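One low-overhead way to see where a stuck run is spending its time (not something tried in this thread, just a sketch) is to register a faulthandler signal in the training script, so the Python stacks of all threads can be dumped on demand:

```python
# Add near the top of the training script (Linux/macOS only).
# `kill -USR1 <pid>` then prints every thread's stack to stderr
# without stopping the process.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1)
```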