steveKapturowski / tensorflow-rl

Implementations of deep RL papers and random experimentation
Apache License 2.0

Can't test CartPole-v0 model trained with TRPO #7

Open captify-alapite opened 7 years ago

captify-alapite commented 7 years ago

I tried to train a CartPole-v0 model with TRPO by following the instructions on your OpenAI Gym page, changing the command to the following to reflect the API changes made since the score was submitted:

python main.py CartPole-v0 --alg_type trpo --td_lambda 1.0 --cg_damping .05 --episodes_per_batch 25 -n 2 -v 0 --arch FC --trpo_max_rollout 1000 --max_kl .05 --history_length 1 --frame_skip 1 --activation tanh --num_epochs 40

This seems to work, with training proceeding as expected and concluding successfully. However, when I try to evaluate the trained model by running

python main.py CartPole-v0 --alg_type trpo -n 1 --test --restore_checkpoint

I get the following error.

[2017-05-25 16:16:16,587] Error reported to Coordinator: <type 'exceptions.ValueError'>, Cannot feed value of shape (1, 4, 4) for Tensor u'policy_network_0/input:0', which has shape '(?, 84, 84, 4)'
Process TRPOLearner-1:
Traceback (most recent call last):
  File "/home/abiolalapite/.pyenv/versions/2.7.13/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/abiolalapite/Code/ThirdParty/tensorflow-rl/algorithms/actor_learner.py", line 256, in run
    self.test()
  File "/home/abiolalapite/Code/ThirdParty/tensorflow-rl/algorithms/actor_learner.py", line 181, in test
    a = self.choose_next_action(s)[0]
  File "/home/abiolalapite/Code/ThirdParty/tensorflow-rl/algorithms/trpo_actor_learner.py", line 148, in choose_next_action
    return self.policy_network.get_action(self.session, state)
  File "/home/abiolalapite/Code/ThirdParty/tensorflow-rl/networks/policy_v_network.py", line 78, in get_action
    self.logits], feed_dict=feed_dict)
  File "/home/abiolalapite/.pyenv/versions/py2713/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/home/abiolalapite/.pyenv/versions/py2713/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 961, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 4, 4) for Tensor u'policy_network_0/input:0', which has shape '(?, 84, 84, 4)'

steveKapturowski commented 7 years ago

The TRPO learner currently isn't being checkpointed like the other algorithms, but I'll try to get a fix for that in tonight. Also note that, at present, you need to explicitly provide any flags that modify the architecture or agent behavior at test time as well. In this case that would be --arch FC --history_length 1 --activation tanh --frame_skip 1
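
The shape mismatch in the traceback appears consistent with this: without the architecture flags, the test run seems to build an image-shaped input of (?, 84, 84, 4), while the CartPole state being fed has shape (1, 4, 4). The following is only a minimal sketch of that kind of compatibility check, using the two shapes from the error message. The helper feed_is_compatible and the (None, 4, 4) placeholder are hypothetical illustrations, not part of tensorflow-rl or TensorFlow.

```python
# Illustrative sketch only: mimics the kind of shape check that rejects a fed value.
# None stands for an unknown (batch) dimension, printed as "?" in the error message.

def feed_is_compatible(value_shape, placeholder_shape):
    """Return True if a value with value_shape could be fed to the placeholder."""
    if len(value_shape) != len(placeholder_shape):
        return False  # rank mismatch, e.g. 3 dimensions fed into a 4-dimension input
    return all(p is None or v == p
               for v, p in zip(value_shape, placeholder_shape))

# Shapes copied from the error message in the report.
fed_state = (1, 4, 4)
image_input = (None, 84, 84, 4)
print(feed_is_compatible(fed_state, image_input))  # False

# Hypothetical placeholder whose shape matches the fed state (an assumption,
# not the shape the repository necessarily builds with --arch FC).
matching_input = (None, 4, 4)
print(feed_is_compatible(fed_state, matching_input))  # True
```

With the flags listed above appended, the test command from the report would presumably read:

python main.py CartPole-v0 --alg_type trpo -n 1 --test --restore_checkpoint --arch FC --history_length 1 --activation tanh --frame_skip 1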

For now I would recommend using the --use_monitor flag during training for any solved environments, since the primary objective in those environments is to minimize the number of training episodes needed to solve them.

ph4m commented 7 years ago

Hey there, by any chance, do you still have plans for the following?

The TRPO learner currently isn't being checkpointed like the other algorithms, but I'll try to get a fix for that in tonight.

Thanks!

steveKapturowski commented 7 years ago

I do. I've been swamped with work recently, so I forgot to take care of this, but I should have time to fix it this weekend.