Open takerfume opened 6 years ago
Hi, I think I discovered a bug when training an LSTM policy with PPO2 on a MuJoCo env.
I ran this command:
python -m baselines.run --alg=ppo2 --env=Reacher-v2 --num_timesteps=1e6 --network=lstm --nminibatches=2 --num_env=4
and I get this error.
How can I train an LSTM policy with PPO2 on MuJoCo?
For your information, I can successfully train an LSTM policy with PPO2 on PongNoFrameskip-v4.
This is not related to MuJoCo per se, but rather to the fact that the MuJoCo defaults use value_network='copy'; when a copy of the network is created, a new set of placeholders is created for the LSTM state and mask. As a workaround, I'd suggest using the --value_network=shared flag (this way, the policy and value networks will have a shared LSTM cell with the same placeholders). I am looking into solving this issue in a more principled way.
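To illustrate the failure mode, here is a toy sketch (my own, not baselines' actual code) of why the 'copy' option breaks recurrent policies: building a second copy of the network creates a second set of LSTM state/mask placeholders, and a session.run that feeds only the policy's placeholders fails with the "you must feed a value for placeholder" error.

import numpy as np
import tensorflow as tf  # assumes TF 1.x, as used by baselines

def build_net(scope, nlstm=8):
    # Each network copy gets its own state/mask placeholders.
    with tf.variable_scope(scope):
        state = tf.placeholder(tf.float32, [1, 2 * nlstm], name='state')  # LSTM c and h stacked
        mask = tf.placeholder(tf.float32, [1], name='mask')               # episode-start mask
        out = tf.reduce_sum(state) * (1.0 - mask[0])                      # stand-in computation
        return state, mask, out

pi_state, pi_mask, pi_out = build_net('pi')   # policy network
vf_state, vf_mask, vf_out = build_net('vf')   # value_network='copy' builds a second one

with tf.Session() as sess:
    feed = {pi_state: np.zeros((1, 16)), pi_mask: np.zeros(1)}
    sess.run(pi_out, feed)            # fine: only the policy's placeholders are needed
    # sess.run([pi_out, vf_out], feed)  # fails: vf's state/mask placeholders are unfed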
Thank you! I understand the error means I didn't feed a value for a placeholder of the value net, which is created by 'copying' the policy net.
I ran this command and successfully trained an LSTM policy!
python -m baselines.run --alg=ppo2 --network=lstm --num_timesteps=1e6 --env=Reacher-v2 --num_env=4 --nminibatches=2 --value_network=shared
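For reference, the same run can be launched programmatically; this is a sketch assuming baselines' ppo2.learn API at the time, where extra kwargs such as value_network are forwarded to the policy builder (the CLI's --num_env=4 normally uses subprocess workers; DummyVecEnv is a simpler stand-in here):

import gym
from baselines.ppo2 import ppo2
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv

# 4 parallel copies of Reacher-v2, matching --num_env=4
env = DummyVecEnv([lambda: gym.make('Reacher-v2') for _ in range(4)])
model = ppo2.learn(network='lstm', env=env, total_timesteps=int(1e6),
                   nminibatches=2, value_network='shared')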
@pzhokhov Have you found that the 'copy' value network (not sharing parameters) produces better results on MuJoCo? Do you have any guess as to why this would be the case?
Generally, not sharing parameters makes training more stable (less sensitive to hyperparameters such as the value function coefficient in the training objective or the learning rate) because the two objectives do not compete with each other, whereas sharing parameters allows for faster learning (when it works). For image-based observations (and convolutional layers) we use parameter sharing, because otherwise both the value function approximator and the policy would have to learn good visual features, and that may take too many samples. MuJoCo has simulator-state-based observations that do not require much feature learning, and not sharing parameters gives us training that works decently on all environments without much hyperparameter tuning.
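As a concrete (assumed, simplified) picture of the two options: with 'shared' there is one trunk and two heads, so gradients from the value loss flow into the same features the policy uses; with 'copy' each head has its own trunk, so the two objectives cannot interfere. Baselines' actual networks differ; this is only a sketch.

import tensorflow as tf  # assumes TF 1.x

def trunk(x, scope):
    # Small MLP trunk standing in for the real policy/value network.
    with tf.variable_scope(scope):
        h = tf.layers.dense(x, 64, activation=tf.nn.tanh)
        return tf.layers.dense(h, 64, activation=tf.nn.tanh)

obs = tf.placeholder(tf.float32, [None, 11])   # Reacher-v2 observations are 11-dim

# value_network='shared': one trunk, two heads; the value loss (scaled by
# vf_coef in the PPO objective) pulls on the same weights as the policy loss.
h = trunk(obs, 'shared')
pi_shared = tf.layers.dense(h, 2, name='pi')   # Reacher-v2 actions are 2-dim
vf_shared = tf.layers.dense(h, 1, name='vf')

# value_network='copy': independent trunks; value-function updates cannot
# disturb the policy's features, at the cost of learning features twice.
pi_copy = tf.layers.dense(trunk(obs, 'pi_trunk'), 2, name='pi_copy')
vf_copy = tf.layers.dense(trunk(obs, 'vf_trunk'), 1, name='vf_copy')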
@pzhokhov Is there any update so far? The 'copy' value network is still not supported for LSTM with PPO2.