ernestum opened this issue 5 years ago
My first thought was that the runner should keep the data untouched and we should feed it to the policy (i.e. the ActorCriticPolicy) in the format [num_steps, num_envs, x]. What do you think?
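For concreteness, here is a minimal sketch of what I have in mind, written with plain TF1 ops rather than the actual ActorCriticPolicy code (all shapes and names below are made up for illustration): with `time_major=True`, an LSTM can consume the [num_steps, num_envs, x] batch directly, so the runner would not need to swap and flatten anything.

```python
import tensorflow as tf  # TF1-style graph API, as used by stable-baselines

num_steps, num_envs, obs_dim, n_lstm = 128, 8, 4, 64

# Rollout data stays time-major: [num_steps, num_envs, x]
obs_ph = tf.placeholder(tf.float32, [num_steps, num_envs, obs_dim])

# One initial (c, h) state pair per environment
initial_state = tf.nn.rnn_cell.LSTMStateTuple(
    c=tf.placeholder(tf.float32, [num_envs, n_lstm]),
    h=tf.placeholder(tf.float32, [num_envs, n_lstm]),
)

cell = tf.nn.rnn_cell.LSTMCell(n_lstm)
# time_major=True consumes [num_steps, num_envs, x] directly,
# so no swapaxes/reshape is needed before the policy.
# (Handling of episode-start masks is omitted in this sketch.)
lstm_out, final_state = tf.nn.dynamic_rnn(
    cell, obs_ph, initial_state=initial_state, time_major=True)

# Heads (e.g. the value function) can be applied to the 3-D tensor directly;
# dense layers broadcast over the leading [num_steps, num_envs] dimensions.
values = tf.layers.dense(lstm_out, 1)
```

The dense heads broadcast over the leading two dimensions, so the losses could be computed on the 3-D tensors as well, or flattened once at the very end instead of back and forth.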
Yes, I completely agree that the LSTM code is overcomplicated (and that is also the reason I avoid using recurrent policies for now ^^"...). However, I need a bit more time to give you insightful feedback. Ping me again in two weeks if I haven't answered you ;)
Referencing that PR here: https://github.com/openai/baselines/pull/859
I looked at how exactly LSTMs are trained with PPO2 and found that a lot of unnecessary data transformations happen: the rollout data goes from [num_steps, num_envs, x] to [num_steps * num_envs, x] after switching the first two dimensions. All of this seems overly complex and potentially slow to me, which is why I would like to open the discussion here on how matters could be improved. Please set your ideas free :-)
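For reference, the swap-and-flatten step I mean looks roughly like this (a NumPy sketch modelled on the `sf01` helper in baselines' ppo2; the exact code paths in stable-baselines differ slightly):

```python
import numpy as np

num_steps, num_envs, obs_dim = 128, 8, 4

# Rollout buffer as produced by the runner: time-major [num_steps, num_envs, x]
mb_obs = np.zeros((num_steps, num_envs, obs_dim), dtype=np.float32)

def swap_and_flatten(arr):
    # Swap the first two axes, then merge them into a single batch axis
    # (this mirrors the sf01 helper in baselines' ppo2).
    s = arr.shape
    return arr.swapaxes(0, 1).reshape(s[0] * s[1], *s[2:])

flat_obs = swap_and_flatten(mb_obs)  # -> [num_steps * num_envs, x], grouped per env
```

If I read the code correctly, the recurrent policy then uses helpers like `batch_to_seq` / `seq_to_batch` to reshape this flat batch back into per-step sequences for the LSTM, which is the kind of back-and-forth I would like to avoid.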