PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
I was taking a look at your code and wondering whether you handle stale hidden states after each rollout. As far as I can see, the code is stateful at the episode level: when a done is encountered, the hidden states are reset. However, from one rollout to the next, the output hidden state of the previous rollout is copied over as the input hidden state of the current rollout, even though the actor-critic network parameters (including the GRU) have already been updated in between.
Is there a reason why you do not recompute the last rollout's hidden state using the new network weights?
Thank you in advance!
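
To make the question concrete, here is a minimal sketch of the situation being described. The class and variable names are illustrative, not the repo's actual ones: a GRU cell stands in for the recurrent actor-critic, a parameter perturbation stands in for an optimizer step, and the two options contrast reusing the carried-over (stale) hidden state versus replaying the last rollout through the updated weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical minimal recurrent policy: a GRU cell whose hidden state is
# carried from one rollout to the next (names are illustrative only).
gru = nn.GRUCell(input_size=4, hidden_size=8)

def run_rollout(obs_seq, h):
    """Feed a rollout's observations through the GRU, return the final hidden state."""
    for obs in obs_seq:
        h = gru(obs, h)
    return h

obs_seq = [torch.randn(1, 4) for _ in range(5)]

# Rollout collected with the CURRENT weights; final hidden state is kept.
h = run_rollout(obs_seq, torch.zeros(1, 8)).detach()

# Stand-in for the PPO/A2C update: the GRU weights change after collection.
with torch.no_grad():
    for p in gru.parameters():
        p.add_(0.01 * torch.randn_like(p))

# Option A (what the code does): reuse the carried-over hidden state as-is.
stale_h = h

# Option B (what the question proposes): recompute the hidden state by
# replaying the last rollout's observations through the UPDATED GRU.
with torch.no_grad():
    fresh_h = run_rollout(obs_seq, torch.zeros(1, 8))

# The two generally differ, because the weights changed after collection.
print(torch.allclose(stale_h, fresh_h))
```

In practice, Option B would cost an extra forward pass over the previous rollout after every update; the question is whether the bias introduced by the stale state (Option A) matters enough to justify that cost.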