PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
In storage.py, the observation is inserted at index (self.step + 1) while the action is inserted at index (self.step).
Then when a batch is built, it looks like we get tuples from different time steps, i.e. (s_{t-1}, a_t, r_t, ...).
Why are observations and actions stored at offset indices?
Could someone give me some intuition for this?
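To see why the offset indexing still lines up, here is a minimal sketch of the convention (not the actual storage.py, just a toy reproduction of its indexing): the obs buffer has one extra slot, obs[0] holds the initial state, and each insert writes the state observed *after* taking actions[step]. Pairing obs[:-1] with actions then yields aligned (s_t, a_t) tuples.

```python
import torch

num_steps = 4
obs = torch.zeros(num_steps + 1, 1)   # one extra slot: obs[-1] is the bootstrap state
actions = torch.zeros(num_steps, 1)
step = 0

obs[0] = 0.0  # initial observation s_0, filled before the rollout starts

def insert(next_obs, action):
    """Mirror storage.py's indexing: next state at step + 1, action at step."""
    global step
    obs[step + 1] = next_obs
    actions[step] = action
    step += 1

# Fake rollout: in state s_t = t the agent takes a_t = 10 + t
# and the environment returns s_{t+1} = t + 1.
for t in range(num_steps):
    insert(next_obs=float(t + 1), action=float(10 + t))

# Row t of (obs[:-1], actions) is (s_t, a_t): the state the agent was in
# when it chose that action, not (s_{t-1}, a_t).
for t in range(num_steps):
    print(int(obs[t].item()), int(actions[t].item()))
```

So the agent acts from obs[step], and the environment's response lands in obs[step + 1]; dropping the final observation (which is only used to bootstrap the value estimate) recovers time-aligned transitions.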