quantumiracle / Popular-RL-Algorithms

PyTorch implementation of Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Actor-Critic (AC/A2C), Proximal Policy Optimization (PPO), QT-Opt, PointNet.
Apache License 2.0

Variable length episodes #20

Closed · alanmackey closed this issue 3 years ago

alanmackey commented 3 years ago

I tried to use your code on a custom environment that can have variable-length episodes. Maybe I have set it up wrong, but I can't figure out how it can work. The replay buffer is filled with complete episodes (not just single state transitions), but `update` in `td3_lstm` samples those episodes and uses `torch.FloatTensor(state).to(device)` to move them to the GPU. This won't work, because the batch can contain episodes of varying length and PyTorch won't build a tensor from a ragged batch. It would possibly work with a batch size of 1. A toy reproduction of the shape problem is sketched below.
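For concreteness, a minimal sketch of the failure (the episodes below are made up, not the repo's actual buffer contents):

```python
import torch

# Two toy "episodes" of different lengths, each a list of 2-dim states.
episodes = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # length 3
    [[0.7, 0.8]],                          # length 1
]

# Converting the ragged batch directly raises an error,
# because a tensor must be rectangular.
try:
    torch.FloatTensor(episodes)
except ValueError as e:
    print("Cannot build a rectangular tensor:", e)
```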

quantumiracle commented 3 years ago

Hi, I think what you said is correct. For the LSTM version, the whole episode of transitions is sent to train the policy as a single sample. Variable-length episodes can be handled in two ways:

1. Pad episodes with zero transitions up to the maximum episode length in the batch; however, the zero padding may not make sense in some cases.
2. Change the update manner of the LSTM policy to take one transition at a time, but keep the gradients of the hidden states (do not detach them), so that gradients can still flow along the episode.

A sketch of the padding approach is shown after this list. Contributions of other implementations for achieving this are also welcome.
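For option 1, a minimal sketch of padding a sampled batch of episodes before converting it to a single tensor, assuming each sampled episode is already a float tensor of shape `(episode_len, state_dim)` (the variable names and shapes are illustrative, not the repo's buffer API):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical sampled batch of episodes with different lengths (state dim 4).
episodes = [
    torch.randn(7, 4),   # episode of length 7
    torch.randn(12, 4),  # episode of length 12
    torch.randn(3, 4),   # episode of length 3
]
lengths = torch.tensor([ep.shape[0] for ep in episodes])

# Zero-pad to the longest episode so the batch can be stacked.
# Resulting shape: (batch, max_len, state_dim) with batch_first=True.
padded = pad_sequence(episodes, batch_first=True, padding_value=0.0)

# Boolean mask marking the real (non-padded) steps, so padded steps
# can be excluded from the loss instead of contributing spurious targets.
mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]

print(padded.shape)     # torch.Size([3, 12, 4])
print(mask.sum(dim=1))  # tensor([ 7, 12,  3])
```

The mask can be used to zero out loss terms at padded steps; alternatively, `torch.nn.utils.rnn.pack_padded_sequence` can feed the padded batch to the LSTM so the padded steps are skipped by the recurrence entirely.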