hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

Training LSTMs involves lots of data transformation #158

Open ernestum opened 5 years ago

ernestum commented 5 years ago

I looked at how exactly LSTMs are trained with PPO2 and found that a lot of unnecessary data transformations happen:

  1. Trajectories are sampled by the Runner. At the end of its run method the data is transposed from [n_steps, n_envs, x] to [n_envs, n_steps, x] and then flattened to [n_envs * n_steps, x] (see the sketch after this list):
    arr.swapaxes(0, 1).reshape(shape[0] * shape[1], *shape[2:])
  2. In the learn method of PPO2, a hard-to-understand mechanism is used to shuffle the sampled trajectory data without mixing up states that are adjacent in time. It relies on a large flat_indices array:
    flat_indices = np.arange(self.n_envs * self.n_steps).reshape(self.n_envs, self.n_steps)
  3. In the optimization step, the data needs to be disentangled again using the batch_to_seq function, which reconstructs it in the shape [n_steps, n_envs, x] so the LSTM graph can be built (note that this is exactly the format the trajectories were sampled in during step 1):
    input_sequence = batch_to_seq(extracted_features, self.n_env, n_steps)
  4. For further processing, the LSTM output is converted back to the flat format using seq_to_batch:
    rnn_output = seq_to_batch(rnn_output)

All of this seems overly complex and potentially slow to me, which is why I would like to open a discussion here on how it could be improved. Please set your ideas free :-)
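To make the round trip concrete, here is a minimal NumPy sketch of the four steps above. The real batch_to_seq / seq_to_batch helpers operate on TensorFlow tensors (and lists of per-step tensors); the stand-ins below only mimic the shape bookkeeping, and all sizes are made up for illustration:

```python
import numpy as np

n_steps, n_envs, feat = 4, 2, 3  # made-up sizes

# Step 1: the Runner collects data as [n_steps, n_envs, feat] ...
rollout = np.arange(n_steps * n_envs * feat, dtype=np.float32).reshape(n_steps, n_envs, feat)
# ... then swaps the first two axes and flattens to [n_envs * n_steps, feat]
flat = rollout.swapaxes(0, 1).reshape(n_envs * n_steps, feat)

# Step 2: whole environments (not single steps) are shuffled, so the timesteps
# of one environment stay contiguous and in order inside a minibatch
flat_indices = np.arange(n_envs * n_steps).reshape(n_envs, n_steps)
envs_per_batch = 1
env_indices = np.random.permutation(n_envs)[:envs_per_batch]
minibatch = flat[flat_indices[env_indices].ravel()]  # [envs_per_batch * n_steps, feat]

# Step 3: batch_to_seq undoes the flattening so the LSTM sees sequences again
# (NumPy stand-in; the real helper works on TF tensors)
def batch_to_seq(arr, n_envs, n_steps):
    return arr.reshape(n_envs, n_steps, -1).swapaxes(0, 1)  # [n_steps, n_envs, feat]

# Step 4: seq_to_batch flattens the LSTM output back for the feed-forward layers
def seq_to_batch(seq):
    return seq.swapaxes(0, 1).reshape(seq.shape[0] * seq.shape[1], -1)

sequence = batch_to_seq(minibatch, envs_per_batch, n_steps)
assert sequence.shape == (n_steps, envs_per_batch, feat)
assert seq_to_batch(sequence).shape == (envs_per_batch * n_steps, feat)
```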

ernestum commented 5 years ago

My first thought was that the Runner should keep the data untouched and we should feed it to the policy in the [n_steps, n_envs, x] format.
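Roughly like this (a minimal TF1 sketch with made-up sizes and placeholder names, not existing code from the repo; time_major=True lets the RNN consume the [n_steps, n_envs, x] layout directly):

```python
import tensorflow as tf

n_steps, n_envs, obs_dim = 128, 8, 4  # made-up sizes

# Hypothetical placeholder: take the rollout exactly as the Runner produced it,
# shaped [n_steps, n_envs, obs_dim], instead of a flattened [n_envs * n_steps, obs_dim] batch.
obs_ph = tf.placeholder(tf.float32, [n_steps, n_envs, obs_dim], name="obs_seq")

cell = tf.nn.rnn_cell.LSTMCell(num_units=64)
# time_major=True matches the [n_steps, n_envs, ...] layout, so the
# swapaxes / batch_to_seq / seq_to_batch round trip would not be needed.
rnn_output, final_state = tf.nn.dynamic_rnn(cell, obs_ph, dtype=tf.float32, time_major=True)
```

One thing such a sketch glosses over is the episode masks: the current unrolled LSTM uses the dones to reset the state mid-sequence, which dynamic_rnn does not handle out of the box.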

What do you think?

araffin commented 5 years ago

Yes, I completely agree that the LSTM code is overcomplicated (that is also the reason I avoid using recurrent policies for now ^^"...). However, I need a bit more time to give you insightful feedback. Ping me again in two weeks if I haven't answered you ;)

araffin commented 5 years ago

Referencing that PR here: https://github.com/openai/baselines/pull/859