openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms
MIT License

What are the pros and cons of using entire episodes to train a PPO with LSTM at each step? #1026

Open xlnwel opened 4 years ago

xlnwel commented 4 years ago

In this code (https://github.com/openai/baselines/blob/665b888eeb688396894455a0d94febc4f712e0c0/baselines/ppo2/ppo2.py#L174), I spotted that PPO with LSTM uses the entire trajectories for each gradient descent step, where the input data is of shape [envsperbatch, nsteps, *]. I'm wondering whether this is good practice for long trajectories. Why not truncate the trajectories, i.e. for every gradient step use batches of shape [n_envs, stepsperbatch, *] (of course, we would have to keep the intermediate LSTM states in this case)? What are the pros and cons of each method?
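For concreteness, the truncated scheme being proposed might look roughly like this (a minimal numpy sketch; `train_on_window`, the shapes, and the state size are hypothetical illustrations, not the baselines API):

```python
import numpy as np

def train_on_window(window, initial_state):
    """Hypothetical stand-in for one PPO gradient step on a truncated window.
    Returns the LSTM state at the end of the window (fed to the next one)."""
    return initial_state  # placeholder: a real step would run the LSTM forward

n_envs, nsteps, obs_dim, state_dim = 8, 1024, 24, 256
stepsperbatch = 128  # truncation window for backprop through time

rollout = np.zeros((n_envs, nsteps, obs_dim), dtype=np.float32)  # full trajectories
state = np.zeros((n_envs, state_dim), dtype=np.float32)          # carried LSTM state

for start in range(0, nsteps, stepsperbatch):
    window = rollout[:, start:start + stepsperbatch]  # [n_envs, stepsperbatch, obs_dim]
    # Gradients flow only within this window; `state` enters as plain data,
    # so backprop is truncated at the window boundary.
    state = train_on_window(window, initial_state=state)
```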

gomlfx commented 4 years ago

Hi guys, can you tell me if there's a PPO regressor I can call like the scikit-learn-style LGBMRegressor()?

christopherhesse commented 4 years ago

Won't using stepsperbatch instead of nsteps change the amount of LSTM unrolling, since you don't backprop through those intermediate states?

christopherhesse commented 4 years ago

Also, I think the trajectories are already truncated; the * in the shape you mentioned should be a constant regardless of the length of the actual trajectories (the time for an entire episode).

xlnwel commented 4 years ago

I actually intend to truncate the unrolling. To the best of my knowledge, most RNN architectures, even LSTMs, are not good at capturing long-term dependencies, so I doubt it is necessary to unroll the LSTM for thousands of steps. The reason I want to use a smaller sequence length is that I can then use a larger batch size to gain some speed-up. However, my latest attempt to do so did not end up with good performance :-( That is why I asked the question. BTW, the environment I used to test my code is BipedalWalker-v2 from Gym, whose maximum episode length is 1600 steps (but the actual length can be much shorter thanks to the done signal).

christopherhesse commented 4 years ago

You can set nsteps to a smaller number, which will truncate the unrolling. If you need higher GPU utilization you can lower nminibatches or use more parallel environments.
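In baselines terms, that advice corresponds to something like the following (a hedged sketch: the parameter names match ppo2.learn, but the DummyVecEnv setup and the hyperparameter values are only illustrative; note that for recurrent policies nenvs must be divisible by nminibatches):

```python
import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.ppo2 import ppo2

nenvs = 16  # more parallel environments raise throughput without longer unrolls
env = DummyVecEnv([lambda: gym.make('BipedalWalker-v2') for _ in range(nenvs)])

model = ppo2.learn(
    network='lstm',
    env=env,
    total_timesteps=int(1e6),
    nsteps=128,      # shorter rollout segments => shorter LSTM unroll
    nminibatches=4,  # recurrent minibatches are over envs: 16 / 4 = 4 envs each
)
```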

xlnwel commented 4 years ago

Hi, thanks for the response. I know how to do that, but I found that it impaired performance somehow (I'm not sure whether that is due to an issue in my implementation)... Therefore, I'm wondering whether it is good practice to truncate the sequence length. For example, in OpenAI Five, is the sequence length truncated to a smaller size? How does it handle the scenario where trajectories have different sequence lengths? I personally tried padding with zeros, but some sequences may be much shorter than others, which makes me wonder whether such a padding mechanism is a good choice.
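For reference, a zero-padding-plus-mask scheme like the one described might look as follows (a minimal numpy sketch with made-up lengths; masking the per-step losses keeps the padded steps from contributing to the gradient):

```python
import numpy as np

obs_dim = 24  # BipedalWalker-v2 observation size, used here just for shapes
trajectories = [np.random.randn(t, obs_dim).astype(np.float32) for t in (1600, 400, 37)]

max_len = max(len(traj) for traj in trajectories)
batch = np.zeros((len(trajectories), max_len, obs_dim), dtype=np.float32)
mask = np.zeros((len(trajectories), max_len), dtype=np.float32)

for i, traj in enumerate(trajectories):
    batch[i, :len(traj)] = traj  # zero-pad on the right
    mask[i, :len(traj)] = 1.0    # 1 for real steps, 0 for padding

# Placeholder per-step losses; a real PPO loss would be computed per timestep.
per_step_loss = np.ones((len(trajectories), max_len), dtype=np.float32)

# Average over real steps only, so short episodes aren't drowned in padding.
loss = (per_step_loss * mask).sum() / mask.sum()
```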