hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] DDPG very slow #809

Closed: C-monC closed this issue 4 years ago

C-monC commented 4 years ago

Hello,

I am using a custom environment in pybullet that resets after 500 simulation steps.

I can get through 75,000 simulation steps with PPO2 in about 30 minutes, but DDPG is roughly 15x slower by that measure: ~30 minutes for 75k steps with PPO2 vs ~30 minutes for only 5k simulation steps with DDPG. I am using a CnnPolicy with both.

Is this expected behavior for DDPG, and is it purely due to the lack of multiprocessing?
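For reference, the comparison looks roughly like this (the env id is a placeholder for my custom pybullet environment):

```python
import gym
from stable_baselines import PPO2, DDPG

# "MyPybulletEnv-v0" is a placeholder id for the custom pybullet env.
env = gym.make("MyPybulletEnv-v0")

# PPO2: ~75k steps in about 30 minutes on my machine.
PPO2("CnnPolicy", env, verbose=1).learn(total_timesteps=75000)

# DDPG with the same env and policy: only ~5k steps in the same wall-clock time.
DDPG("CnnPolicy", env, verbose=1).learn(total_timesteps=75000)
```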

Miffyli commented 4 years ago

Hard to say from this data alone, but I would go with "lack of multiprocessing". If your environment is slow to step (or slow to reset), then PPO with multiple envs is going to train faster. As an algorithm, DDPG is a bit slower (replay buffers and all that), but it should not be this much slower.
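As a rough sketch of what "multiple envs" means here (the env id is a placeholder for your custom environment):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import SubprocVecEnv

# Run 8 copies of the (placeholder) env in separate processes, so slow
# step()/reset() calls overlap instead of running back-to-back.
# Note: SubprocVecEnv needs an `if __name__ == "__main__":` guard on some platforms.
env = SubprocVecEnv([lambda: gym.make("MyPybulletEnv-v0") for _ in range(8)])

model = PPO2("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=75000)
```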

C-monC commented 4 years ago

Thanks for the quick answer.

I'll do some investigating into environment run/reset times.
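Something like this timing loop should show where the time goes (env id is a placeholder):

```python
import time
import gym
import numpy as np

# Placeholder id for the custom pybullet env.
env = gym.make("MyPybulletEnv-v0")

step_times, reset_times = [], []
obs = env.reset()
for _ in range(1000):
    start = time.perf_counter()
    obs, reward, done, info = env.step(env.action_space.sample())
    step_times.append(time.perf_counter() - start)
    if done:
        start = time.perf_counter()
        obs = env.reset()
        reset_times.append(time.perf_counter() - start)

print("mean step time:  %.4f s" % np.mean(step_times))
if reset_times:
    print("mean reset time: %.4f s" % np.mean(reset_times))
```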

I notice a similar slowness (x100) when running SAC. Is PPO2 the only algorithm with multiprocessing?

Miffyli commented 4 years ago

Actually, scratch that: DDPG is "multi-processed" with MPI (if you can run it, you have MPI installed).

While these are big jumps in training time, this is expected to an extent: DDPG and SAC store experiences in a replay buffer and run gradient updates almost every step. This slows down training in terms of steps-per-second, but generally makes the agent more sample-efficient. I suggest you keep training with DDPG/SAC as-is and judge by how the agents improve rather than by steps-per-second. Sorry for the confusion.
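If wall-clock speed matters more than sample efficiency, the update frequency is also exposed as hyperparameters, e.g. for SAC (values here are only illustrative, not a recommendation):

```python
import gym
from stable_baselines import SAC

# Placeholder id for the custom pybullet env.
env = gym.make("MyPybulletEnv-v0")

# Illustrative values only: collect a batch of env steps between update
# phases instead of doing one gradient update per environment step.
model = SAC(
    "CnnPolicy",
    env,
    train_freq=64,       # collect 64 env steps between update phases
    gradient_steps=64,   # then do 64 gradient updates in a row
    batch_size=64,
    buffer_size=50000,
    verbose=1,
)
model.learn(total_timesteps=75000)
```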

C-monC commented 4 years ago

Okay cool, thanks. I'm leaving it running now and that seems to be the case. SAC has already passed the asymptotic performance of PPO2 in this instance.

araffin commented 4 years ago

Off-policy algorithms for continuous actions (SAC/DDPG/TD3) are slow compared to PPO/A2C, and not only because they are not multiprocessed: their gradient updates also take more time. However, as mentioned by @Miffyli, with the correct hyperparameters (cf. the rl zoo), they usually outperform their on-policy counterparts (A2C/TRPO/ACKTR/PPO, ...) in terms of sample efficiency (and also final performance).
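As a back-of-the-envelope illustration of the extra gradient work (assuming the default hyperparameters and a single env; exact numbers depend on your settings):

```python
# Rough count of gradient updates over 75k env steps with one environment.
total_steps = 75_000

# PPO2 defaults: n_steps=128, noptepochs=4, nminibatches=4
# -> 16 minibatch updates per rollout of 128 steps.
ppo2_updates = (total_steps // 128) * 4 * 4   # ~9,360 updates

# SAC defaults: train_freq=1, gradient_steps=1
# -> roughly one gradient update per environment step.
sac_updates = total_steps                      # 75,000 updates

print("PPO2:", ppo2_updates, "SAC:", sac_updates)
```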

C-monC commented 4 years ago

Apologies for the additional questions, you both are extremely knowledgeable.

Is the gradient update slower because of the separation between the behavior policy and the update policy, or is it more complicated than that?

My basic understanding (from Sutton) is that on-policy algorithms (e.g. SARSA) are less likely to put themselves in positions where large negative rewards are possible, and therefore may settle on less optimal solutions, while off-policy algorithms (e.g. Q-learning) tend to try to take more optimal paths. Is this where the performance gains come from?
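For concreteness, the two tabular update targets I have in mind (a toy sketch following Sutton & Barto, nothing from stable-baselines):

```python
from collections import defaultdict

# Toy tabular setup to contrast the two TD targets.
Q = defaultdict(float)            # Q[(state, action)] -> value estimate
actions = [0, 1]
gamma, alpha = 0.99, 0.1

state, action, reward, next_state, next_action = "s0", 0, -1.0, "s1", 1

# SARSA (on-policy): bootstrap from the action the behavior policy actually takes next.
sarsa_target = reward + gamma * Q[(next_state, next_action)]

# Q-learning (off-policy): bootstrap from the greedy action, whatever the policy does next.
q_learning_target = reward + gamma * max(Q[(next_state, a)] for a in actions)

# Same update form either way; only the target differs.
Q[(state, action)] += alpha * (sarsa_target - Q[(state, action)])
```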

In your opinion, do these concepts carry over to more complex on-policy and off-policy algorithms? Or are they valid at all, lol.

araffin commented 4 years ago

Hmm, we are talking about completely different things here; there are links in the docs for learning more about the algorithms in Stable Baselines.