hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

steps_per_epoch in DDPG. #776

Closed · blurLake closed this issue 4 years ago

blurLake commented 4 years ago

Hi, I saw in OpenAI Spinning Up

spinup.ddpg_tf1(..., steps_per_epoch=4000, epochs=100, ...)

which specifies the number of steps in each episode/epoch. Is there a similar setting in stable_baselines? Thanks!

Miffyli commented 4 years ago

Similar to #352, what is the definition of "epochs" here?

blurLake commented 4 years ago

Thank you for your reply! By one episode I meant a sequence of states, actions and rewards that ends with a terminal state. I just wonder if we can set the length of this sequence in the DDPG algorithm to something like 20, meaning the agent can only interact with the environment for 20 steps; we then reset the environment after those 20 steps, and so on.

Miffyli commented 4 years ago

I am still rather uncertain what it is you want to achieve, exactly. The naming of the DDPG parameters can be a bit vague: nb_rollout_steps is the number of steps we take in the environment before we do nb_train_steps updates to the networks, followed by nb_eval_steps of evaluating the agent.
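For reference, a minimal sketch of where these arguments go (the keyword names and values below are assumed from the stable-baselines v2 DDPG signature, and Pendulum-v0 is just a placeholder environment):

```python
import gym

from stable_baselines import DDPG

env = gym.make("Pendulum-v0")

# Assumed keyword arguments of the stable-baselines DDPG constructor:
model = DDPG(
    "MlpPolicy",
    env,
    nb_rollout_steps=100,  # env steps collected before each training phase
    nb_train_steps=50,     # gradient updates performed after each rollout
    nb_eval_steps=100,     # evaluation steps (only used if an eval_env is passed)
    verbose=1,
)
model.learn(total_timesteps=10000)
```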

blurLake commented 4 years ago

Thanks for helping me understand the parameters. It is getting closer. I attached a screenshot of the DDPG algorithm from the original paper (https://arxiv.org/pdf/1509.02971.pdf). [screenshot of the DDPG pseudocode] What I am seeking is how to set "T" in the pseudocode. Thanks!

m-rph commented 4 years ago

T usually, and in this case, signifies the end of the episode. The action selection, storing, network optimisation and target update occur once per environment step, and when the episode has finished, the noise and the environment are reset. This is done here:

https://github.com/hill-a/stable-baselines/blob/950c2a5bf95a9fa908be26fd5db11aa60cfa2b2a/stable_baselines/ddpg/ddpg.py#L831-L847

and here:

https://github.com/hill-a/stable-baselines/blob/950c2a5bf95a9fa908be26fd5db11aa60cfa2b2a/stable_baselines/ddpg/ddpg.py#L934-L951
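
To spell out the structure, here is a schematic of that step-based loop (not the library's actual code; the storage and training calls are left as comments, and Pendulum-v0 is just a placeholder):

```python
import gym

# Schematic of the step-based loop around the linked code: act, store and
# train once per environment step, and reset only when done=True.
env = gym.make("Pendulum-v0")
obs = env.reset()

for step in range(1000):
    action = env.action_space.sample()  # stands in for policy(obs) + action noise
    new_obs, reward, done, info = env.step(action)
    # replay_buffer.add(obs, action, reward, new_obs, done)   # storing
    # train_step()                                            # optimisation + target update
    obs = new_obs
    if done:
        # the episode ends whenever the environment reports done=True;
        # this is where the action noise and the environment are reset
        obs = env.reset()
```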

blurLake commented 4 years ago

Thanks for the reply! Exactly what I was asking for! If I understand correctly, DDPG in stable_baselines can only end an episode when done is True, which in some cases means the reward has reached its maximum or the policy is finely tuned. I feel this is slightly different from the original algorithm, which can terminate the episode after a fixed number of steps, T, regardless of the reward or the policy.

In particular, for some complex environments it might take a really long time until done is True. Is there a way to predefine the length of episodes in my own script (without changing stable-baselines/stable_baselines/ddpg/ddpg.py)?

Looking forward to the comments!

araffin commented 4 years ago

The done signal just marks the end of an episode. Usually (e.g. Pendulum-v0 or the PyBullet envs) the environment has a time limit and will trigger done=True after that limit. However, if you do so, you need to add a time feature in order not to break the Markov property. The current algorithms in stable-baselines are step-based (instead of episode-based), so they explore for n steps (this is called a rollout) and then update the policy parameters (using one or several gradient steps). I recommend reading the SAC or TD3 (the successor of DDPG) code, which is clearer than the original DDPG code.
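
For example, a minimal sketch of the time-limit approach from a user script (the environment, the 20-step limit and the choice of TD3 are only placeholders; adding the time feature mentioned above, e.g. via a TimeFeatureWrapper, is omitted here):

```python
import gym
from gym.wrappers import TimeLimit

from stable_baselines import TD3

# Cap the episode length from the user script, without touching ddpg.py:
# TimeLimit forces done=True after max_episode_steps environment steps.
# .unwrapped removes the default 200-step limit that gym.make already adds
# for Pendulum-v0.
env = TimeLimit(gym.make("Pendulum-v0").unwrapped, max_episode_steps=20)

model = TD3("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)
```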

blurLake commented 4 years ago

Alright, thanks a lot @Solliet @Miffyli @araffin . I will try with TD3 and SAC.