Similar to #352, what is the definition of "epochs" here?
Thank you for your reply! By one episode I meant a sequence of states, actions, and rewards that ends with a terminal state. I just wonder if we can set the length of this sequence in the DDPG algorithm to something like 20, meaning the agent can only interact with the environment for 20 steps. We then reset the environment after those 20 steps and repeat.
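In other words, something like this rough sketch, with a random policy standing in for the agent and `max_steps` chosen as 20 just for illustration:

```python
# Rough sketch of the interaction pattern I mean: interact for a fixed
# number of steps, reset the environment, and repeat.
import gym

env = gym.make("Pendulum-v0")
max_steps = 20  # illustrative episode length

for episode in range(5):
    obs = env.reset()
    for _ in range(max_steps):
        action = env.action_space.sample()  # stand-in for the agent's policy
        obs, reward, done, info = env.step(action)
        if done:
            break
    # the environment is reset again at the start of the next loop iteration
```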
I am still rather uncertain what it is you want to achieve, exactly. The naming of the DDPG parameters can be a bit vague: nb_rollout_steps is how many steps we take in the environment before we do nb_train_steps updates to the network, followed by nb_evaluation_steps of evaluating the agent.
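For example, a minimal sketch (parameter names follow the stable-baselines DDPG signature; check the docstring of your installed version for the exact names and defaults):

```python
import gym
from stable_baselines import DDPG
from stable_baselines.ddpg.policies import MlpPolicy

env = gym.make("Pendulum-v0")
model = DDPG(
    MlpPolicy,
    env,
    nb_rollout_steps=100,  # environment steps collected before each training phase
    nb_train_steps=50,     # gradient updates performed after each rollout
    verbose=1,
)
model.learn(total_timesteps=20000)
```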
Thanks for helping me understand the parameters; it is getting closer. I attached a screenshot of the DDPG algorithm from the original paper (https://arxiv.org/pdf/1509.02971.pdf). What I am seeking is how to set "T" in the pseudocode. Thanks!
T usually, and in this case, signifies the end of the episode. So the action selection, storing, network optimisation, and target update occur once per environment step. When the episode has finished, the noise and the environment are reset. This is done here:
and here:
Thanks for the reply! Exactly what I am asking for! If I understand correctly, DDPG in stable_baselines can only end the episode when done is True, which in some cases means the reward has reached its maximum or the policy is finely tuned. I feel this is slightly different from the original algorithm, which can terminate the episode after a fixed number of steps, T, regardless of the reward or policy.
In particular, for some complex environments it might take a really long time until "done" is True. Is there a way to predefine the episode length in my own script (without changing stable-baselines/stable_baselines/ddpg/ddpg.py)?
Looking forward to your comments!
The done signal is just the end of an episode. Usually (e.g. Pendulum-v0 or the PyBullet envs), the episode has a time limit and will trigger done=True after that limit. However, if you do so, you need to add a time feature so as not to break the Markov property.
Current algorithms in stable-baselines are step-based (instead of episode-based), so they will explore for n steps (this is called a rollout) and then update the policy parameters (using one or several gradient steps).
I recommend reading the SAC or TD3 (the successor of DDPG) code, which is clearer than the original DDPG code.
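For instance, a sketch with TD3 (train_freq and gradient_steps follow the stable-baselines TD3 signature; check the docstring of your version, and in practice you would also add action noise for exploration):

```python
import gym
from stable_baselines import TD3
from stable_baselines.td3.policies import MlpPolicy

env = gym.make("Pendulum-v0")
model = TD3(
    MlpPolicy,
    env,
    train_freq=100,      # environment steps collected per rollout
    gradient_steps=100,  # gradient updates performed after each rollout
    verbose=1,
)
model.learn(total_timesteps=20000)
```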
Alright, thanks a lot @Solliet @Miffyli @araffin. I will try TD3 and SAC.
Hi, I saw a setting in OpenAI Spinning Up which specifies the number of steps in each episode/epoch. Is there a similar setting in stable_baselines? Thanks!