wenjunli-0 opened this issue 3 years ago
The exact answer depends on the algorithm you use, but at least with DQN the code re-creates the replay buffer on every call to `learn`. However, in stable-baselines3 the buffer is not re-created, so calling `learn` again would use the samples from the previous `learn` call as well.
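The difference can be sketched with a toy replay buffer (plain Python, not the actual implementation in either library): re-creating the buffer on each `learn` call discards old transitions, while keeping one persistent buffer accumulates them.

```python
from collections import deque

def learn(buffer, new_transitions):
    # Toy stand-in for one `learn` call: collect transitions into the buffer.
    buffer.extend(new_transitions)
    return buffer

# stable-baselines style: the buffer is re-created on every call,
# so the previous call's samples are lost.
buf = learn(deque(maxlen=1000), range(100))
buf = learn(deque(maxlen=1000), range(100, 200))  # fresh buffer each time
print(len(buf))  # 100 -- only the latest call's samples

# stable-baselines3 style: one persistent buffer, samples accumulate.
buf = deque(maxlen=1000)
learn(buf, range(100))
learn(buf, range(100, 200))
print(len(buf))  # 200 -- both calls' samples are available for updates
```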
Thanks for your swift response. I am using TRPO and PPO. So, you mean stable-baselines3 would be more suitable for this problem (because stable-baselines3 will collect previous samples and current samples in buffer), right?
I would recommend using SB3 in any case (unless you really need TRPO), as it is more up-to-date and is actively supported/maintained :)
But: if you are using TRPO/PPO, then the answer to your original question is "no". These algorithms use a rollout buffer to collect samples, which are then discarded after they have been used to update the policy, so no samples are retained for a longer time (this is a "feature" of these algorithms).
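A minimal sketch of why on-policy algorithms behave this way (toy code, not SB3's actual `RolloutBuffer`): the rollout buffer is filled with fresh samples, used for one policy update, and then emptied, so nothing carries over to the next iteration.

```python
def train_on_policy(n_iterations, rollout_size):
    """Toy PPO/TRPO-style loop: the rollout buffer never outlives one update."""
    rollout_buffer = []
    for i in range(n_iterations):
        # 1. Collect fresh samples with the current policy.
        rollout_buffer.extend(f"sample_{i}_{t}" for t in range(rollout_size))
        # 2. Update the policy using only this rollout.
        assert len(rollout_buffer) == rollout_size  # never more than one rollout
        # 3. Discard the samples -- they are stale for the updated policy.
        rollout_buffer.clear()
    return rollout_buffer

print(len(train_on_policy(5, 64)))  # 0 -- no samples retained across updates
```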
Okay, I will stick to SB3 in my later experiments. There are A2C, DDPG, DQN, HER, PPO, SAC, TD3 in SB3, could you please point out the algorithms that support this continuous training feature for me. I am not that familiar with some of the algorithms, so your explicit answer would be a great help for me.
I think any algorithm with replay buffer should work like this, so: DDPG, DQN, SAC and TD3.
I am using stable-baselines and I want to train an agent with varying environments, i.e. the environment hyper-parameter is adjusted every 1000 timesteps.

Describe the bug I want to know, if I resume training this way, whether the previous interaction experience will be automatically used in the current training. As i increases, will the model have access to a larger experience space in the buffer?

If not, could you please let me know how I can do this with stable-baselines? Thanks.
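Following the answers above, one way to sketch this setup is a loop that adjusts the environment parameter, then resumes training on the same model so the persistent replay buffer keeps growing. The code below is a toy stand-in (the class and its methods are invented for illustration, not SB3 API); with an actual off-policy SB3 model the analogous pattern would be to update the environment and call `model.learn(total_timesteps=1000, reset_num_timesteps=False)` repeatedly.

```python
from collections import deque

class ToyOffPolicyModel:
    """Toy stand-in for an off-policy model (DQN/SAC/TD3/DDPG style):
    one replay buffer that persists across repeated `learn` calls."""
    def __init__(self, buffer_size=100_000):
        self.replay_buffer = deque(maxlen=buffer_size)
        self.env_param = None

    def set_env_param(self, value):
        # Stand-in for adjusting the environment hyper-parameter.
        self.env_param = value

    def learn(self, total_timesteps):
        # Collect transitions from the current environment into the
        # persistent buffer; a real model would also do gradient updates here.
        for t in range(total_timesteps):
            self.replay_buffer.append((self.env_param, t))

model = ToyOffPolicyModel()
for i in range(5):
    model.set_env_param(i)   # vary the environment every segment
    model.learn(1000)        # resume training; the buffer is NOT reset
    print(i, len(model.replay_buffer))  # grows: 1000, 2000, ..., 5000
```

As i increases, the buffer contains transitions from every environment variant seen so far, so updates can sample from the whole accumulated experience (up to the buffer's maximum size).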