hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

Resume Training with Previous Experience (state-action-state')? #1134

Open wenjunli-0 opened 3 years ago

wenjunli-0 commented 3 years ago

I am using Stable Baselines and I want to train an agent with varying environments, i.e. an environment hyper-parameter is adjusted every 1000 timesteps:

for i in range(100):
    a = i * 2
    env = CustomizedEnv(parameter=a)  # adjust the environment hyper-parameter
    model.set_env(env)  # hand the new env to the existing model; learn() resets it

    model.learn(total_timesteps=1000, reset_num_timesteps=False)
    model.save(save_dir + 'timestep_{}'.format(i))

I want to know whether, if I resume training this way, the previous interaction experience will automatically be used in the current training. As i increases, will the model have access to a larger experience space in the buffer?

If not, could you please let me know how I can do this with Stable Baselines? Thanks.

Miffyli commented 3 years ago

The exact answer depends on the algorithm you use, but at least with DQN the code re-creates the replay buffer on every call to learn.

However in stable-baselines3 the buffer is not re-created, so calling learn again would use the samples from the previous learn call as well.
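
A quick way to check this is to inspect the replay buffer between calls. A minimal sketch, assuming the SB3 API (DQN and CartPole-v1 are just illustrative choices):

from stable_baselines3 import DQN

# The replay buffer is created once with the model and reused across learn() calls.
model = DQN("MlpPolicy", "CartPole-v1", buffer_size=50_000, learning_starts=100)

model.learn(total_timesteps=1_000)
print(model.replay_buffer.size())   # transitions collected so far

# The second call keeps the transitions from the first call in the buffer.
model.learn(total_timesteps=1_000, reset_num_timesteps=False)
print(model.replay_buffer.size())   # roughly twice as large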

wenjunli-0 commented 3 years ago

Thanks for your swift response. I am using TRPO and PPO. So you mean stable-baselines3 would be more suitable for this problem (because it keeps both previous and current samples in the buffer), right?

Miffyli commented 3 years ago

I would recommend using SB3 in any case (unless you really need TRPO), as it is more up-to-date and is actively supported/maintained :)

But: if you are using TRPO/PPO, then the answer to your original question is "no". These algorithms use a rollout buffer to collect samples, which are then discarded after they have been used to update the policy, so no samples are retained for a longer time (this is a "feature" of these algorithms).
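
As a small illustration (a sketch against the SB3 API; PPO and CartPole-v1 are arbitrary choices), the rollout buffer has a fixed size and is refilled from scratch before every policy update:

from stable_baselines3 import PPO

model = PPO("MlpPolicy", "CartPole-v1", n_steps=128)
model.learn(total_timesteps=512)

# The rollout buffer only ever holds the latest n_steps * n_envs transitions,
# no matter how long training runs; nothing older is kept for later updates.
print(model.rollout_buffer.buffer_size)   # 128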

wenjunli-0 commented 3 years ago

Okay, I will stick to SB3 in my later experiments. SB3 includes A2C, DDPG, DQN, HER, PPO, SAC, and TD3; could you please point out which of these algorithms support this continual-training feature? I am not that familiar with some of them, so an explicit answer would be a great help.

Miffyli commented 3 years ago

I think any algorithm with a replay buffer should work like this, so: DDPG, DQN, SAC and TD3.
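
To make this concrete, here is a rough sketch of the varying-environment loop from the first post with an off-policy algorithm. SAC is an arbitrary choice (it assumes continuous actions; use e.g. DQN for discrete ones), and CustomizedEnv / save_dir are the placeholders from the question:

from stable_baselines3 import SAC

# CustomizedEnv and save_dir as in the original question (placeholders here)
model = SAC("MlpPolicy", CustomizedEnv(parameter=0))

for i in range(100):
    a = i * 2
    env = CustomizedEnv(parameter=a)
    model.set_env(env)  # swap in the new environment; the replay buffer is kept

    # transitions from earlier iterations stay available for the off-policy updates
    model.learn(total_timesteps=1000, reset_num_timesteps=False)
    model.save(save_dir + 'timestep_{}'.format(i))
    model.save_replay_buffer(save_dir + 'buffer_{}'.format(i))  # optional: persist the buffer too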

rambo1111 commented 8 months ago

https://github.com/hill-a/stable-baselines/issues/1192