hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

Retraining model after loading not working while following #30 #950

Closed aadnesd closed 3 years ago

aadnesd commented 4 years ago

Describe the bug My code is run like this, which to me looks like the same method as in #30: [image: training code screenshot]
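For reference (the code screenshot itself is not reproduced here), the load-and-continue pattern from #30 typically looks like the sketch below; the algorithm, environment, and file names are placeholders, not the user's actual code:

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# Placeholder environment and file names, for illustration only.
env = DummyVecEnv([lambda: gym.make("CartPole-v1")])

# Load the previously saved model and attach the environment again.
model = PPO2.load("trained_model", env=env)

# Continue training; reset_num_timesteps=False keeps the timestep counter
# (and hence the TensorBoard x-axis) running instead of restarting at zero.
model.learn(total_timesteps=100000, reset_num_timesteps=False)
model.save("trained_model_continued")
```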

However, when looking at TensorBoard, the retrained model doesn't reach the same value for episode rewards until 100k timesteps, and for discounted reward it still hasn't (currently at 1.2 million). This keeps happening in my training, so I'm wondering if you can see what I'm doing wrong.

[image: TensorBoard reward curves; the orange one is the retrained model]


Miffyli commented 4 years ago

The code seems OK. A couple of ideas to check:

aadnesd commented 4 years ago

The environment has random starting points, targets, and hazards. Should I still check the first idea?

Miffyli commented 4 years ago

Yes, please check the first point: whether the performance matches after saving and loading into a new environment.

aadnesd commented 4 years ago

I don't think I quite understand what you mean. This first one is the result from testing the model straight after training: [image]. And this is after loading the same model and not doing any training, just testing it: [image]. And this is a third time: [image]. But because of the randomness, the environments the model is tested on can be more difficult from one run to another.

Miffyli commented 4 years ago

You are doing it correctly. I am uncertain what these numbers mean, i.e. what counts as a good reward. For simplicity, compare the sum of rewards per episode. You need to run many episodes to get a reliable average; e.g. 100 episodes should be a good amount, at least for a start.
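One way to do that comparison with stable-baselines' built-in evaluation helper, assuming a hypothetical saved PPO2 model and Gym environment:

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.evaluation import evaluate_policy

# Hypothetical names: adapt the algorithm, file, and environment to your setup.
env = gym.make("CartPole-v1")
model = PPO2.load("trained_model")

# Average the per-episode reward sums over many episodes (e.g. 100) to
# reduce the effect of random start/target/hazard positions.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=100)
print("mean episode reward: {:.2f} +/- {:.2f}".format(mean_reward, std_reward))
```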

mpgussert commented 3 years ago

Is it possible this is yet another example of checkpoint loading failure? The data saved to the checkpoint files created via save or the checkpoint callback contains only model weights. SB3 is missing significant data from its checkpoint system (optimizer state, replay buffer, learning rate schedule, etc.). You currently cannot properly continue training without manually saving these, and what to save requires specific knowledge about the optimizer and the algorithm being run :/

araffin commented 3 years ago

SB3 is missing significant data from its checkpoint system (optimizer state, replay buffer, learning rate schedule, etc.).

What makes you think that? The optimizer state and learning rate schedules are saved by default (though it can be tricky for schedules, see https://github.com/HumanCompatibleAI/imitation/issues/262), and the replay buffer can be saved separately (cf. the docs; this is also included in the RL Zoo: https://github.com/DLR-RM/rl-baselines3-zoo).

You currently cannot properly continue training without manually saving these, and what to save requires specific knowledge about the optimizer and the algorithm being run :/

You can (except for the schedule, which will be reset; but that's expected, as you cannot know in advance whether the user will call learn() several times, see the linked discussion), and the RL Zoo allows you to do that easily.
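For off-policy algorithms in SB3, a minimal sketch of saving and restoring the replay buffer separately (file names are hypothetical):

```python
import gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v0")
model = SAC("MlpPolicy", env)
model.learn(total_timesteps=5000)

# Policy, value networks, and optimizer state go into the regular .zip;
# the replay buffer is saved to its own file.
model.save("sac_pendulum")
model.save_replay_buffer("sac_pendulum_replay_buffer")

# Later: reload both before continuing training.
model = SAC.load("sac_pendulum", env=env)
model.load_replay_buffer("sac_pendulum_replay_buffer")
model.learn(total_timesteps=5000, reset_num_timesteps=False)
```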

mpgussert commented 3 years ago

I don't have concrete evidence at hand, but I think this because, no matter what I seem to do, training always collapses after loading a model, almost exactly as described in this thread. I assumed that this was just a consequence of RL in general, but in combing through the issues for both SB2 and SB3 I see that this is a common, recurring problem. I know that in SB2 things like optimizer parameters and replay buffers were not saved automatically.

It should be easy enough to test though, right? If you fix the seed and train for X + Y iterations and note the result, then reset everything and train for X iterations, load the checkpoint and train for Y iterations, you should get exactly the same result up to machine error (because we are using pseudo-randomness and the seed is fixed).

I am going to try and write this test. I will post my results here :)
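A rough sketch of such a test, assuming SB3's SAC on Pendulum-v0 (the timestep split is arbitrary; exact equality would also require restoring the random-generator state, as discussed further down):

```python
import gym
from stable_baselines3 import SAC
from stable_baselines3.common.evaluation import evaluate_policy

X, Y = 5000, 5000  # arbitrary split of the training budget

# Reference run: train X + Y steps in one go.
reference = SAC("MlpPolicy", gym.make("Pendulum-v0"), seed=0)
reference.learn(total_timesteps=X + Y)

# Checkpointed run: train X steps, save, reload, then train Y more steps.
# Note: the RNG state is not saved, so results will not match bit-for-bit.
checkpointed = SAC("MlpPolicy", gym.make("Pendulum-v0"), seed=0)
checkpointed.learn(total_timesteps=X)
checkpointed.save("checkpoint")
checkpointed.save_replay_buffer("checkpoint_replay_buffer")

resumed = SAC.load("checkpoint", env=gym.make("Pendulum-v0"))
resumed.load_replay_buffer("checkpoint_replay_buffer")
resumed.learn(total_timesteps=Y, reset_num_timesteps=False)

# Compare final performance over many episodes.
for name, model in [("reference", reference), ("resumed", resumed)]:
    mean_r, std_r = evaluate_policy(model, gym.make("Pendulum-v0"), n_eval_episodes=100)
    print("{}: {:.2f} +/- {:.2f}".format(name, mean_r, std_r))
```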

araffin commented 3 years ago

but I think this because, no matter what I seem to do, training always collapses after loading a model, almost exactly as described in this thread.

I personally use the saving/loading feature of SB3 to train real robots over multiple days, and I have not experienced any performance drop after reloading yet. The only thing that may go wrong is the learning rate schedule.

So I did a quick check (using the rl zoo: https://github.com/DLR-RM/rl-baselines3-zoo):

Train SAC on Pendulum for 5k steps

python train.py --algo sac --env Pendulum-v0 -n 5000 --save-replay-buffer --eval-freq 5000 --num-threads 2

The last evaluation says Eval num_timesteps=5000, episode_reward=-441.95 +/- 323.26

Continue training:

python train.py --algo sac --env Pendulum-v0 -n 5000 --save-replay-buffer --eval-freq 1000 --num-threads 2 -i logs/sac/Pendulum-v0_1/Pendulum-v0.zip

The first evaluation, after 1k steps, says: Eval num_timesteps=1000, episode_reward=-180.16 +/- 101.10 (no performance drop, even an improvement), and doing 10k steps gives similar results.

mpgussert commented 3 years ago

I've been going back through my old code and I'm realizing that a lot of the problems I have had were in SB2 and baselines, not SB3. In tinkering more with the SB3 release, I haven't had any problems continuing training.

As for the test I wanted to run: while we can set the seed, we would also need to either execute the same number of calls to the random generator before starting the second round of training, or save the generator state. However, at this point I'm confident I was just mistaken, so I don't feel the need to follow through on that test.

So my conclusion is that I'm a dummy! Sorry for the trouble and bother.

araffin commented 3 years ago

Should be fixed in SB3: https://github.com/DLR-RM/stable-baselines3/issues/43

rambo1111 commented 10 months ago

https://github.com/hill-a/stable-baselines/issues/1192