Code seems ok. Couple of ideas to check:
learn() starts with new optimizer parameters, which in the case of Adam (the default) can ruin the initial parameters if the learning rate is too high. However, it should still learn faster than learning from scratch, as it should obtain some good trajectories right from the start (based on intuition here).

The environments have random starting points, targets and hazards. Should I still check the first idea?
Yes, please check the first part: whether the performance matches after saving and loading into a new environment.
I don't think I quite understand what you mean. The first one is the result when testing the model straight after training. This one is after loading the same model and not doing any training, just testing it. And this is a third time. But because of the randomness, the environments the model is tested on can be more difficult from one run to another.
You are doing it correctly. I am uncertain what these numbers mean, i.e. what a good reward is. For simplicity, you should compare the sum of rewards over the episodes. You need to run many episodes to get good average results; e.g. 100 episodes should be a good amount, at least for a start.
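For reference, a minimal sketch of that kind of comparison (the algorithm, environment and file names here are placeholders, not the ones from this issue):

```python
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("Pendulum-v0")  # stand-in for the custom environment

# Train, then evaluate over many episodes to average out the env randomness
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)
mean_before, std_before = evaluate_policy(model, env, n_eval_episodes=100)

# Save, reload, and evaluate again without any further training
model.save("model_checkpoint")
loaded = PPO.load("model_checkpoint", env=env)
mean_after, std_after = evaluate_policy(loaded, env, n_eval_episodes=100)

print(f"before save/load: {mean_before:.1f} +/- {std_before:.1f}")
print(f"after  save/load: {mean_after:.1f} +/- {std_after:.1f}")
```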
Is it possible this is yet another example of checkpoint loading failure? The data saved to the checkpoint files created via save or the checkpoint callback only contains model weights. SB3 is missing significant data from its checkpoint system (optimizer state, replay buffer, learning rate schedule, etc.). You currently cannot properly continue training without manually saving these, and what to save requires specific knowledge about the optimizer and the algorithm being run :/
SB3 is missing significant data from its checkpoint system (optimizer state, replay buffer, learning rate schedule, etc.).
What makes you think that? Optimizer state and learning rate schedules are saved by default (but it can be tricky for the schedule, see https://github.com/HumanCompatibleAI/imitation/issues/262), and the replay buffer can be saved separately (cf. the doc; this is also included in the rl zoo: https://github.com/DLR-RM/rl-baselines3-zoo)
You currently cannot properly continue training without manually saving these, and what to save requires specific knowledge about the optimizer and the algorithm being run :/
You can (except for the schedule, which will be reset; but that's normal, as you cannot know in advance whether the user will call learn() several times or not, see the discussion), and the rl zoo allows you to do that easily.
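For off-policy algorithms, saving the replay buffer is a separate call; roughly like this (a sketch with SAC on Pendulum, file names are illustrative):

```python
import gym
from stable_baselines3 import SAC

env = gym.make("Pendulum-v0")
model = SAC("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=5_000)

# The .zip contains weights and optimizer state; the buffer is saved separately
model.save("sac_pendulum")
model.save_replay_buffer("sac_pendulum_buffer")

# Later: reload both before continuing training
model = SAC.load("sac_pendulum", env=env)
model.load_replay_buffer("sac_pendulum_buffer")
model.learn(total_timesteps=5_000, reset_num_timesteps=False)
```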
I don't have concrete evidence on me, but I think this is because, no matter what I seem to do, training always collapses after loading a model, almost exactly as described in this thread. I assumed that this was just a consequence of RL in general, but in combing through the issues for both SB2 and SB3 I see that this is a common, recurring problem. I know that in SB2 things like optimizer parameters and replay buffers were not saved automatically.
It should be easy enough to test though, right? If you fix the seed and train for X + Y iterations and note the result, then reset everything, train for X iterations, then load the checkpoint and train for Y iterations, you should get the exact same result up to machine error (because we are using pseudo-randomness and the seed is fixed).
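A rough sketch of how such a test could be structured (an assumption on my part, not code from this thread; as noted later in the discussion, the RNG state is not part of the checkpoint, so exact equality may not hold in practice):

```python
import gym
import torch
from stable_baselines3 import SAC

SEED = 0
X, Y = 2_000, 2_000

def make_model():
    env = gym.make("Pendulum-v0")
    env.seed(SEED)
    return SAC("MlpPolicy", env, seed=SEED, verbose=0)

# Run 1: train for X + Y steps in one go
model_a = make_model()
model_a.learn(total_timesteps=X + Y)

# Run 2: train for X steps, checkpoint, reload, train for Y more
model_b = make_model()
model_b.learn(total_timesteps=X)
model_b.save("checkpoint")
model_b.save_replay_buffer("checkpoint_buffer")

# Note: the fresh env's RNG state differs from run 1 at this point,
# which is exactly the caveat raised later in the thread
env = gym.make("Pendulum-v0")
model_b = SAC.load("checkpoint", env=env)
model_b.load_replay_buffer("checkpoint_buffer")
model_b.learn(total_timesteps=Y, reset_num_timesteps=False)

# Compare the final policy parameters of the two runs
params_a = model_a.policy.state_dict()
params_b = model_b.policy.state_dict()
identical = all(torch.allclose(params_a[k], params_b[k]) for k in params_a)
print("identical up to numerical error:", identical)
```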
I am going to try and write this test. I will post my results here :)
...but I think this is because, no matter what I seem to do, training always collapses after loading a model, almost exactly as described in this thread.
I'm personally using the saving/loading feature of SB3 to train real robots over multiple days, and I have not experienced any performance drop after reloading yet. The only thing that may go wrong is the learning rate schedule.
So I did a quick check (using the rl zoo: https://github.com/DLR-RM/rl-baselines3-zoo):
Train SAC on Pendulum for 5k steps
python train.py --algo sac --env Pendulum-v0 -n 5000 --save-replay-buffer --eval-freq 5000 --num-threads 2
The last evaluation says: Eval num_timesteps=5000, episode_reward=-441.95 +/- 323.26
Continue training:
python train.py --algo sac --env Pendulum-v0 -n 5000 --save-replay-buffer --eval-freq 1000 --num-threads 2 -i logs/sac/Pendulum-v0_1/Pendulum-v0.zip
The first evaluation, after 1k steps says: Eval num_timesteps=1000, episode_reward=-180.16 +/- 101.10
(no performance drop, even an improvement)
(and doing 10k steps gives you similar results)
I've been going back through my old code and I'm realizing that a lot of the problems I have had were in SB2 and baselines, not SB3. In tinkering more with the SB3 release, I haven't had any problems continuing training.
As for that test I want to run: while we can set the seed, we would need to either execute the same number of calls to the generator before starting the second round of training, or save the generator state. However, at this point I'm confident I was just mistaken, so I don't feel the need to follow through on that test.
So my conclusion is I'm a dummy! Sorry for the trouble and bother.
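(As a side note on the generator-state point above: the RNG states could in principle be saved by hand next to the checkpoint; a sketch, not something SB3 does automatically:)

```python
import pickle
import random

import numpy as np
import torch

# Save the generator states alongside the model checkpoint
rng_state = {
    "python": random.getstate(),
    "numpy": np.random.get_state(),
    "torch": torch.get_rng_state(),
}
with open("rng_state.pkl", "wb") as f:
    pickle.dump(rng_state, f)

# Restore them right before resuming training
# (the environment's own RNG would need the same treatment)
with open("rng_state.pkl", "rb") as f:
    rng_state = pickle.load(f)
random.setstate(rng_state["python"])
np.random.set_state(rng_state["numpy"])
torch.set_rng_state(rng_state["torch"])
```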
Should be fixed in SB3: https://github.com/DLR-RM/stable-baselines3/issues/43
Describe the bug
My code is run like this, which to me looks like the same method as in #30
However, when looking at TensorBoard, the retrained model doesn't reach the same value for episode rewards until 100k timesteps, and for discounted reward it still hasn't (currently at 1.2M). This keeps recurring in my training, so I'm wondering if you see what I'm doing wrong?
The orange one is the retrained model
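For reference, a hedged sketch of the general continue-training pattern discussed above (not the actual code from this report; algorithm, environment and paths are placeholders). Passing reset_num_timesteps=False together with the same tb_log_name keeps the second run's TensorBoard curves on the same timestep axis as the first run's:

```python
import gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v0")  # stand-in for the custom environment

# First training run
model = PPO("MlpPolicy", env, tensorboard_log="./tb/", verbose=0)
model.learn(total_timesteps=100_000, tb_log_name="run")
model.save("model")

# Later: reload and continue training; the timestep counter is not reset,
# so the logged curves continue where the first run stopped
model = PPO.load("model", env=env)
model.learn(total_timesteps=100_000, tb_log_name="run", reset_num_timesteps=False)
```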