rail-berkeley / rlkit

Collection of reinforcement learning algorithms
MIT License

best way to resume training from PKL #33

Open richardrl opened 5 years ago

richardrl commented 5 years ago

I would like to resume training from a given epoch. I guess I could add a line to:

def get_epoch_snapshot(self, epoch):
    data_to_save = dict(
        epoch=epoch,
        exploration_policy=self.exploration_policy,
        eval_policy=self.eval_policy,
        algorithm=self,  # <-- add this here
    )
    if self.save_environment:
        data_to_save['env'] = self.training_env
    return data_to_save  

and just pickle load the entire RLAlgorithm subclass instance to resume?

vitchyr commented 5 years ago

Yes, it's not currently implemented, but your suggestion should work. After that, just call train (and you'll probably want to change it so that you can pass in the epoch to start at). I think the annoying part will just be appending to an existing progress.csv rather than overwriting it (assuming that's what you want to do).
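
For anyone trying this, a rough resume script along those lines might look like the sketch below (the 'algorithm' snapshot key and the start_epoch argument are assumptions based on the suggestion above, not existing rlkit API):

    import pickle

    # Path to the pickle written by the snapshotter (name illustrative).
    with open('path/to/itr_snapshot.pkl', 'rb') as f:
        snapshot = pickle.load(f)

    # Only present if algorithm=self was added to get_epoch_snapshot as suggested.
    algorithm = snapshot['algorithm']
    start_epoch = snapshot['epoch'] + 1

    # Assumes train() is modified to accept the epoch to resume from.
    algorithm.train(start_epoch=start_epoch)

You would also need to point the logger at the old run directory so progress.csv is appended to rather than overwritten, as mentioned above.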

yusukeurakami commented 5 years ago

@vitchyr I have a question related to resuming training. I am trying to load a pre-trained SAC policy and re-train it. I loaded all five networks (qf1, qf2, target_qf1, target_qf2, policy) and started training, but policy_loss and qf_loss exploded (on the order of 1e+8 to 1e+10). I thought it was because I forgot to save log_alpha, so I saved/loaded it and restarted the training, but it still didn't work. Do you have any thoughts on this phenomenon? When I turn off automatic entropy tuning, I can resume with no problem.

vitchyr commented 5 years ago

Hmmm, it's a bit hard to say without knowing more. I imagine if you look at the values of alpha itself, it blows up rather quickly. There can only really be three reasons, since the alpha loss depends on three quantities:

  1. log_pi. Are you loading highly off-policy data? Perhaps the log-likelihood of some old action is extremely low given the loaded policy.
  2. self.target_entropy. Any chance the target entropy is something really wonky? For example, if you're doing discrete actions, then it should be positive.
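
For reference, the automatic entropy tuning objective has roughly the form below; this is a self-contained sketch with assumed variable names, not necessarily the exact rlkit code:

    import torch

    action_dim = 6
    target_entropy = -action_dim                    # standard heuristic for continuous actions
    log_alpha = torch.zeros(1, requires_grad=True)  # learnable log-temperature
    alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

    # Stand-in for log pi(a|s) of a batch of actions under the current policy.
    log_pi = torch.randn(128, 1)

    # The alpha loss depends on exactly these quantities: log_alpha, log_pi, target_entropy.
    alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()

    alpha = log_alpha.exp()  # weights the entropy term in the policy and Q losses

Extreme values of any of these quantities push log_alpha hard in one direction, which then feeds back into the policy and Q losses through alpha.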
richardrl commented 5 years ago

@yusukeurakami If you're using Adam, you'll also want to reload the optimizer state or set the learning rate lower.

Is the Q function output exploding?
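
Concretely, carrying the Adam state across a save/load could look like the sketch below (self-contained and illustrative; the snapshot keys and the stand-in network are not rlkit's actual structure):

    import torch
    import torch.nn as nn

    # Stand-ins for one of the networks and its optimizer.
    policy = nn.Linear(10, 6)
    policy_optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    # Save the optimizer state (Adam moment estimates, step counts) next to the weights.
    snapshot = {
        'policy_state_dict': policy.state_dict(),
        'policy_optimizer_state_dict': policy_optimizer.state_dict(),
    }
    torch.save(snapshot, 'resume_snapshot.pt')

    # On resume, restore both so Adam does not restart with zeroed moments and a reset step count.
    snapshot = torch.load('resume_snapshot.pt')
    policy.load_state_dict(snapshot['policy_state_dict'])
    policy_optimizer.load_state_dict(snapshot['policy_optimizer_state_dict'])

The same applies to the qf1/qf2 optimizers and, with automatic entropy tuning, the alpha optimizer.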

yusukeurakami commented 5 years ago

@vitchyr @richardrl Thank you for your reply. I'm now re-running the experiment to get those values.

yusukeurakami commented 5 years ago

@vitchyr @richardrl This is what the reward and alpha actually look like when I restart training. As you can see, alpha blows up immediately after I resume training, and the reward goes down and never comes back. qf1_loss and qf2_loss blow up as well.

[image] https://user-images.githubusercontent.com/22037436/63373099-2a800d80-c33c-11e9-9b09-92d06ceb13e9.png
[image] https://user-images.githubusercontent.com/22037436/63373557-fe18c100-c33c-11e9-9f5b-4eecf71901d5.png
[image] https://user-images.githubusercontent.com/22037436/63373601-11c42780-c33d-11e9-9843-db9e7a9147a6.png

  1. log_pi: I am doing domain randomization on my original environment; however, the environments used when I resume training are drawn from the same distribution as in the first training run, so the data shouldn't be that off-policy. I even saved and loaded the entire replay buffer before resuming. It looks fine for 10 updates or so, but it goes crazy later as well.

  2. self.target_entropy: My environment is continuous with 6 actuators. self.target_entropy is -6 at all times. Does that sound right?

  3. optimizer: Yes, I am using Adam, but I've already tried reloading the optimizer state. It mitigates the symptom a little, but it ends up in the same blow-up.

  4. Q-function output: As you can see in the graphs above, since qf1_loss is going crazy, I think the Q values are also exploding.

F.Y.I.

  1. I am saving the model with torch.save instead of pickle, as follows. I don't think this changes anything, but I'm reporting it just in case.

    snapshot = self._get_snapshot()
    torch.save(snapshot, save_path)  # save_path: destination file for the snapshot
  2. My other hyperparameters are as follows.

    if args.algo == 'sac':
        algorithm = "SAC"

    variant = dict(
        algorithm=algorithm,
        version="normal",
        layer_size=100,
        replay_buffer_size=int(1E6),
        algorithm_kwargs=dict(
            num_epochs=6000,
            num_eval_steps_per_epoch=512,       # 512
            num_trains_per_train_loop=1000,     # 1000
            num_expl_steps_per_train_loop=512,  # 512
            min_num_steps_before_training=512,  # 1000
            max_path_length=512,                # 512
            batch_size=128,
        ),
        trainer_kwargs=dict(
            discount=0.99,
            soft_target_tau=5e-3,
            target_update_period=1,
            policy_lr=1E-3,
            qf_lr=1E-3,
            reward_scale=0.1,
            use_automatic_entropy_tuning=True,
        ),
    )
  3. I've changed some parameters in batch_rl_algorithm.py.

            num_train_loops_per_epoch=10,
            min_num_steps_before_training=0,

It would be great if you have any advice.

vitchyr commented 5 years ago

It looks like you set use_automatic_entropy_tuning=False in the original settings, but somehow the entropy tuning is set to True when you load it. How are you resuming training? Are you sure you're using the same hyperparameters for the restarted SAC?


yusukeurakami commented 5 years ago

> It looks like you set use_automatic_entropy_tuning=False in the original settings, but somehow the entropy tuning is set to True when you load it. How are you resuming training? Are you sure you're using the same hyperparameters for the restarted SAC?

Sorry, I pasted the wrong hyperparameters... I've edited them above, but they were as follows. I am sure I didn't change my hyperparameters when I resumed training.

        reward_scale=0.1,
        use_automatic_entropy_tuning=True,

vitchyr commented 5 years ago

Hmmm, yeah, I don't know what could be different. Maybe check that the optimizers are connected to the correct parameters? I could imagine getting into trouble if you load in parameters after creating the optimizers, though that would be quite a nasty design on PyTorch's part.
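
A quick way to check that (a sanity-check sketch; policy and policy_optimizer stand in for whatever you loaded):

    import torch
    import torch.nn as nn

    # Stand-ins for a loaded network and the optimizer you intend to keep using.
    policy = nn.Linear(10, 6)
    policy_optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    # Every tensor the optimizer updates should be the *same object* as a parameter of the
    # live network. If the network object was swapped out (e.g. replaced by an unpickled
    # copy) after the optimizer was built, this check fails.
    policy_params = {id(p) for p in policy.parameters()}
    optimizer_params = {id(p) for group in policy_optimizer.param_groups for p in group['params']}
    assert optimizer_params <= policy_params, "optimizer is tracking stale parameters"

Note that module.load_state_dict copies values into the existing parameter tensors in place, so an optimizer built beforehand stays valid; the stale-parameter problem mainly arises when the network object itself is replaced.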


dwiel commented 4 years ago

@yusukeurakami you ever figure this out?

nanbaima commented 4 years ago

Does anyone know whether it is also possible to resume an env and the pre-trained dataset from that env (with the same state size), but, instead of using the previous reward, simply change the reward and continue training starting from the previous dataset (in other words: resume training, but with new rewards added)?