nv-tlabs / ASE

How to resume an interrupted training from checkpoint path? #38

Closed 3-body closed 1 year ago

3-body commented 1 year ago

My training run was interrupted because it took too long. How do I resume an interrupted training run from a checkpoint path?

xbpeng commented 1 year ago

You just need to rerun training and use --checkpoint to specify the path to the checkpoint.

3-body commented 1 year ago

Thanks a lot for the reply! I tried several times to use '--checkpoint' to point to the checkpoint from the interrupted run, 'output/Humanoid_11-10-10-43/nn/Humanoid.pth'. Instead of resuming the interrupted training, it creates a new training directory, 'output/Humanoid_16-20-24-34/nn/Humanoid.pth'.
Am I doing something wrong? I don't know why this happens.

Robokan commented 1 year ago

I also found that I am unable to continue training the low-level controller. Neither --checkpoint nor llc_checkpoint seems to load the previous training data. Is there a way to fix this? I can only train at certain times.

Robokan commented 1 year ago

If you are training the low-level controller and want to continue a previous run:

In common_agent.py, in the function 'train', just before the line self.obs = self.env_reset(), add:

self.restore('output/Humanoid_23-09-45-35/nn/Humanoid.pth')

Change the path passed to restore to the checkpoint you want to continue from. A new output file will be created, but it will be a continuation of the run you restored.
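
To illustrate where the extra line goes, here is a minimal, self-contained sketch. The class AgentSketch and its method bodies are hypothetical stand-ins, not the real code in common_agent.py; only the names train, restore, and env_reset and the ordering of the two calls come from the description above.

```python
class AgentSketch:
    """Hypothetical stand-in for the agent class in common_agent.py."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.calls = []  # records the call order, for illustration only
        self.obs = None

    def restore(self, path):
        # The real restore loads the saved state from `path`;
        # this stub only records that it was called.
        self.calls.append(("restore", path))

    def env_reset(self):
        # Stub for the environment reset that train() already performs.
        self.calls.append(("env_reset", None))
        return "initial_obs"

    def train(self):
        # Added line: reload the interrupted run before the first reset.
        self.restore(self.checkpoint_path)
        self.obs = self.env_reset()
        # ... the existing training loop would continue here ...
        return self.calls


agent = AgentSketch("output/Humanoid_23-09-45-35/nn/Humanoid.pth")
call_order = agent.train()
```

In the actual file, the path would more likely be written directly into the restore call, as in the line above, rather than passed through a constructor.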

3-body commented 1 year ago

Thanks for your help, it's working perfectly! I also found a way to continue the previous training by adding the '--resume' argument, like this: 'python ase/run.py --resume 1 --task HumanoidAMPGetup --cfg_env ase/data/cfg/humanoid_ase_sword_shield_getup.yaml --cfg_train ase/data/cfg/train/rlg/ase_humanoid.yaml --motion_file ase/data/motions/reallusion_sword_shield/dataset_reallusion_sword_shield.yaml --checkpoint output/Humanoid_11-10-10-43/nn/Humanoid.pth --headless'. However, this also creates a new TensorBoard event log.
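
If you restart training often, the same command can be assembled in a small script so that only the checkpoint path changes. This is just a sketch: the helper name build_resume_command is made up for illustration, and the flags and paths are copied from the command above, not checked against the ASE argument parser.

```python
import shlex


def build_resume_command(checkpoint):
    """Return the resume command from this thread as an argument list.

    Hypothetical helper: every flag and path below is copied from the
    command quoted above; only `checkpoint` varies.
    """
    return [
        "python", "ase/run.py",
        "--resume", "1",
        "--task", "HumanoidAMPGetup",
        "--cfg_env", "ase/data/cfg/humanoid_ase_sword_shield_getup.yaml",
        "--cfg_train", "ase/data/cfg/train/rlg/ase_humanoid.yaml",
        "--motion_file",
        "ase/data/motions/reallusion_sword_shield/dataset_reallusion_sword_shield.yaml",
        "--checkpoint", checkpoint,
        "--headless",
    ]


command = build_resume_command("output/Humanoid_11-10-10-43/nn/Humanoid.pth")
command_line = shlex.join(command)  # shell-ready string (Python 3.8+)
print(command_line)
```

To launch it from Python instead of copying the printed line, subprocess.run(command, check=True) from the repository root should work.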