You just need to rerun training and use --checkpoint to specify the path to the checkpoint.
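For example, with the getup task and a previous run's checkpoint (the paths and config files here are just illustrative, taken from this thread; adjust them to your own run), the command would look something like:

    python ase/run.py --task HumanoidAMPGetup --cfg_env ase/data/cfg/humanoid_ase_sword_shield_getup.yaml --cfg_train ase/data/cfg/train/rlg/ase_humanoid.yaml --motion_file ase/data/motions/reallusion_sword_shield/dataset_reallusion_sword_shield.yaml --checkpoint output/Humanoid_11-10-10-43/nn/Humanoid.pth --headless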
Thanks a lot for the reply!
I tried many times to use the '--checkpoint' arg to point at the interrupted run's checkpoint 'output/Humanoid_11-10-10-43/nn/Humanoid.pth',
but instead of resuming the interrupted training, it creates a new training directory 'output/Humanoid_16-20-24-34/nn/Humanoid.pth'.
Am I doing something wrong? I don't know why.
I also found that I am unable to continue training the low-level controller. --checkpoint and llc_checkpoint don't seem to load the previous training data. I'm wondering if there is a way to fix this, as I can only train at certain times.
If you are training the low-level controller and wish to continue it:
in common_agent.py, in the 'train' function, just before self.obs = self.env_reset(),
put: self.restore('output/Humanoid_23-09-45-35/nn/Humanoid.pth')
Make sure to change the path passed to restore to the checkpoint you wish to continue from. A new output directory will be created, but the run will be a continuation of the one you restored.
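For anyone reading along, here is a minimal, self-contained sketch of the pattern being suggested. This is not the actual ASE code: the Trainer class, the checkpoint keys ('model', 'optimizer'), and the path are illustrative assumptions; the point is just that the restore happens right before the first env_reset() in train().

    import torch

    class Trainer:
        def __init__(self, model, optimizer):
            self.model = model
            self.optimizer = optimizer
            self.obs = None

        def restore(self, checkpoint_path):
            # Load weights (and optimizer state, if present) saved by an earlier run.
            # The dictionary keys here are assumptions; match them to your checkpoint.
            state = torch.load(checkpoint_path, map_location='cpu')
            self.model.load_state_dict(state['model'])
            if 'optimizer' in state:
                self.optimizer.load_state_dict(state['optimizer'])

        def env_reset(self):
            # Stand-in for the real environment reset used by the training loop.
            return None

        def train(self):
            # Restore the previous run just before the first reset, as described above,
            # then let the normal training loop continue from the restored weights.
            self.restore('output/Humanoid_23-09-45-35/nn/Humanoid.pth')
            self.obs = self.env_reset()
            # ... rest of the training loop ...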
Thanks for your help, it's working perfectly! I also found a way to continue the previous training by adding the '--resume' arg, like this: 'python ase/run.py --resume 1 --task HumanoidAMPGetup --cfg_env ase/data/cfg/humanoid_ase_sword_shield_getup.yaml --cfg_train ase/data/cfg/train/rlg/ase_humanoid.yaml --motion_file ase/data/motions/reallusion_sword_shield/dataset_reallusion_sword_shield.yaml --checkpoint output/Humanoid_11-10-10-43/nn/Humanoid.pth --headless'. But it also creates new TensorBoard event logs.
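If the extra event files bother you, a general TensorBoard tip (not anything ASE-specific) is to point TensorBoard at the parent output directory, e.g. 'tensorboard --logdir output', so the original run and the resumed run show up as separate runs in the same dashboard.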
The training was interrupted because it took too long; how can I resume an interrupted training run from a checkpoint path?