sony / ai-research-code

Apache License 2.0

resuming training from checkpoint #59

Open jim79 opened 2 years ago

jim79 commented 2 years ago

Hi, how do we resume training on x-umx from a checkpoint? The --checkpoint argument (as in umx) seems to be unrecognised. Thank you.

TE-BasavarajMurali commented 2 years ago

Hi @jim79, please refer to the sample checkpoint utility functions below. You may need to consider other factors, such as the LR scheduler, which is non-trivial to restore. We will offer this support soon. Thanks.

import json
from pathlib import Path

import nnabla as nn


def load_checkpoint(checkpoint, solver):
    r"""Load the last states of the training."""
    # `checkpoint` is the path to the checkpoint.json written by save_checkpoint.
    with open(checkpoint, 'r') as file:
        info = json.load(file)
    path = Path(info['params_path'])
    print(f"Checkpoint Loading from: {str(path)}\n")

    # Restore the model parameters and the solver (optimizer) states.
    nn.load_parameters(str(path / 'model.h5'))
    solver.load_states(str(path / 'solver_states.h5'))
    return info['cur_epoch']


def save_checkpoint(path, solver, cur_epoch):
    r"""Save the current states of the training."""
    path = Path(path)
    nn.save_parameters(str(path / 'model.h5'))
    solver.save_states(str(path / 'solver_states.h5'))
    # Record the epoch and the directory so load_checkpoint can find the files.
    with open(path / 'checkpoint.json', 'w') as f:
        json.dump(
            dict(cur_epoch=cur_epoch,
                 params_path=str(path)),
            f
        )
    print(f"Checkpoint saved: {str(path)}\n")

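The LR scheduler caveat above could be handled along the following lines. This is only a minimal sketch, not the x-umx implementation: `PlateauScheduler`, `save_scheduler_state`, `load_scheduler_state`, and the `scheduler.json` file name are all hypothetical, and the reduce-on-plateau rule is an assumed stand-in for whatever scheduler the training script actually uses. The idea is that the scheduler's internal counters are stored next to `cur_epoch`, so a resumed run continues with the same learning rate it would have had without the interruption.

```python
import json
import tempfile
from pathlib import Path


class PlateauScheduler:
    """Hypothetical reduce-on-plateau LR scheduler (not the x-umx class)."""

    def __init__(self, lr, factor=0.5, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = None      # best validation loss seen so far
        self.bad_epochs = 0   # epochs since the last improvement

    def step(self, valid_loss):
        """Update the learning rate from the latest validation loss."""
        if self.best is None or valid_loss < self.best:
            self.best = valid_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

    def state_dict(self):
        return {"lr": self.lr, "best": self.best,
                "bad_epochs": self.bad_epochs}

    def load_state_dict(self, state):
        self.lr = state["lr"]
        self.best = state["best"]
        self.bad_epochs = state["bad_epochs"]


def save_scheduler_state(path, scheduler, cur_epoch):
    """Write the scheduler state and epoch to <path>/scheduler.json."""
    with open(Path(path) / "scheduler.json", "w") as f:
        json.dump(dict(cur_epoch=cur_epoch,
                       scheduler=scheduler.state_dict()), f)


def load_scheduler_state(path, scheduler):
    """Restore the scheduler in place and return the saved epoch."""
    with open(Path(path) / "scheduler.json", "r") as f:
        info = json.load(f)
    scheduler.load_state_dict(info["scheduler"])
    return info["cur_epoch"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sched = PlateauScheduler(lr=1e-3)
        for epoch, loss in enumerate([1.0, 0.9, 0.95]):
            sched.step(loss)
        save_scheduler_state(tmp, sched, cur_epoch=epoch)

        # A fresh process would rebuild the scheduler and restore it.
        resumed = PlateauScheduler(lr=1e-3)
        start_epoch = load_scheduler_state(tmp, resumed) + 1
        print(f"resume at epoch {start_epoch} with lr={resumed.lr}")
```

On resume, the restored learning rate would still need to be applied to the solver (in NNabla, `solver.set_learning_rate(...)`), and any early-stopping counters would need the same treatment.
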
jim79 commented 2 years ago

Thank you for the response.