skymanaditya1 opened 2 years ago
I tried debugging the issue quickly, and the problem seems to come from the value of the parameter `c.resume` in the `train.py` file. Even though the flag `training.resume=latest` is passed, which should set `c.resume` to `"latest"`, that is not happening: with `training.resume=latest` set, the value of `c.resume` is still `False`. For now I worked around it by explicitly setting `c.resume` to `True`, since I didn't want to dig too deeply into the configs myself.
I guess this happens because when the file `experiment_config.yaml` is read, the value of `c.resume` remains `None`. This is what the output of `cfg.training` looks like for me:
`{'outdir': '${project_release_dir}', 'data': '${dataset.path}', 'gpus': '${num_gpus}', 'cfg': 'auto', 'snap': 50, 'kimg': 25000, 'metrics': ['fvd2048_16f', 'fvd2048_128f', 'fvd2048_128f_subsample8f', 'fid50k_full'], 'aug': 'ada', 'mirror': True, 'batch_size': 8, 'resume': None, 'seed': 0, 'dry_run': False, 'cond': False, 'subset': None, 'p': None, 'target': 0.6, 'augpipe': 'bgc', 'freezed': 0, 'fp32': False, 'nhwc': False, 'nobench': False, 'allow_tf32': False, 'num_workers': 3}`
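For reference, here is a minimal, self-contained sketch of the behaviour I would expect, using plain OmegaConf rather than the project's own launch/config code (so this is an assumption about how the merge should work, not what `launch.py` actually does): merging a `training.resume=latest` dotlist override over a YAML default of `None` should read back as `"latest"`, whereas in my run the value stays `None`/`False`.

```python
# Minimal sketch with plain OmegaConf (an assumption about the expected
# Hydra/OmegaConf merge behaviour, not the project's actual config code).
from omegaconf import OmegaConf

base = OmegaConf.create({"training": {"resume": None}})        # default, as in experiment_config.yaml
override = OmegaConf.from_dotlist(["training.resume=latest"])  # the CLI flag
cfg = OmegaConf.merge(base, override)

print(cfg.training.resume)  # expected: "latest"; in my run c.resume ends up False instead
```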
Hi, I am trying to resume training a model from a pretrained experiment using the command `python src/infra/launch.py hydra.run.dir=. exp_suffix=my_experiment_name env=local dataset=ffs dataset.resolution=256 num_gpus=4 training.resume=latest`. The model understands that it needs to resume training and prints the corresponding message: "We are going to resume the training and the experiment already exists. That's why the provided config/training_cmd are discarded and the project dir is not created". However, it then attempts to recreate the output folder where all the intermediate checkpoints, inferred images, and videos are stored. And if I delete the output directory, it creates a new one but starts training from scratch, which is weird.
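For context, a quick way to check whether the experiment directory actually contains snapshots that a `latest` resume could pick up is sketched below; the directory layout and the `network-snapshot-*.pkl` naming are assumptions on my side, not something confirmed from this repo.

```python
# Hypothetical sanity check: are there any snapshots on disk that a "latest"
# resume could pick up? The experiment path and snapshot naming are assumptions.
import glob
import os

exp_dir = "experiments/my_experiment_name"  # assumed location of the experiment output
pattern = os.path.join(exp_dir, "**", "network-snapshot-*.pkl")
snapshots = sorted(glob.glob(pattern, recursive=True))

if snapshots:
    print("latest snapshot:", snapshots[-1])
else:
    print("no snapshots found - a resume here would effectively start from scratch")
```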
Can you help me understand what I am doing wrong? Below is the full stack trace.