skymanaditya1 opened 2 years ago
I tried debugging the issue quickly, and the problem seems to come from the value of the parameter `c.resume` in the `train.py` file. Even though the flag `training.resume=latest` is passed, which should set `c.resume` to `"latest"`, that is not happening: with `training.resume=latest` set, the value of `c.resume` is still `False`. For now I worked around it by explicitly setting `c.resume` to `True`, since I didn't want to dig too deeply into the configs myself.
I guess this happens because when the file `experiment_config.yaml` is read, the value of `c.resume` remains `None`. This is what the output of `cfg.training` looks like for me:
`{'outdir': '${project_release_dir}', 'data': '${dataset.path}', 'gpus': '${num_gpus}', 'cfg': 'auto', 'snap': 50, 'kimg': 25000, 'metrics': ['fvd2048_16f', 'fvd2048_128f', 'fvd2048_128f_subsample8f', 'fid50k_full'], 'aug': 'ada', 'mirror': True, 'batch_size': 8, 'resume': None, 'seed': 0, 'dry_run': False, 'cond': False, 'subset': None, 'p': None, 'target': 0.6, 'augpipe': 'bgc', 'freezed': 0, 'fp32': False, 'nhwc': False, 'nobench': False, 'allow_tf32': False, 'num_workers': 3}`
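For reference, here is a minimal, self-contained sketch of the behaviour I would expect, using plain OmegaConf rather than the project's own launch/config code (so this is an assumption about how the merge should work, not what `launch.py` actually does): merging a `training.resume=latest` dotlist override over a YAML default of `None` should read back as `"latest"`, whereas in my run the value stays `None`/`False`.

```python
# Minimal sketch with plain OmegaConf (an assumption about the expected
# Hydra/OmegaConf merge behaviour, not the project's actual config code).
from omegaconf import OmegaConf

base = OmegaConf.create({"training": {"resume": None}})        # default, as in experiment_config.yaml
override = OmegaConf.from_dotlist(["training.resume=latest"])  # the CLI flag
cfg = OmegaConf.merge(base, override)

print(cfg.training.resume)  # expected: "latest"; in my run c.resume ends up False instead
```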
Hi, I am trying to resume training a model from a pretrained experiment using the command `python src/infra/launch.py hydra.run.dir=. exp_suffix=my_experiment_name env=local dataset=ffs dataset.resolution=256 num_gpus=4 training.resume=latest`. The model understands that it needs to resume training and prints the corresponding message: "We are going to resume the training and the experiment already exists. That's why the provided config/training_cmd are discarded and the project dir is not created". However, it then attempts to recreate the output folder where all the intermediate checkpoints, inferred images, and videos are stored. And if I delete the output directory, it creates a new one but starts training from scratch, which is weird.
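For context, a quick way to check whether the experiment directory actually contains snapshots that a `latest` resume could pick up is sketched below; the directory layout and the `network-snapshot-*.pkl` naming are assumptions on my side, not something confirmed from this repo.

```python
# Hypothetical sanity check: are there any snapshots on disk that a "latest"
# resume could pick up? The experiment path and snapshot naming are assumptions.
import glob
import os

exp_dir = "experiments/my_experiment_name"  # assumed location of the experiment output
pattern = os.path.join(exp_dir, "**", "network-snapshot-*.pkl")
snapshots = sorted(glob.glob(pattern, recursive=True))

if snapshots:
    print("latest snapshot:", snapshots[-1])
else:
    print("no snapshots found - a resume here would effectively start from scratch")
```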
Can you help me understand what I am doing wrong? Below is the full stack trace.