Error encountered while running multi-gpu training

skymanaditya1 commented 2 years ago

Hi, I am running stylegan_v on a custom dataset and everything seems to work fine until an error related to "leaked semaphores" is encountered.

The data is loaded correctly, the model is initialized, and training even starts printing the intermediate losses and accuracy. After a few minutes of training, I get this error: "Exception: process 0 terminated with signal SIGKILL".

Please find the full stack trace below:

Setting up augmentation...
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...

tick 0 kimg 0.0 time 52s sec/tick 1.5 sec/kimg 62.98 maintenance 51.0 cpumem 2.63 gpumem 9.00 augment 0.000
Traceback (most recent call last):
  File "/ssd_scratch/cvit/aditya1/baselines/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 451, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/ssd_scratch/cvit/aditya1/baselines/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 446, in main
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "/ssd_scratch/cvit/aditya1/baselines/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/ssd_scratch/cvit/aditya1/baselines/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/ssd_scratch/cvit/aditya1/baselines/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 105, in join
    raise Exception(
Exception: process 0 terminated with signal SIGKILL

//stylegan-v/env/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 34 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

universome commented 2 years ago

Hi! Could you please tell me whether the original StyleGAN2-ADA runs on your system? It looks like the issue is with process launching, which we inherited from there. If so, it might be helpful to check their troubleshooting guide.

universome commented 2 years ago

Also, as far as I remember, SIGKILL is sometimes sent by Slurm when a job exceeds a resource limit such as its memory limit. Do you run it with Slurm? Could you try reducing the training resolution and batch size to some minimal values, like 32x32 and 16? SIGKILL can also be sent by other system managers.
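To make the SIGKILL point concrete, here is a minimal sketch (not part of StyleGAN-V) of how a parent process can tell that a child was killed by a signal rather than exiting on its own. The helper name `describe_exit` is hypothetical, and the child simply kills itself to stand in for a scheduler or OOM killer; it assumes a POSIX system.

```python
import signal
import subprocess
import sys


def describe_exit(cmd):
    """Run cmd and report whether it exited normally or was killed by a signal.

    On POSIX, subprocess reports death-by-signal as a negative return code
    whose absolute value is the signal number.
    """
    proc = subprocess.run(cmd)
    if proc.returncode < 0:
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exited with code {proc.returncode}"


# Hypothetical stand-in for a training process that an external manager kills:
# the child sends SIGKILL to itself.
KILLED_CHILD = [
    sys.executable,
    "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
]

if __name__ == "__main__":
    print(describe_exit(KILLED_CHILD))
    print(describe_exit([sys.executable, "-c", "pass"]))
```

A SIGKILL cannot be caught by the process that receives it, so no Python traceback comes from the killed worker itself. Only the parent's report, like the one from `torch.multiprocessing.spawn` above, shows up.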

skymanaditya1 commented 2 years ago

Thanks for the reply! I was able to train the PyTorch implementation of StyleGAN2 from https://github.com/lucidrains/stylegan2-pytorch. I do indeed run this on a Slurm server. In my setup, a different process runs on GPU 0 while I run StyleGAN-V on GPUs 1 and 2. When I train StyleGAN-V in isolation on all 4 GPUs with a small enough batch size (32), I no longer see this error. However, I get a different error (stack trace below):

Setting up augmentation...
Distributing across 4 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...

tick 0 kimg 0.1 time 23s sec/tick 2.6 sec/kimg 26.76 maintenance 20.4 cpumem 2.70 gpumem 9.26 augment 0.000
Evaluating metrics for how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5 ...
{"results": {"fvd2048_16f": 5910.987411352878}, "metric": "fvd2048_16f", "total_time": 51.558385610580444, "total_time_str": "52s", "num_gpus": 4, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1652602566.066554}
Traceback (most recent call last):
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 451, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 446, in main
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 375, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/training/training_loop.py", line 508, in training_loop
    result_dict = metric_main.calc_metric(
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 49, in calc_metric
    all_runs_results = [_metric_dict[metric](opts) for _ in range(num_runs)]
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 49, in <listcomp>
    all_runs_results = [_metric_dict[metric](opts) for _ in range(num_runs)]
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 123, in fvd2048_128f
    fvd = frechet_video_distance.compute_fvd(opts, max_real=2048, num_gen=2048, num_frames=128)
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/frechet_video_distance.py", line 31, in compute_fvd
    mu_real, sigma_real = metric_utils.compute_feature_stats_for_dataset(
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_utils.py", line 195, in compute_feature_stats_for_dataset
    dataset = dnnlib.util.construct_class_by_name(**dataset_kwargs)
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/dnnlib/util.py", line 292, in construct_class_by_name
    return call_func_by_name(*args, func_name=class_name, **kwargs)
  File "/home2/ravi_mishra/aditya1/stylegan-v/src/dnnlib/util.py", line 287, in call_func_by_name
    return func_obj(*args, **kwargs)
  File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/training/dataset.py", line 335, in __init__
    raise IOError('No videos found in the specified archive')
OSError: No videos found in the specified archive

I am trying to debug this as well.

skymanaditya1 commented 2 years ago

FYI, my data dir has 10,000 videos, each with exactly 25 frames. I am trying to compare the performance of StyleGAN-V against our method for unconditional video generation with fewer videos and fewer frames per video. From the code, it appears that the dataset rejects video dirs that do not have a minimum number of frames, which is why it throws the error "No videos found in the specified archive".
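A quick way to check this guess is to count frames per video dir before launching training. The sketch below is not the StyleGAN-V dataset code. It assumes one sub-directory of image frames per video, and the names `count_frames`, `split_by_min_frames` and `FRAME_SUFFIXES` are hypothetical.

```python
from pathlib import Path

# Assumption: frames are stored as image files with one of these extensions.
FRAME_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_frames(video_dir):
    """Count the image files directly inside one video directory."""
    return sum(
        1
        for p in Path(video_dir).iterdir()
        if p.is_file() and p.suffix.lower() in FRAME_SUFFIXES
    )


def split_by_min_frames(root, min_frames):
    """Return (kept, rejected) video directory names for a frame threshold."""
    kept, rejected = [], []
    for video_dir in sorted(Path(root).iterdir()):
        if not video_dir.is_dir():
            continue
        target = kept if count_frames(video_dir) >= min_frames else rejected
        target.append(video_dir.name)
    return kept, rejected
```

With 25-frame videos, `split_by_min_frames(root, 16)` would keep every dir, while `split_by_min_frames(root, 128)` would reject all of them, leaving an empty dataset like the one the error message describes.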

universome commented 2 years ago

Yes, I think the issue is that it is trying to compute two metrics which require 128-frame videos, and the dataset class rejects all of the shorter videos.
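As a rough sketch of how one might pick only the metrics that fit a short-video dataset: the metric names in the logs above (`fvd2048_16f`, `fvd2048_128f`) end in the number of frames they use, and the trace shows `fvd2048_128f` calling `compute_fvd` with `num_frames=128`. The helpers below rely on that naming pattern, which is an assumption about the convention rather than a documented API, and the names `frames_required` and `runnable_metrics` are hypothetical.

```python
import re
from typing import List, Optional

# Assumption: a trailing "_<N>f" in a metric name encodes the clip length.
_FRAMES_RE = re.compile(r"_(\d+)f$")


def frames_required(metric_name: str) -> Optional[int]:
    """Return the frame count encoded in the metric name, or None if absent."""
    match = _FRAMES_RE.search(metric_name)
    return int(match.group(1)) if match else None


def runnable_metrics(metric_names: List[str], frames_per_video: int) -> List[str]:
    """Keep metrics whose encoded frame count fits the available video length."""
    kept = []
    for name in metric_names:
        needed = frames_required(name)
        if needed is None or needed <= frames_per_video:
            kept.append(name)
    return kept
```

For 25-frame videos, `runnable_metrics(["fvd2048_16f", "fvd2048_128f"], 25)` keeps only `fvd2048_16f`, which matches the log above: the 16-frame FVD finished and the 128-frame one raised the error.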

skymanaditya1 commented 2 years ago

Yes, thank you! Closing.

universome / stylegan-v

Error encountered while running multi-gpu training #13