Closed (skymanaditya1 closed 2 years ago)
Hi! Could you please tell me whether the original StyleGAN2-ADA runs on your system? It looks like the issue is with process launching, which we inherited from there. If so, it might be helpful to check their troubleshooting guide.
Also, as far as I remember, SIGKILL is sometimes sent by Slurm when a job exceeds a resource limit such as memory. Do you run it with Slurm? Could you try reducing the training resolution and batch size to some minimal values, like 32x32 and 16? SIGKILL can also be sent by other system managers.
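In case it helps with checking the memory hypothesis, here is a minimal sketch for logging peak host memory from inside a training process. `peak_rss_mib` is a hypothetical helper, not part of StyleGAN-V or StyleGAN2-ADA; it only uses the standard `resource` module, and it assumes a Unix system (`ru_maxrss` is reported in KiB on Linux and in bytes on macOS).

```python
import resource
import sys


def peak_rss_mib(include_children: bool = False) -> float:
    """Return the peak resident set size of this process in MiB.

    Hypothetical helper for debugging; with include_children=True the peak
    of waited-for child processes is added on top.
    """
    # ru_maxrss is in KiB on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if include_children:
        peak += resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return peak / divisor


if __name__ == "__main__":
    print(f"peak host memory so far: {peak_rss_mib():.1f} MiB")
```

Printing this value every few iterations and comparing it with the memory requested for the job should show whether usage is climbing towards the limit before the process is killed.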
Thanks for the reply! I was able to train the PyTorch implementation of StyleGAN2 from https://github.com/lucidrains/stylegan2-pytorch. I do run this on a Slurm server. In my setup, a different process runs on GPU 0 while I run StyleGAN-V on GPUs 1 and 2. I don't see this error when I train StyleGAN-V in isolation using all 4 GPUs with a small enough batch size (32). However, I see a different type of error (stack trace below):
Setting up augmentation...
Distributing across 4 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...
tick 0 kimg 0.1 time 23s sec/tick 2.6 sec/kimg 26.76 maintenance 20.4 cpumem 2.70 gpumem 9.26 augment 0.000
Evaluating metrics for how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5 ...
{"results": {"fvd2048_16f": 5910.987411352878}, "metric": "fvd2048_16f", "total_time": 51.558385610580444, "total_time_str": "52s", "num_gpus": 4, "snapshot_pkl": "network-snapshot-000000.pkl", "timestamp": 1652602566.066554}
Traceback (most recent call last):
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 451, in <module>
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 375, in subprocess_fn
training_loop.training_loop(rank=rank, **args)
File "/ssd_scratch/cvit/mlp_aditya1/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/training/training_loop.py", line 508, in training_loop
result_dict = metric_main.calc_metric(
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 49, in calc_metric
all_runs_results = [_metric_dict[metric](opts) for _ in range(num_runs)]
File "/home2/ravi_mishra/aditya1/stylegan-v/src/metrics/metric_main.py", line 49, in <listcomp>
I am trying to debug this as well.
FYI, my data dir has 10,000 videos, each with exactly 25 frames. I am trying to compare the performance of StyleGAN-V against our method for unconditional video generation with fewer videos and fewer frames per video. From the code, it appears that the dataset rejects dirs that don't contain a minimum number of frames, which is why it throws the error "No videos found in the specified archive".
Yes, I think the issue is that it is trying to compute two metrics that require 128-frame videos, and the dataset class rejects all the short videos.
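A quick way to check this before launching training is to count the frames per video and see how many videos would survive a given minimum length. The sketch below is only an illustration: the function names are hypothetical, it assumes the dataset is a directory with one subdirectory of image frames per video, and it does not reproduce the actual filtering logic of the dataset class.

```python
import os
from typing import Dict, List, Tuple


def count_frames_per_video(
    root: str, exts: Tuple[str, ...] = (".jpg", ".jpeg", ".png")
) -> Dict[str, int]:
    """Map each video subdirectory of `root` to its number of frame images.

    Assumes (hypothetically) one subdirectory per video, frames stored as images.
    """
    counts = {}
    for name in sorted(os.listdir(root)):
        video_dir = os.path.join(root, name)
        if os.path.isdir(video_dir):
            counts[name] = sum(
                1 for f in os.listdir(video_dir) if f.lower().endswith(exts)
            )
    return counts


def videos_with_min_frames(counts: Dict[str, int], min_frames: int) -> List[str]:
    """Return the names of videos that have at least `min_frames` frames."""
    return [name for name, n in counts.items() if n >= min_frames]
```

For example, `videos_with_min_frames(count_frames_per_video(data_dir), 128)` returning an empty list for a dataset of 25-frame videos would be consistent with the "No videos found in the specified archive" error when a 128-frame metric is requested (`data_dir` here stands for your own dataset path).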
Yes, thank you! Closing.
Hi, I am running StyleGAN-V on a custom dataset, and everything seems to work fine until I hit an error related to "leaked semaphores".
The data is loaded correctly, the model is initialized, and it even starts printing the intermediate losses and accuracy. After a few minutes of training, I get this error: "Exception: process 0 terminated with signal SIGKILL"
Please find the full stack trace below:
Setting up augmentation...
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...
tick 0 kimg 0.0 time 52s sec/tick 1.5 sec/kimg 62.98 maintenance 51.0 cpumem 2.63 gpumem 9.00 augment 0.000
Traceback (most recent call last):
File "/ssd_scratch/cvit/aditya1/baselines/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 451, in <module>
main() # pylint: disable=no-value-for-parameter
File "/ssd_scratch/cvit/aditya1/baselines/stylegan-v/experiments/how2sign_faces_styleganv_resized_stylegan-v_random3_max32_how2sign_exp_styleganv_resized-aaf99e5/src/train.py", line 446, in main
torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
File "/ssd_scratch/cvit/aditya1/baselines/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/ssd_scratch/cvit/aditya1/baselines/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
while not context.join():
File "/ssd_scratch/cvit/aditya1/baselines/stylegan-v/env/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 105, in join
raise Exception(
Exception: process 0 terminated with signal SIGKILL
//stylegan-v/env/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 34 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '