sihyun-yu / digan

Official PyTorch implementation of Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks (ICLR 2022).
https://sihyun.me/digan/

About the GPU requirement #2

Closed: johannwyh closed this issue 2 years ago

johannwyh commented 2 years ago

Dear authors,

Hello! First of all, thank you for your inspiring work!

I encountered an issue with multi-GPU training on 8 V100 (16GB) GPUs. When the models are distributed across the GPUs in the following snippet,

if rank == 0:
    print(f'Distributing across {num_gpus} GPUs...')
ddp_modules = dict()
for name, module in [('G_mapping', G.mapping), ('G_synthesis', G.synthesis), ('D', D), (None, G_ema), ('augment_pipe', augment_pipe)]:
    if rank == 0:
        print("[Distributing] Module {} ...".format(name))

    if (num_gpus > 1) and (module is not None) and len(list(module.parameters())) != 0:
        module.requires_grad_(True)
        module = torch.nn.parallel.DistributedDataParallel(module, device_ids=[device], broadcast_buffers=False,
                                                           find_unused_parameters=False)
        module.requires_grad_(False)

    if rank == 0:
        print("[Distributed] Module {}".format(name))

    if name is not None:
        ddp_modules[name] = module

the process failed on the first module, G_mapping, reporting:

[Distributing] Module G_mapping ...
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1640811806235/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.  

The GPU memory consumption at that point is as follows:

wangyuhan-8-v100         Sat Mar  5 12:28:13 2022  460.73.01
[0] Tesla V100-SXM2-16GB | 36'C,  22 % | 15415 / 16160 MB | yuhan:python/31701(1283M) yuhan:python/31696(6905M) yuhan:python/31699(1151M) yuhan:python/31700(1241M) yuhan:python/31697(1283M) yuhan:python/31698(1283M) yuhan:python/31702(1175M) yuhan:python/31703(1099M)
[1] Tesla V100-SXM2-16GB | 37'C,   0 % |  2022 / 16160 MB | yuhan:python/31697(2019M)
[2] Tesla V100-SXM2-16GB | 38'C,   0 % |  2022 / 16160 MB | yuhan:python/31698(2019M)
[3] Tesla V100-SXM2-16GB | 39'C,   0 % |  2014 / 16160 MB | yuhan:python/31699(2011M)
[4] Tesla V100-SXM2-16GB | 35'C,   0 % |  2014 / 16160 MB | yuhan:python/31700(2011M)
[5] Tesla V100-SXM2-16GB | 35'C,   0 % |  2022 / 16160 MB | yuhan:python/31701(2019M)
[6] Tesla V100-SXM2-16GB | 36'C,   0 % |  2014 / 16160 MB | yuhan:python/31702(2011M)
[7] Tesla V100-SXM2-16GB | 37'C,   0 % |  2014 / 16160 MB | yuhan:python/31703(2011M)

I am not very familiar with this, but it seems that GPU 0 is running out of memory (all eight processes are holding memory on it). I am wondering whether that is the reason behind the ncclUnhandledCudaError.
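
For reference, my understanding is that each spawned process should pin itself to its own GPU before allocating anything on CUDA; the sketch below is only a generic illustration of that idea (the setup_process helper is hypothetical, not this repo's code):

import torch
import torch.distributed as dist

# Generic sketch, not the repo's actual code: pin each rank to its own GPU
# *before* creating any tensors, otherwise every rank may also open a CUDA
# context on GPU 0, which would match the memory pattern shown above.
def setup_process(rank, num_gpus):
    device = torch.device('cuda', rank)
    torch.cuda.set_device(device)        # pin this process to its GPU
    dist.init_process_group(             # NCCL backend for multi-GPU training
        backend='nccl', init_method='env://',
        rank=rank, world_size=num_gpus)
    return device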

Could you please help me figure out what caused this error? Does your implementation work on 16GB V100 GPUs?

Thank you very much.

johannwyh commented 2 years ago

Sorry for the bother; I solved this issue by exactly following the environment setup given in the README.

By the way, the environment that failed in the situation above was as follows:

>>> torch.__version__
'1.10.2'
>>> torch.version.cuda
'11.3'

nvcc -V also reports CUDA 11.3.
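
For anyone hitting the same error, below is a quick, generic way to print the versions that matter (just an illustrative check, not something from the repo); the CUDA version of the torch build should line up with the toolkit reported by nvcc -V:

import torch

# Generic environment check, not from the repo: print the torch build,
# the CUDA version it was compiled against, and its bundled NCCL version
# (the latter is what appears in the NCCL error message above).
print(torch.__version__)          # e.g. '1.10.2' in the failing setup
print(torch.version.cuda)         # CUDA version of the torch build
print(torch.cuda.nccl.version())  # NCCL version torch was built with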

sihyun-yu commented 2 years ago

Thanks for pointing it out! I'll close the issue.