kwotsin / mimicry

[CVPR 2020 Workshop] A PyTorch GAN library that reproduces research results for popular GANs.
MIT License
602 stars, 62 forks

Are there any plans to develop a distributed version? #23

Open Leiwx52 opened 4 years ago

Leiwx52 commented 4 years ago

Hi! Thank you for your great contribution to this repo. Actually I found it really convenient to reproduce various GANs with the help of mimicry.

However, when I read the training code in the examples, I found that it might work well for the single-GPU scenario but is not quite suitable for distributed training. I had a glance at the source code of Logger, which enables visualization with TensorBoard, and it turns out that if one adopts distributed training with torch.nn.parallel.DistributedDataParallel, each process (one process per GPU) will create a new file to record the information of that GPU/process. Apparently, this is not what we want. A possible solution is to create a TensorBoard file only if the process is rank 0, and to record only the average metric. If you are going to improve this, refer to torch.distributed and see the ImageNet training example in the official PyTorch examples.
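To make the suggestion concrete, here is a rough sketch of what I have in mind. It is not mimicry's actual Logger: the names `get_rank`, `get_world_size`, `average_across_processes` and `RankZeroLogger` are hypothetical helpers I made up for illustration. The sketch assumes the process group has already been initialised with `torch.distributed.init_process_group` and, for the NCCL backend, that each process has already selected its GPU with `torch.cuda.set_device`.

```python
import torch
import torch.distributed as dist


def get_rank() -> int:
    """Rank of this process, or 0 when not running distributed."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0


def get_world_size() -> int:
    """Number of processes, or 1 when not running distributed."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return 1


def average_across_processes(value: float) -> float:
    """Average a scalar over all processes (no-op in a single process)."""
    world_size = get_world_size()
    if world_size == 1:
        return float(value)
    # NCCL only reduces CUDA tensors; gloo can reduce CPU tensors.
    device = "cuda" if dist.get_backend() == "nccl" else "cpu"
    tensor = torch.tensor([value], dtype=torch.float64, device=device)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return (tensor / world_size).item()


class RankZeroLogger:
    """Hypothetical wrapper: only rank 0 owns a TensorBoard event file."""

    def __init__(self, log_dir: str):
        self.writer = None
        if get_rank() == 0:
            # Imported lazily so that other ranks never touch TensorBoard.
            from torch.utils.tensorboard import SummaryWriter
            self.writer = SummaryWriter(log_dir=log_dir)

    def log_scalar(self, tag: str, value: float, global_step: int) -> float:
        # Every rank must call this, because all_reduce is a collective
        # operation; only rank 0 actually writes the averaged value.
        averaged = average_across_processes(value)
        if self.writer is not None:
            self.writer.add_scalar(tag, averaged, global_step)
        return averaged
```

In a training loop, every process would call `logger.log_scalar("errG", errG.item(), global_step)` (the tag name is just an example), and only one event file would appear in the log directory. Without an initialised process group the helpers fall back to rank 0 and world size 1, so the same code should still run on a single GPU.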

kwotsin commented 4 years ago

Hi @WingsleyLui, thank you for the comments! Indeed, this is something I'm looking at, especially since there are some models (e.g. SAGAN) that seem to work only if I use a large batch size, which is only possible with multiple GPUs (or a really big one). Your suggestions are fantastic, and I will keep them in mind while I work on it. I will keep this issue open and update it when it is done!