Leiwx52 opened this issue 4 years ago
Hi @WingsleyLui, thank you for the comments! Indeed this is something I'm looking at, especially since there are some models (e.g. SAGAN) that seem to only work with a large batch size, which is only possible with multiple GPUs (or a really big one). Your suggestions are fantastic, and I will keep them in mind while I work on it -- will keep this issue open and update when it is done!
Hi! Thank you for your great contribution to this repo. I found it really convenient to reproduce various GANs with the help of `mimicry`. However, when I read the training code in the examples, I found that it works well in the single-GPU scenario but is not quite suitable for distributed training. I had a glance at the source code of `Logger`, which enables visualization with TensorBoard, and it turns out that if one adopts distributed training with `torch.nn.parallel.DistributedDataParallel`, each process (one process per GPU) will create a new file to record the information of that GPU/process. Apparently, this is not what we want. A possible solution is to create a TensorBoard file only on the rank-0 process and record only the averaged metric, as in the sketch below. If you are going to improve this, refer to `torch.distributed` and the official PyTorch ImageNet training example.
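To illustrate the rank-0 pattern described above: not part of `mimicry`'s API, just a minimal sketch assuming the process group has already been initialized via `dist.init_process_group`; the helper name `average_and_log` and the log directory are hypothetical.

```python
import torch
import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

def average_and_log(writer, tag, value, step):
    # Average a scalar metric across all processes with all_reduce,
    # then write it from rank 0 only so a single event file is produced.
    t = torch.tensor([float(value)], device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    t /= dist.get_world_size()
    if dist.get_rank() == 0 and writer is not None:
        writer.add_scalar(tag, t.item(), step)

# Only rank 0 creates a SummaryWriter; all other ranks log nothing.
writer = SummaryWriter(log_dir="./log") if dist.get_rank() == 0 else None
average_and_log(writer, "loss/errD", 0.42, step=100)
```

Every rank must still call `average_and_log`, since `all_reduce` is a collective operation that blocks until all processes participate.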