jppgks closed this issue 6 years ago
Hi Joppe,
the training of a single GAN is done on a single GPU (it's relatively fast for the architecture and datasets that we used).
We launched multiple experiments in parallel: first we ran compare_gan_generate_tasks
to create the set of experiments to run, then we ran compare_gan_run_one_task
on many machines (machine 0 with task_num=0, machine 1 with task_num=1, etc.).
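For reference, that workflow looks roughly like the sketch below. The flag names (--workdir, --experiment, --task_num) are taken from the v1 README as I remember them, so double-check them there before running anything:

```shell
# Generate the full set of tasks once, into a work directory all machines can read.
# (Flag names as remembered from the v1 README -- verify against the actual CLI.)
compare_gan_generate_tasks --workdir=/tmp/results --experiment=test

# Then each machine runs exactly one task; no coordination between machines is needed.
compare_gan_run_one_task --workdir=/tmp/results --task_num=0   # machine 0
compare_gan_run_one_task --workdir=/tmp/results --task_num=1   # machine 1
# ... and so on, one task_num per machine.
```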
@jppgks Comparing multiple experiments in parallel is not the same as distributed training, unless hyper-parameter optimization is the end goal. Is that what you mean by multiple tasks?
Note: We have updated the framework in the meantime and it now supports distributed training (single run on multiple machines) for TPUs.
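For anyone finding this later: in the rewritten framework a single run is driven by compare_gan/main.py together with a gin config. The sketch below is only a guess at the invocation; the config path and the TPU-related flags are assumptions, so check main.py and the README for the real flag names:

```shell
# Rough sketch only: the flag names and config path are assumptions, verify them
# against compare_gan/main.py and the README before running anything.
# (There will also be a flag for pointing the job at your TPU; check main.py for its name.)
python compare_gan/main.py \
  --model_dir=gs://my-bucket/experiment_1 \
  --gin_config=example_configs/resnet_cifar10.gin \
  --use_tpu=true
```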
@Marvin182 where can I find this in the code?
Thanks for open sourcing the code for this awesome paper!
I’m wondering if you used distributed training of the different GAN models during experimentation. If so, could you share an example of how to launch a distributed training job using the compare_gan code?