junyanz / CycleGAN

Software that can generate photos from paintings, turn horses into zebras, perform style transfer, and more.

Training using Google Colab TPU (tips and tricks?) #122

Closed · capilano closed this issue 4 years ago

capilano commented 4 years ago

I tried training on the horses-to-zebras dataset using Colab's TPU with a batch_size of 32 (4 per TPU core). The main differences were that I used zero padding (since reflect padding is not a supported TPU op), uniform kernel sizes (either 3 or 5 everywhere), and transposed conv2d with stride 2 for upsampling (see the sketch at the end of this comment). I trained for about 300 epochs, each epoch took around 30 seconds, and I got reasonable results. Also, I did not use an image pool, just the current mini-batch images to train the discriminator. Some observations:

  1. There were some boundary artifacts and ghosting effects for some images.
  2. Some images were reasonably good in terms of domain transfer and general image quality; some had typical GAN artifacts.
  3. Horses to zebras generally did better than zebras to horses.

Has anyone tried training using large batches? I used instance normalization and a learning rate schedule similar to the one described in the paper. Any tips or tricks for training with larger batches? Maybe batch_norm instead of instance_norm?
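For reference, a rough Keras-style sketch of the substitutions described above (zero padding via `padding='same'`, uniform 3x3 kernels, and a stride-2 `Conv2DTranspose` for upsampling). The filter counts and the omission of normalization layers are simplifications for illustration, not the exact configuration I used:

```python
import tensorflow as tf
from tensorflow.keras import layers

def resnet_block(x, filters=256, kernel_size=3):
    # Zero padding ('same') instead of reflect padding, since the reflect-pad
    # gradient is not defined on TPU. Instance norm layers omitted for brevity.
    # Assumes x already has `filters` channels so the residual add is valid.
    y = layers.Conv2D(filters, kernel_size, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, kernel_size, padding='same')(y)
    return layers.add([x, y])

def upsample(x, filters, kernel_size=3):
    # Stride-2 transposed convolution for upsampling instead of
    # resize-then-convolve or fractional upsampling.
    return layers.Conv2DTranspose(filters, kernel_size, strides=2,
                                  padding='same', activation='relu')(x)
```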
junyanz commented 4 years ago
  1. Using a better padding layer (most paddings are better than zero-padding) will partly address the boundary artifacts; a padding sketch follows this list.
  2. GAN artifacts sometimes appear. Either reducing the learning rate or using more recent GAN loss and architecture might help.
  3. Yes. For zebras->horses, it tries to hide the zebra stripes so that it can reconstruct them later. Sometimes this will cause artifacts. This is one of the limitations of CycleGAN. See the analysis paper for more details.
  4. We haven't used larger batches.
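Regarding point 1 and the TPU constraint mentioned earlier in the thread, one possible workaround is to build reflection padding out of slice/reverse/concat ops rather than `tf.pad(mode='REFLECT')`. This is an untested sketch: the math matches reflection padding, but whether it sidesteps the TPU gradient limitation is an assumption, and it is not something the repo provides:

```python
import tensorflow as tf

def reflect_pad_2d(x, pad=1):
    """Reflection-pad an NHWC tensor using only slice, reverse, and concat.

    Avoids tf.pad(mode='REFLECT'), whose gradient may not be available on TPU
    (untested assumption); the result matches 'REFLECT' padding.
    """
    top    = tf.reverse(x[:, 1:pad + 1, :, :], axis=[1])
    bottom = tf.reverse(x[:, -pad - 1:-1, :, :], axis=[1])
    x = tf.concat([top, x, bottom], axis=1)
    left   = tf.reverse(x[:, :, 1:pad + 1, :], axis=[2])
    right  = tf.reverse(x[:, :, -pad - 1:-1, :], axis=[2])
    return tf.concat([left, x, right], axis=2)
```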
capilano commented 4 years ago

Thanks for your reply.

  1. I used TensorFlow, which does not support reflect or symmetric padding on TPUs: the padding op itself is supported, but its gradient is not defined for TPUs.
  2. The learning rate starts at 2e-4 and decays down to 1e-6 towards the end.
  3. Just out of curiosity, will adding extra channels help (4 or 5 input channels instead of 3)?
  4. I did not experiment with architectures; I just used the resnet architecture with a PatchGAN discriminator and the LSGAN loss. I mainly wanted to see how a larger batch size impacts training in terms of speed, quality of generated images, etc.
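For concreteness, here is a minimal TF sketch of the LSGAN objective and a linear decay schedule from 2e-4 toward 1e-6 as described above. The function names, the 300-epoch total, and the decay starting at the halfway point are illustrative assumptions based on this thread and the paper's schedule, not the exact code I ran:

```python
import tensorflow as tf

# LSGAN (least-squares GAN) objectives.
def d_loss(real_logits, fake_logits):
    # Discriminator pushes real patches toward 1 and fake patches toward 0.
    return 0.5 * (tf.reduce_mean(tf.square(real_logits - 1.0)) +
                  tf.reduce_mean(tf.square(fake_logits)))

def g_loss(fake_logits):
    # Generator pushes fake patches toward 1.
    return tf.reduce_mean(tf.square(fake_logits - 1.0))

# Constant learning rate for the first half of training, then linear decay
# from 2e-4 down to ~1e-6 over the remaining epochs (assumed split point).
def lr_for_epoch(epoch, total_epochs=300, base_lr=2e-4, final_lr=1e-6):
    half = total_epochs // 2
    if epoch < half:
        return base_lr
    frac = (epoch - half) / max(1, total_epochs - half)
    return base_lr + frac * (final_lr - base_lr)
```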

junyanz commented 4 years ago
  1. We haven't made it work by adding additional channels.
  2. You are free to try if the memory size is not an issue.
capilano commented 4 years ago

Thanks. I did try a batch size of 32, and I think it was possible to go up to 56 with minor changes to the architecture; beyond that there were memory issues. I think the results were decent, the main advantage being that training completes in about an hour and a half.
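For anyone else trying this, here is a minimal sketch of the Colab TPU setup that shards a global batch of 32 across the 8 TPU cores (4 per core) as described above. It assumes TF 2.x with `tf.distribute.TPUStrategy` (older 2.x releases expose it as `tf.distribute.experimental.TPUStrategy`):

```python
import tensorflow as tf

# Connect to the Colab TPU and build a distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

GLOBAL_BATCH_SIZE = 32  # sharded across 8 cores -> 4 images per core

with strategy.scope():
    # Build the two generators, two discriminators, and their optimizers here
    # so their variables are replicated across the TPU cores.
    ...
```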