google / compare_gan

Compare GAN code.
Apache License 2.0

Errors training BigGAN on TPU pods #20

Closed: apoorvkh closed this issue 5 years ago

apoorvkh commented 5 years ago

I'm trying to train the BigGAN implementation here on a 128x128 image dataset using a v2-128 pod, but I'm encountering several errors that keep changing (highlights listed below) after the first "Dequeue next (500) batch(es) of data from outfeed". The errors persist even when I change the batch size from 2048 to 1024, reduce the iterations per run, etc. They don't occur when training on v2-8 or v3-8 TPUs. Have you ever encountered these when training on pods, if that is the issue? Thanks!
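For reference, the batch sizes mentioned above can be sanity-checked against the core counts involved. This is only a rough sketch: it assumes the configured batch size is the global one and is split evenly across cores (as TPUEstimator-style training does), and `per_core_batch_size` is a hypothetical helper, not part of compare_gan.

```python
def per_core_batch_size(global_batch_size, num_cores):
    """Return the per-core batch size, or raise if the split is uneven.

    Assumption: the global batch is divided evenly across all TPU cores.
    """
    if global_batch_size % num_cores != 0:
        raise ValueError(
            f"batch size {global_batch_size} is not divisible by {num_cores} cores"
        )
    return global_batch_size // num_cores


if __name__ == "__main__":
    # For v2/v3 configurations the numeric suffix is the number of cores.
    for name, cores in [("v2-8", 8), ("v2-128", 128)]:
        for batch in (2048, 1024):
            print(name, batch, per_core_batch_size(batch, cores))
```

Under these assumptions both 2048 and 1024 divide evenly across 128 cores, so the batch size alone would not be expected to explain a failure that appears only on the pod.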

apoorvkh commented 5 years ago

I most likely encountered these issues because of malformed input data; the TPU errors were just not informative. Please ignore.
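Since malformed input data can surface as uninformative TPU errors, it may help to scan the dataset on CPU before launching a pod job. The sketch below is illustrative only: it assumes each example is a dict holding raw uint8 RGB bytes under `"image"` and an integer under `"label"`, and the names `validate_example`, `find_bad_examples`, and `NUM_CLASSES` are hypothetical, not part of compare_gan or TensorFlow.

```python
IMAGE_SHAPE = (128, 128, 3)  # assumed: 128x128 RGB, one byte per channel
NUM_CLASSES = 1000           # hypothetical; set to the dataset's class count


def validate_example(example, image_shape=IMAGE_SHAPE, num_classes=NUM_CLASSES):
    """Return a list of problems found in one example (empty if none)."""
    problems = []
    image = example.get("image")
    label = example.get("label")
    expected_len = image_shape[0] * image_shape[1] * image_shape[2]

    if not isinstance(image, (bytes, bytearray)):
        problems.append("image is missing or not raw bytes")
    elif len(image) != expected_len:
        problems.append(f"image has {len(image)} bytes, expected {expected_len}")

    # bool is a subclass of int, so reject it explicitly.
    if isinstance(label, bool) or not isinstance(label, int):
        problems.append("label is missing or not an integer")
    elif not 0 <= label < num_classes:
        problems.append(f"label {label} is outside [0, {num_classes})")

    return problems


def find_bad_examples(examples, **kwargs):
    """Yield (index, problems) for every example that fails validation."""
    for index, example in enumerate(examples):
        problems = validate_example(example, **kwargs)
        if problems:
            yield index, problems
```

Running `list(find_bad_examples(examples))` over the decoded dataset gives the index and reason for each failing record, which is usually easier to act on than an error raised from the outfeed.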

Marvin182 commented 5 years ago

Thank you for investigating this. While we don't test all TPU configurations, it should work on v2 (provided the model fits in memory).

Please file a bug against TensorFlow for a more informative error message.