Errors training BigGAN on TPU pods

apoorvkh commented 5 years ago

I'm trying to train BigGAN on a 128x128 image dataset with the implementation here, using a v2-128 pod, but I keep hitting errors that change from run to run (highlights listed below) after the first "Dequeue next (500) batch(es) of data from outfeed" message. They persist even when I reduce the batch size from 2048 to 1024, lower the iterations per run, etc. They don't occur when training on v2-8 or v3-8 TPUs. Have you ever encountered these when training on pods, if that is indeed the issue? Thanks!

Error recorded from infeed: Unable to enqueue when not opened
Caused by op u'input_pipeline_task0/while/InfeedQueue/enqueue/2'
Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close().

apoorvkh commented 5 years ago

It turns out I most likely hit these errors because of malformed input data; the TPU errors were just not informative. Please ignore.
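
For anyone landing here with the same symptoms, a minimal sketch of a host-side sanity check on the input data, run before launching on the pod, might look like the following. The thread does not say what was actually malformed, so everything here is an assumption: the record layout (raw uint8 RGB bytes plus an integer label), the `find_malformed` helper, and the `num_classes` parameter are hypothetical and are not part of compare_gan.

```python
# Hypothetical pre-flight check. The record layout below is an assumption,
# not the format compare_gan actually reads.
from typing import Iterable, List, Tuple

HEIGHT, WIDTH, CHANNELS = 128, 128, 3  # 128x128 RGB, as in the thread
EXPECTED_BYTES = HEIGHT * WIDTH * CHANNELS


def find_malformed(
    records: Iterable[Tuple[bytes, int]], num_classes: int
) -> List[Tuple[int, str]]:
    """Return (index, reason) for every record that fails a basic check."""
    problems = []
    for i, (image_bytes, label) in enumerate(records):
        if len(image_bytes) != EXPECTED_BYTES:
            # Wrong byte count means a truncated or wrongly sized image.
            problems.append(
                (i, f"expected {EXPECTED_BYTES} image bytes, got {len(image_bytes)}")
            )
        elif not 0 <= label < num_classes:
            problems.append((i, f"label {label} outside [0, {num_classes})"))
    return problems


if __name__ == "__main__":
    good = (bytes(EXPECTED_BYTES), 3)
    truncated = (bytes(100), 3)
    bad_label = (bytes(EXPECTED_BYTES), 10)
    for index, reason in find_malformed([good, truncated, bad_label], num_classes=10):
        print(f"record {index}: {reason}")
```

Running a check like this on the host reports the offending record directly, rather than leaving it to surface as an infeed or outfeed error on the TPU.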

Marvin182 commented 5 years ago

Thank you for investigating this. While we don't test all TPU configurations, it should work on v2 (provided the model fits in memory).

Please file a bug against TensorFlow for a more informative error message.

google / compare_gan

Errors training BigGAN on TPU pods #20