apoorvkh closed this issue 5 years ago
I most likely encountered these issues because of malformed input data; the TPU errors were just not informative. Please ignore.
Thank you for investigating this. While we don't test every TPU configuration, it should work with v2 (provided the model fits in memory).
Please file a bug against TensorFlow for a more informative error message.
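Since the root cause here turned out to be malformed input data, one way to surface that kind of problem before it reaches the TPU is a quick CPU-side pass over a sample of examples. The following is only a minimal sketch under stated assumptions: `find_malformed`, `nested_shape`, the `"image"`/`"label"` dict layout, and the 128x128x3 shape are hypothetical and are not part of the BigGAN implementation or TensorFlow. It works on plain nested Python lists, so it would need adapting to however the dataset is actually decoded.

```python
import math

# Assumption: 128x128 RGB images, matching the dataset size mentioned in the issue.
EXPECTED_SHAPE = (128, 128, 3)


def nested_shape(x):
    """Return the shape of a rectangular nested list, or None if it is ragged."""
    if not isinstance(x, (list, tuple)):
        return ()  # a scalar has an empty shape
    if len(x) == 0:
        return (0,)
    child_shapes = [nested_shape(item) for item in x]
    first = child_shapes[0]
    if first is None or any(s != first for s in child_shapes):
        return None
    return (len(x),) + first


def iter_scalars(x):
    """Yield every scalar contained in a nested list."""
    if isinstance(x, (list, tuple)):
        for item in x:
            yield from iter_scalars(item)
    else:
        yield x


def find_malformed(examples, expected_shape=EXPECTED_SHAPE, num_classes=None):
    """Return (index, reason) pairs for examples that fail basic sanity checks.

    Each example is assumed (hypothetically) to be a dict with an "image"
    nested list and an integer "label". At most one reason is reported per example.
    """
    problems = []
    for i, ex in enumerate(examples):
        image = ex.get("image")
        label = ex.get("label")
        shape = nested_shape(image)
        if shape != expected_shape:
            problems.append((i, f"bad image shape: {shape}"))
        elif any(isinstance(v, float) and not math.isfinite(v)
                 for v in iter_scalars(image)):
            problems.append((i, "non-finite pixel value"))
        elif num_classes is not None and not (
                isinstance(label, int) and 0 <= label < num_classes):
            problems.append((i, f"label out of range: {label}"))
    return problems
```

Running something like this on a few hundred decoded examples before starting a pod job gives a readable reason per bad record, which the TPU-side errors in this thread did not.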
I'm trying to train on a 128x128 image dataset with the BigGAN implementation here using a v2-128 pod, but I'm encountering several errors that keep changing (highlights listed below) after the first "Dequeue next (500) batch(es) of data from outfeed". They persist even when I change the batch size from 2048 to 1024, reduce iterations per run, etc. They don't occur when training on v2-8 or v3-8 TPUs. Have you encountered these when training on pods, if that is the issue? Thanks!
Session::Close()