tensorflow / cloud

The TensorFlow Cloud repository provides APIs that make it easy to go from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud.
https://github.com/tensorflow/cloud
Apache License 2.0
371 stars, 84 forks

TensorFlow Cloud Tutorial failed to train model #341

Open karlunho opened 3 years ago

karlunho commented 3 years ago

When following the TensorFlow Cloud tutorial (https://www.tensorflow.org/cloud/tutorials/overview), the model fails to train on GCP. I've attached the log file and a screenshot of the error. I'm a Google employee (LDAP: alankho) and am happy to share my GCP project with whoever works on the bug.

The failure happened on Google AI Platform about 5 hours into training the model.

downloaded-logs-20210618-235625.csv

karlunho commented 3 years ago

I tried again with TPUs, but that run also errored out.

downloaded-logs-20210619-002835.csv

karlunho commented 3 years ago

I can't tell whether AI Platform is using the GPUs properly: during training, the GPU page shows 0% utilization, whereas the CPU page shows close to 80%. I've uploaded screenshots for the team.

Screen Shot 2021-06-19 at 8 32 57 AM
Screen Shot 2021-06-19 at 8 31 19 AM

aarushisoni commented 1 year ago

Hi, my name is Aarushi Soni. I would like to contribute to this issue. Is it still open? I am a first-time contributor, so please guide me through the process.