Closed sparshbhawsar closed 4 years ago
Hi, can you share the script you are using to launch the training? Are you using the use_tpu flag? Another potential fix would be to decrease the train_batch_size from 512 to, say, 64 and check if the problem goes away.
I am not using the use_tpu flag. I tried different train_batch_size values (512, 128, 64, 32, ...) and training only works when train_batch_size=4, but then it takes a lot of time.
But if you are not using the use_tpu flag, that means you're training on CPU, which is sadly expected to take a lot of time. Are you using a base or large model?
Hello @muelletm @eisenjulian, I am using the base model and want to know whether it is possible to train it using a Colab TPU with 12.72 GB RAM. I tried to train with every batch size from 512 down to 4; training works for the lower batch sizes but stops partway through. Recently, I was training the WTQ model; it ran for around 10 hours, was still in training mode, and then it stopped. Could you please tell me what the issue is?
I'm not 100% sure, but I think that's WAI (working as intended): Colab sessions time out when they run too long. I think that's because they are meant to demo models and code rather than to train full models.
Have you tried saving the progress after a couple of hours and continuing from there?
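To make that concrete, here is a minimal sketch of periodic checkpointing with `tf.train.CheckpointManager`, so a Colab timeout does not lose all progress. The model, optimizer, and checkpoint path below are stand-in assumptions, not the actual TAPAS training code; in Colab you would point `ckpt_dir` at a mounted Google Drive folder so the files survive the session.

```python
import tensorflow as tf

# In Colab, mount Drive first so checkpoints outlive the runtime:
#   from google.colab import drive
#   drive.mount('/content/drive')
#   ckpt_dir = '/content/drive/MyDrive/tapas_ckpts'
ckpt_dir = "/tmp/tapas_ckpts"

# Stand-in model/optimizer; replace with the real training objects.
model = tf.keras.Sequential([tf.keras.layers.Dense(4)])
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, ckpt_dir, max_to_keep=3)

# On startup, restore the latest checkpoint if one exists, so a new
# session resumes from where the previous one stopped.
ckpt.restore(manager.latest_checkpoint)

# Inside the training loop, call manager.save() every N steps.
model(tf.zeros([1, 2]))  # build the variables before saving
save_path = manager.save()
print("saved:", save_path)
```

Restarting the same script in a fresh session then picks up from `manager.latest_checkpoint` instead of training from scratch.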
Alternatively, you could run on a Cloud TPU.
Yes, you are right, it's a timeout issue: Colab has a session timeout limit of 12 hours.
Actually, I was unable to save the progress, because the training stopped partway through and all the progress was lost.
Okay, I will run on a Cloud TPU.
Thanks
Hello,
I got the same issue as #6 while training. I am using Google Colab with a 12.72 GB RAM TPU. Could you please suggest how to replace the usage of AUTOTUNE with a specific value, and how to decide what value to use?
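For reference, here is a minimal sketch of what replacing AUTOTUNE with a fixed value looks like in a `tf.data` pipeline. `tf.data.AUTOTUNE` lets the runtime pick the parallelism and prefetch depth dynamically, which can over-allocate memory on a constrained Colab runtime; pinning small fixed values caps that. The toy dataset and the chosen values below are illustrative assumptions, not taken from the TAPAS code; a common starting point is the number of CPU cores (or lower), raised only if the input pipeline becomes the bottleneck.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(100)

# Instead of:
#   ds = ds.map(fn, num_parallel_calls=tf.data.AUTOTUNE)
#   ds = ds.prefetch(tf.data.AUTOTUNE)
# pin small fixed values to bound memory use:
ds = ds.map(lambda x: x * 2, num_parallel_calls=2)
ds = ds.prefetch(1)

print(list(ds.take(3).as_numpy_iterator()))  # -> [0, 2, 4]
```

(In TensorFlow versions before 2.4, the constant lives at `tf.data.experimental.AUTOTUNE` instead.)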
Thanks! Sparsh Bhawsar