google-research / tapas

End-to-end neural table-text understanding models.
Apache License 2.0

Training got stuck #16

Closed sparshbhawsar closed 4 years ago

sparshbhawsar commented 4 years ago
Hello @jiabol, thanks for the report. We haven't encountered that warning before, but on a first inspection it seems to be triggered by the usage of AUTOTUNE in tapas/datasets/dataset.py, and it's logged here.

While we continue to investigate, can you try:
a) Using a bigger machine in terms of CPU / RAM? What are the specs of the one you are currently using?
b) Replacing the usage of AUTOTUNE in tapas/datasets/dataset.py with specific values (see the sketch below)? Make sure you installed with -e for the changes to be reflected without re-installation.

Thanks!
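For reference, replacing AUTOTUNE with fixed values in a tf.data input pipeline could look roughly like the following. This is a minimal sketch of an assumed pipeline shape, not the exact code in tapas/datasets/dataset.py; the NUM_PARALLEL_CALLS constant and the make_dataset helper are illustrative only.

```python
# Minimal sketch (assumed pipeline shape, not the exact tapas/datasets/dataset.py
# code): replacing tf.data.experimental.AUTOTUNE with small fixed values so that
# tf.data does not over-subscribe the CPU / RAM of a small machine.
import tensorflow.compat.v1 as tf

# A conservative fixed degree of parallelism; a reasonable starting point is
# the number of CPU cores on the machine (e.g. 2 on a Colab runtime).
NUM_PARALLEL_CALLS = 2


def make_dataset(file_pattern, parse_fn, batch_size):
  files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
  ds = files.interleave(
      tf.data.TFRecordDataset,
      cycle_length=NUM_PARALLEL_CALLS,           # was AUTOTUNE
      num_parallel_calls=NUM_PARALLEL_CALLS)     # was AUTOTUNE
  ds = ds.map(parse_fn, num_parallel_calls=NUM_PARALLEL_CALLS)  # was AUTOTUNE
  ds = ds.batch(batch_size, drop_remainder=True)
  return ds.prefetch(2)                          # small fixed buffer, was AUTOTUNE
```

A small fixed value keeps tf.data from spawning as many parallel workers and buffers as AUTOTUNE chooses, which can reduce memory pressure on a constrained machine.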

Hello,

I got the same issue as #6 while training. I am using Google Colab (TPU runtime, 12.72 GB RAM). Could you please suggest how to replace the usage of AUTOTUNE with a specific value, and how to decide on that value?

Thanks! Sparsh Bhawsar

eisenjulian commented 4 years ago

Hi, can you share the script you are using to launch the training? Are you using the use_tpu flag? Another potential fix would be to decrease the train_batch_size from 512 to, say, 64 and check if the problem goes away.

sparshbhawsar commented 4 years ago

Hi, can you share the script you are using to launch the training? Are you using the use_tpu flag? Another potential fix would be to decrease the train_batch_size from 512 to, say, 64 and check if the problem goes away.

I am not using the use_tpu flag. I tried different train_batch_size values (512, 128, 64, 32, ...), and training only works when train_batch_size=4, but it takes a lot of time.

eisenjulian commented 4 years ago

But if you are not using the use_tpu flag, that means you're training on CPU, and sadly that is expected to take a lot of time. Are you using a base or a large model?
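For context, this is roughly how the use_tpu flag and train_batch_size usually feed into a TPUEstimator in TF 1.x-style code. It is a hedged sketch of that common pattern, not the exact tapas experiment script; the build_estimator helper and its arguments are assumptions. When use_tpu is false, the same graph runs on the local CPU/GPU, which is why training is so much slower.

```python
# Sketch of the usual TPUEstimator wiring (an assumption, not the exact tapas
# code): use_tpu decides whether the TPU is actually used; when it is False,
# training falls back to the local CPU/GPU even inside a TPU Colab runtime.
import tensorflow.compat.v1 as tf


def build_estimator(model_fn, model_dir, tpu_name, use_tpu, train_batch_size):
  cluster = (tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_name)
             if use_tpu else None)
  run_config = tf.estimator.tpu.RunConfig(
      cluster=cluster,
      model_dir=model_dir,
      save_checkpoints_steps=1000,
      tpu_config=tf.estimator.tpu.TPUConfig(iterations_per_loop=1000))
  return tf.estimator.tpu.TPUEstimator(
      model_fn=model_fn,
      config=run_config,
      use_tpu=use_tpu,                    # False -> train on CPU/GPU instead
      train_batch_size=train_batch_size,  # e.g. 512 on a TPU, much smaller on CPU
      eval_batch_size=32)
```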

sparshbhawsar commented 4 years ago

Hello @muelletm @eisenjulian, I am using the base model, and I want to know whether it is possible to train the model using a Colab TPU with 12.72 GB RAM. I tried to train the model with every batch size from 512 down to 4; training works for the lower batch sizes, but it stops partway through. Recently I was training a WTQ model; it ran for around 10 hours and was still training when it stopped. Could you please tell me what the issue is?

muelletm commented 4 years ago

I'm not 100% sure, but I think that's working as intended (WAI): Colab sessions time out when they run for too long. I think that's because they are meant to demo models and code rather than to train full models.

Have you tried saving the progress after a couple of hours and continuing from there?

Alternatively, you could run on a Cloud TPU.
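On saving progress: in the Estimator-based setup, checkpoints are written periodically to model_dir, and re-running the same training command resumes from the latest checkpoint there. A minimal sketch, assuming that pattern; the bucket path and step counts below are placeholders.

```python
# Sketch of resuming training from checkpoints (assumed Estimator pattern,
# not the exact tapas training script). Pointing model_dir at persistent
# storage (e.g. a GCS bucket or mounted Drive) lets a new Colab session
# pick up from the last saved checkpoint instead of starting over.
import tensorflow.compat.v1 as tf

MODEL_DIR = "gs://my-bucket/tapas_model"  # placeholder path

run_config = tf.estimator.RunConfig(
    model_dir=MODEL_DIR,
    save_checkpoints_steps=1000,   # write a checkpoint every 1000 steps
    keep_checkpoint_max=5)         # keep only the 5 most recent checkpoints

# estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
# estimator.train(input_fn=train_input_fn, max_steps=total_steps)
# After a restart, train() resumes from tf.train.latest_checkpoint(MODEL_DIR).
```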

sparshbhawsar commented 4 years ago

Yes, you are right, it's a timeout issue; Colab has a session timeout limit of 12 hours.

Actually, I was unable to save the progress because the training stopped partway through, so all the progress was lost.

Okay, I will run on a Cloud TPU.

Thanks