google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Ran out of memory in memory space hbm on RACE xlarge v3 on TPU v2-8 #168

Open · theword opened this issue 4 years ago

theword commented 4 years ago

First I am getting this warning:

```
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
```
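For reference, here is a minimal sketch that reproduces this warning (illustrative names only, not the ALBERT code): the gradient of a `tf.gather` is an `IndexedSlices`, and converting it to a dense tensor when its dense shape is not fully known prints exactly this message.

```python
import tensorflow as tf  # TF 1.15, graph mode

x = tf.placeholder(tf.float32, [None, 128])   # leading dimension unknown
gathered = tf.gather(x, tf.constant([0, 2]))  # gradient w.r.t. x is IndexedSlices
loss = tf.reduce_sum(gathered)

(grad,) = tf.gradients(loss, [x])
dense_grad = tf.convert_to_tensor(grad)       # emits the UserWarning above
```

I am not sure whether this conversion is related to the OOM below.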

Then following shortly after:

```
ERROR:tensorflow:Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close().
E0221 21:23:45.654859 139752372131584 error_handling.py:75] Error recorded from outfeed: Step was cancelled by an explicit call to Session::Close().
ERROR:tensorflow:Error recorded from training_loop: From /job:worker/replica:0/task:0:
Compilation failure: Ran out of memory in memory space hbm. Used 12.34G of 8.00G hbm. Exceeded hbm capacity by 4.34G.
```

The only code change I have made is that instead of reading a single all.txt file, I am reading in each individual file: `cur_path_list = tf.gfile.Glob(cur_dir + "/*.txt")`
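Roughly, the reading now looks like this (the directory path and variable names here are placeholders, not my exact code):

```python
import tensorflow as tf  # TF 1.15, tf.gfile API

cur_dir = "gs://my-bucket/race-data"               # placeholder path
cur_path_list = tf.gfile.Glob(cur_dir + "/*.txt")  # one entry per input file

lines = []
for path in cur_path_list:
    with tf.gfile.Open(path) as f:                 # works for local and GCS paths
        lines.extend(f.readlines())
```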

I am running TensorFlow 1.15.2 with Python 3. The base model ran perfectly and scored about 70%, but now I am unable to run xlarge on the TPU. Is the xlarge model too large for a TPU v2-8? Do I need to upgrade to a TPU v3-8 or a pod setup? I am using the default config file from TF Hub and the parameters from the README, with one change: the learning rate is 2e-5.
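For reference, my invocation looks roughly like this; flag names follow the README's RACE example (treat them as my paraphrase, not an exact copy), and the paths are placeholders for my actual buckets:

```bash
python -m albert.run_race \
  --albert_config_file=albert_config.json \
  --output_dir=gs://my-bucket/race-out \
  --do_train \
  --max_seq_length=512 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --use_tpu=True \
  --tpu_name=my-tpu
```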

Danny-Google commented 4 years ago

We haven't tried it on the TPU v2 version, but could you try it without dropout? We found that removing dropout can significantly reduce memory consumption.
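Concretely, a minimal sketch of what I mean, zeroing the dropout fields in the config file (field names as in the standard albert_config.json; the file paths are placeholders):

```python
import json

# Read the existing config, zero out both dropout probabilities, and write
# a copy; then point the fine-tuning script at the edited file.
with open("albert_config.json") as f:
    config = json.load(f)

config["hidden_dropout_prob"] = 0.0
config["attention_probs_dropout_prob"] = 0.0

with open("albert_config_no_dropout.json", "w") as f:
    json.dump(config, f, indent=2)
```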

guotong1988 commented 4 years ago

May I ask where you got the TPUs? Thank you very much.