google / trax

Trax — Deep Learning with Clear Code and Speed

LSH Reformer with multiple hashes not possible on TPU #998

Open YannickWehr opened 4 years ago

YannickWehr commented 4 years ago

Description

When training a Reformer with LSH attention and n_hashes > 1 on a TPU, training hangs: the trainer does not complete even a single training step.

Environment information

Google Colab VM

Steps to reproduce

1. Open, for example, this Colab: https://colab.research.google.com/github/google/trax/blob/master/trax/models/reformer/text_generation.ipynb
2. In the gin config, set the LSH n_hashes to 2 and leave everything else unchanged.
3. Set the accelerator to TPU and attempt to train one step. Training hangs.
4. Set the accelerator to GPU and retry. Training runs normally.
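
For step 2, here is a minimal sketch of patching the value in the config text before it is parsed. It assumes the notebook's gin config binds the value on a line of the form `LSHSelfAttention.n_hashes = 1`; that binding name and the helper `set_n_hashes` are illustrative assumptions, so check them against the actual config cell.

```python
import re


def set_n_hashes(gin_config: str, n_hashes: int) -> str:
    """Return gin_config with every `<name>.n_hashes = <int>` binding set to n_hashes."""
    # Match only within a single line, so blank lines and newlines are left untouched.
    pattern = re.compile(
        r"^([ \t]*[\w./]+\.n_hashes[ \t]*=[ \t]*)\d+[ \t]*$",
        flags=re.MULTILINE,
    )
    return pattern.sub(lambda m: f"{m.group(1)}{n_hashes}", gin_config)


# Hypothetical excerpt of a gin config; the binding names are assumptions.
example_config = """\
LSHSelfAttention.chunk_len = 64
LSHSelfAttention.n_hashes = 1
"""

patched = set_n_hashes(example_config, 2)
print(patched)
```

Editing the line by hand in the config cell has the same effect. The helper only makes it easier to switch between n_hashes = 1 and n_hashes = 2 when comparing the TPU and GPU runs.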
syzymon commented 3 years ago

I've had a similar issue running the Reformer on a Colab TPU with this gin config: https://github.com/google/trax/blob/master/trax/supervised/configs/reformer_imagenet64.gin

That config also seems to use n_hashes > 1.

It seems to be a TPU driver bug (I don't have details on that). I managed to fix the problem by requesting a different version of tpu_driver, so in your notebook you should change:

url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/tpu_driver0.1-dev20191206'

to

url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/tpu_driver_nightly'

in the first cell of your colab. Hope that solves your problem too.
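
The change above can be sketched as a small helper that builds the request URL from `COLAB_TPU_ADDR` and the driver version name. The function names are made up for illustration, and the sketch assumes the endpoint accepts an empty POST, as the notebook's first cell appears to do. It uses the standard library's `urllib` rather than whatever HTTP client the notebook imports.

```python
import os
import urllib.request


def tpu_driver_request_url(colab_tpu_addr: str, driver_version: str) -> str:
    """Build the requestversion URL from a 'host:port' TPU address."""
    host = colab_tpu_addr.split(":")[0]
    return f"http://{host}:8475/requestversion/{driver_version}"


def request_tpu_driver(driver_version: str = "tpu_driver_nightly",
                       timeout: float = 30.0) -> int:
    """Ask the Colab TPU host for a driver version and return the HTTP status.

    Assumes COLAB_TPU_ADDR is set (a Colab TPU runtime) and that the
    endpoint accepts an empty POST body.
    """
    url = tpu_driver_request_url(os.environ["COLAB_TPU_ADDR"], driver_version)
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

In a Colab TPU runtime, calling `request_tpu_driver()` before JAX is configured should be equivalent to the edited first cell. Passing `"tpu_driver0.1-dev20191206"` instead requests the version the notebook originally pinned.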