Open · YannickWehr opened this issue 4 years ago
I ran into a similar issue when running the Reformer on a Colab TPU with this gin config: https://github.com/google/trax/blob/master/trax/supervised/configs/reformer_imagenet64.gin
which also seems to use n_hashes > 1.
It appears to be a TPU driver bug, though I don't have details on that.
I managed to fix the problem by requesting a different version of tpu_driver, so in your notebook you should change:
url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/tpu_driver0.1-dev20191206'
to
url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/tpu_driver_nightly'
in the first cell of your Colab notebook. Hope that solves your problem too.
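For reference, the URL change above can be sketched as a small helper. This is only an illustration: the names tpu_driver_url and request_tpu_driver are hypothetical, and the assumption that the endpoint is called with an empty POST comes from typical Colab TPU setup cells, not from anything confirmed in this thread.

```python
import os
import urllib.request


def tpu_driver_url(colab_tpu_addr, version="tpu_driver_nightly"):
    """Build the version-request URL from a COLAB_TPU_ADDR value.

    colab_tpu_addr looks like "host:port"; only the host part is kept,
    and the request goes to port 8475 as in the URLs quoted above.
    """
    host = colab_tpu_addr.split(":")[0]
    return "http://" + host + ":8475/requestversion/" + version


def request_tpu_driver(version="tpu_driver_nightly"):
    """Ask the Colab TPU to switch driver version (hypothetical helper).

    Assumption: the endpoint accepts an empty POST request.
    Only works inside a Colab TPU runtime, where COLAB_TPU_ADDR is set.
    """
    url = tpu_driver_url(os.environ["COLAB_TPU_ADDR"], version)
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status
```

In a Colab TPU runtime you would call request_tpu_driver() once in the first cell, before JAX is configured to use the TPU backend. Passing version="tpu_driver0.1-dev20191206" reproduces the original URL, which is handy for checking whether the hang returns.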
Description
When training a Reformer that uses LSH attention with n_hashes > 1 on a TPU, training hangs: the trainer cannot complete even a single training step.
Environment information
Steps to reproduce