shawwn / gpt-2

Code for the paper "Language Models are Unsupervised Multitask Learners"
MIT License

Running full cluster on Colab 345M and 2048 context #4

Open trisongz opened 4 years ago

trisongz commented 4 years ago

Have been following your work on Twitter - thanks for taking the time to document your progress and process. It's been incredibly insightful for understanding the nuances of fine-tuning and training with different parameters, rather than explanations that just say "do this."

Currently running your notebook on Colab with TPU using the latest branch. I do understand that it's been intentionally capped in order to prevent OOM for the larger models.

I'm trying to test the extent of context tokens for input training to output generation using 2048 and starting with the 345M model as the base. I had set the batch size to 8 but I believe the latest TPU code only uses 1 core regardless.

Is there a specific branch to use for the full cluster TPU with the assumption that it wouldn't go OOM?

shawwn commented 4 years ago

Nope, right now there's no branch for full-cluster TPUs. However, I did run an experiment where I ran 8 copies of the model on 8 different cores, and aggregated the results. I observed a 4x speedup, as measured in tokens/sec being processed.
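For readers unfamiliar with the pattern: running N copies of the model on N cores and aggregating the results is synchronous data parallelism - each replica computes gradients on its own shard of the batch, the gradients are averaged (an all-reduce), and one shared update is applied. The sketch below is not code from this repository; it is a minimal NumPy illustration of that aggregation step, with a toy quadratic loss standing in for the model.

```python
import numpy as np

def replica_step(params, shard):
    # Toy stand-in for one replica's forward/backward pass:
    # gradient of 0.5 * ||shard @ params||^2 with respect to params.
    preds = shard @ params
    return shard.T @ preds / len(shard)

def allreduce_step(params, shards, lr=0.1):
    # Run one step per replica, average the gradients (the all-reduce),
    # and apply a single shared update. The pattern is the same whether
    # the replicas are 8 cores of one TPU or 8 separate TPUs.
    grads = [replica_step(params, s) for s in shards]
    return params - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
params = rng.normal(size=3)
shards = [rng.normal(size=(4, 3)) for _ in range(8)]  # 8 "cores"
new_params = allreduce_step(params, shards)
```

Note the speedup is sublinear in practice (4x from 8 cores here) because of per-step synchronization and, as mentioned above, TPU startup and freeze overhead.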

HOWEVER, TPUs freeze. A lot. I don't know why. And it took a loooong time to start 8 models on 8 different cores -- probably about 30 minutes to start up. So the gains are diminished somewhat due to that.

I am currently working on some code to run on multiple TPUs at once, rather than try to use multiple cores of a single TPU. But, if you want the 8-core TPU version, I can push the (unfinished, somewhat awful) code to a separate branch and you can run with it.

Also see https://github.com/shawwn/gpt-2/issues/5 for ongoing performance discussion.

leejason commented 4 years ago

It would be appreciated to have the "8-core TPU version." Can that version be scaled to a TPU Pod without modification if TFRC resources are available?