pangtouyuqqq closed this issue 1 year ago
Hello! Thanks for your interest. Sorry for omitting this info from the paper.
MT3 was trained on 16 TPUv3s. A training run of the mixture model for 1M steps (the training setup reported in the paper) took 2.8 days on this hardware.
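For a rough sense of scale, those figures work out to about 44.8 device-days and roughly 4 training steps per second. Here is a minimal sketch of that arithmetic, assuming "16 TPUv3s" means 16 devices in use for the full 2.8 days. The helper name `training_budget` is hypothetical and not part of the MT3 codebase.

```python
# Figures quoted in the reply above.
NUM_DEVICES = 16          # "16 TPUv3s" (assumed to mean 16 devices)
TRAIN_STEPS = 1_000_000   # 1M steps
WALL_DAYS = 2.8           # reported wall-clock training time


def training_budget(num_devices, train_steps, wall_days):
    """Derive rough cost and throughput numbers from the reported run."""
    wall_seconds = wall_days * 24 * 60 * 60
    return {
        "device_days": num_devices * wall_days,
        "steps_per_second": train_steps / wall_seconds,
        "seconds_per_step": wall_seconds / train_steps,
    }


budget = training_budget(NUM_DEVICES, TRAIN_STEPS, WALL_DAYS)
for name, value in budget.items():
    print(f"{name}: {value:.3f}")
```

These are back-of-the-envelope numbers only. They say nothing about batch size, data loading, or how the run would scale on different hardware.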
Got it, thank you ~!
Thank you for open-sourcing your wonderful work. I am curious how many TPUs were used and how long it took to train this marvelous model.