google-research / multilingual-t5

Apache License 2.0

Configuration required for fine-tuning mt5-large, xl, xxl #104

Closed asartipi13 closed 2 years ago

asartipi13 commented 2 years ago

Hi everyone,

I have a question. What are the minimum or recommended server configurations for fine-tuning mT5-Large, -XL, and -XXL? How much GPU RAM does each model need? Thanks in advance for your answers.

craffel commented 2 years ago

In float32, you need 4 bytes per parameter just to store the parameters alone. The Adafactor gradient accumulator variables and the activations cached for backprop take additional memory. However, you should worry about FLOPs more than memory: the memory requirement is not extreme (especially if you use gradient accumulation), but pre-training takes a very long time. XXL took ~21 days on a v3-1024 TPU, which is roughly equivalent to 1024 GPUs; XL and Large take about 4x and 16x less compute, respectively.
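
For a rough sense of the "4 bytes per parameter" figure above, here is a minimal sketch that estimates the float32 footprint of the parameters alone. The parameter counts are approximate published sizes for the mT5 checkpoints (my assumption, not stated in this thread), and the estimate deliberately ignores optimizer state and activations, which add more on top.

```python
# Rough float32 memory estimate for storing the parameters alone
# (4 bytes per parameter, as described in the comment above).
# NOTE: the parameter counts below are approximate public figures for the
# mT5 checkpoints and are assumptions added for illustration.
APPROX_PARAMS = {
    "mt5-large": 1.2e9,
    "mt5-xl": 3.7e9,
    "mt5-xxl": 13e9,
}

BYTES_PER_PARAM_FP32 = 4  # float32

for name, n_params in APPROX_PARAMS.items():
    gib = n_params * BYTES_PER_PARAM_FP32 / 2**30
    print(f"{name}: ~{gib:.1f} GiB for parameters alone "
          f"(before optimizer state and activations)")
```

Under these assumed parameter counts, this prints roughly 4.5 GiB for Large, 14 GiB for XL, and 48 GiB for XXL, which is why the comment above points to compute (FLOPs) rather than memory as the dominant cost.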