Closed asartipi13 closed 2 years ago
In float32, each parameter takes 4 bytes, so you need 4x as many bytes as there are parameters just to hold the weights. The Adafactor gradient-accumulator variables and the activations cached for backprop take additional memory on top of that. However, you should worry about FLOPs more than memory: this is not an extreme memory requirement (especially if you use gradient accumulation), but pre-training takes a very long time (XXL was ~21 days on a v3-1024 TPU, which is roughly equivalent to 1024 GPUs; XL and Large take about 4x and 16x less compute, respectively).
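As a rough back-of-the-envelope check, the sketch below estimates the float32 weight memory for each mT5 size. The parameter counts (Large ~1.2B, XL ~3.7B, XXL ~13B) and the training-overhead multiplier are assumptions for illustration, not figures from this thread; actual fine-tuning memory also depends on optimizer state, activations, batch size, and sequence length.

```python
# Rough float32 memory estimate for mT5 checkpoints.
# Parameter counts are approximate; the "training overhead" factor is a
# loose assumption covering optimizer state, gradients, and cached
# activations, not a measured value.

PARAM_COUNTS = {
    "mT5-Large": 1.2e9,
    "mT5-XL":    3.7e9,
    "mT5-XXL":  13.0e9,
}

BYTES_PER_PARAM_FP32 = 4   # float32 = 4 bytes per parameter
TRAINING_OVERHEAD = 3.0    # assumed multiplier for optimizer state,
                           # gradients, and activations (rough guess)

for name, n_params in PARAM_COUNTS.items():
    weights_gb = n_params * BYTES_PER_PARAM_FP32 / 1024**3
    train_gb = weights_gb * TRAINING_OVERHEAD
    print(f"{name:10s}  weights ≈ {weights_gb:5.1f} GiB  "
          f"fine-tuning ≈ {train_gb:6.1f} GiB (very rough)")
```

Techniques such as gradient accumulation, mixed precision, or sharding the model across devices can bring the per-GPU requirement below these rough totals.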
Hi everyone,
I have a question: what are the minimum or recommended server configurations for fine-tuning mT5-Large, XL, and XXL? How much GPU RAM do they need? Thanks in advance for your answers.