jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Figure 1 clarification on batch size and sequence length #57

Open psandovalsegura opened 4 months ago

psandovalsegura commented 4 months ago

In Figure 1, what are the batch size, sequence length, and vocabulary size? They aren't clear from the caption. I would expect activations to take up more space. From what I can tell, the settings imply a batch size of 256, a sequence length of 2048, and a vocabulary size of 32,000.

At those settings, the logits of the LLaMA model alone should take 256 * 2048 * 32000 * 2 bytes, or 31.25 GB in BF16. Where is this required memory in Figure 1?
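
For reference, here is that arithmetic as a small script. The batch size, sequence length, and 2-byte (BF16) element size are my assumptions, not values stated in the paper:

```python
# Rough size of the output logits alone, under my assumed settings.
batch_size = 256       # assumed batch size (sequences)
seq_len = 2048         # assumed sequence length
vocab_size = 32000     # LLaMA tokenizer vocabulary
bytes_per_elem = 2     # BF16

logits_bytes = batch_size * seq_len * vocab_size * bytes_per_elem
print(f"logits: {logits_bytes / 1024**3:.2f} GiB")  # ~31.25 GiB
```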

Thanks!

psandovalsegura commented 3 months ago

Even if "token batch size" means the input is a single sequence of 256 tokens, I am unable to reproduce 2 GB of activation memory.
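
For what it's worth, here is the back-of-the-envelope estimate I'm working from. It is only a sketch: it counts one BF16 hidden-state tensor per transformer layer plus the logits, assumes LLaMA-7B shapes (hidden size 4096, 32 layers, vocabulary 32,000), and ignores attention scores and MLP intermediates, so the true number depends on which tensors are kept for backward. Even with generous undercounting, it lands far below 2 GB:

```python
# Back-of-the-envelope activation estimate for LLaMA-7B (assumed shapes),
# counting only one hidden-state tensor per layer plus the logits, in BF16.
batch_size = 1        # "token batch size" of 256 read as a single sequence
seq_len = 256
hidden_size = 4096    # LLaMA-7B (assumed)
num_layers = 32       # LLaMA-7B (assumed)
vocab_size = 32000
bytes_per_elem = 2    # BF16

hidden_states = batch_size * seq_len * hidden_size * num_layers * bytes_per_elem
logits = batch_size * seq_len * vocab_size * bytes_per_elem
total = hidden_states + logits

print(f"hidden states: {hidden_states / 1024**2:.1f} MiB")  # ~64 MiB
print(f"logits:        {logits / 1024**2:.1f} MiB")         # ~15.6 MiB
print(f"total:         {total / 1024**2:.1f} MiB")          # well under 2 GB
```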