In Figure 1, what are the batch size, sequence length, and vocab size? It isn't clear from the caption. I would expect activations to take up more space. From what I can tell:
batch size seems to be 256, based on the Fig. 1 caption
sequence length seems to be 2048, based on footnote 1
With Llama's vocab size of 32000, the logits alone should take up 256 * 2048 * 32000 * 2 bytes, i.e. 31.25 GiB (quick sanity check below). Where is this required memory in Figure 1? Thanks!
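For reference, a minimal sketch of that arithmetic, assuming fp16/bf16 logits (2 bytes per element) and the sizes guessed above:

```python
# Rough estimate of logits memory, assuming fp16/bf16 (2 bytes/element)
batch_size = 256      # from the Fig. 1 caption (assumption)
seq_len = 2048        # from footnote 1 (assumption)
vocab_size = 32000    # Llama tokenizer vocab size
bytes_per_elem = 2    # fp16/bf16

logits_bytes = batch_size * seq_len * vocab_size * bytes_per_elem
print(f"{logits_bytes / 2**30:.2f} GiB")  # -> 31.25 GiB
```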