You should definitely be able to fit the FP16 version of Llama2-70B within 8x80GB. The problem in this case seems to be that it's reserving 68 GB per GPU for the attention weights matrix (2 buffers of 64 attention heads, 16384**2 elements each, 2 bytes per element).
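Multiplying those numbers out shows where the ~68 GB figure comes from (a quick back-of-the-envelope check, not a measurement):

```python
# 2 buffers x 64 heads x 16384^2 elements x 2 bytes per element (FP16)
attn_bytes = 2 * 64 * 16384**2 * 2
print(f"{attn_bytes / 1e9:.1f} GB")  # ~68.7 GB per GPU
```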
Installing flash-attn should reduce that allocation somewhat, but another option is to simply reduce max_input_len and max_attention_size (e.g. back to the defaults of 2048 and 2048**2). You will still be able to run inference on 16384-token sequences; they will simply be processed in chunks, substantially reducing the VRAM required for temporary buffers and attention weights.
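For reference, a minimal sketch of how those settings might look with the ExLlamaV2 Python loader (the model path is a placeholder, and attribute names can differ slightly between exllama/exllamav2 versions, so treat this as an outline rather than the exact code used here):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

config = ExLlamaV2Config()
config.model_dir = "/path/to/Llama2-70B-fp16"  # placeholder path
config.prepare()

config.max_seq_len = 16384             # full context length to support
config.max_input_len = 2048            # process long prompts in 2048-token chunks
config.max_attention_size = 2048 ** 2  # caps the attention-weights buffer per chunk

model = ExLlamaV2(config)
model.load([75] * 8)                   # per-GPU split in GB, as in the report below
cache = ExLlamaV2Cache(model)          # KV cache still spans the full 16384 tokens
```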
It's a tradeoff of course, with attention on eight 2048-token chunks taking a bit more time than a single 16384-token chunk (though the output is the same), but without flash-attn available it does save 67 GB of VRAM per GPU, so it's probably worth considering. Of course with 8x80GB available you could probably go a lot higher, but there are diminishing returns at some point, whereas the attention matrix keeps scaling quadratically.
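The ~67 GB saving follows from the same arithmetic at the default chunk size:

```python
full_ctx = 2 * 64 * 16384**2 * 2  # ~68.7 GB, as computed above
chunked  = 2 * 64 * 2048**2 * 2   # ~1.1 GB with 2048-token chunks
print(f"{(full_ctx - chunked) / 1e9:.1f} GB saved per GPU")  # ~67.6 GB
```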
Thank you so much! I will try it again.
Hello, I attempted to use exllama2 with the Llama2-70B model on A800 80GB GPUs. Since the model I trained has a context length of 16k, during model loading I set both max_seq_len and max_input_len to 16384, and max_attention_size to max_seq_len squared. The GPU split is [75]*8. However, I encountered an error when using a single machine with 8 cards:
Checking the data shows the following:
In this case, if I don't want to quantize the model, are there any other solutions?