Closed: Ben-Epstein closed this issue 6 months ago.
> With cuda-fp16, anything larger than 4096 gives me a memory allocation error, which is surprising because I can run 8k models easily.
There are a couple of options to avoid the memory allocation error.

- Set `past_present_share_buffer = false` in `genai_config.json`. This will disable pre-allocating the KV caches to the maximum possible size. The KV caches are pre-allocated with size `(batch_size, num_heads, max_length, head_size)`, so for the Phi-3 mini model with 128K context length, the KV caches are of size `(batch_size, 32, 131072, 96)`.
- Keep `past_present_share_buffer = true` to maintain the best performance, and reduce `max_length` instead.
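Both settings live in the `search` section of `genai_config.json`. As a rough sketch (other fields omitted; the values shown reflect the 128K defaults described below), the relevant part looks something like this:

```json
{
  "search": {
    "max_length": 131072,
    "past_present_share_buffer": true
  }
}
```

Flipping `past_present_share_buffer` to `false`, or lowering `max_length` (e.g. to 4096), addresses the allocation error; since `max_length` is the third dimension of the pre-allocated KV caches, a smaller value shrinks the allocation proportionally.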
> And when I use cuda-int4-rtn-block-32, I get this error:
> OrtException: Non-zero status code returned while running GroupQueryAttention node. Name:'/model/layers.0/attn/GroupQueryAttention' Status Message: cos_cache dimension 0 should be of max_sequence_length.
This error can happen with the Phi-3 mini 128K model in the following scenario.

- `past_present_share_buffer = true`
- Prompt length is less than 4K
- Prompt length + generation length is greater than 4K

There are a couple of options to avoid this error.

- Set `past_present_share_buffer = false`. This will avoid the above error and the memory allocation error.
- Reduce `max_length` to 4K.
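If you prefer not to edit the config file, the same options can usually be overridden at runtime. A minimal sketch with the Python bindings (the model path is a placeholder, and whether `past_present_share_buffer` is accepted as a runtime search option depends on the release, so editing `genai_config.json` is the safe fallback):

```python
import onnxruntime_genai as og

model = og.Model("path/to/phi-3-mini-128k-instruct/cuda-int4-rtn-block-32")
params = og.GeneratorParams(model)

# Option 1: keep buffer sharing but cap the pre-allocated KV cache length at 4K.
params.set_search_options(max_length=4096)

# Option 2: turn buffer sharing off instead (also settable in genai_config.json).
# params.set_search_options(past_present_share_buffer=False)
```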
Why would these cause an issue? Just trying to get a better understanding.
It is caused by a check in ONNX Runtime that previously required the first dimension of the cos/sin caches in the rotary embeddings to be the same size as the third dimension of the KV caches. When buffer sharing is enabled, the third dimension of the KV caches is set to 131072 by default since that is the default value for `max_length`. Depending on the prompt length passed to the model, the first dimension of the cos/sin caches can either be 4096 or 131072. When the prompt length is less than 4096, the cos/sin cache that is selected is the one where the first dimension is of size 4096. The check will see that `cos_dims[0] = 4096`, `present_sequence_length = 131072`, and `cos_dims[0] < present_sequence_length` is true, so the error will get raised.
Because `present_sequence_length` just has to be greater than 4096 for this error to occur, this means that for a prompt length less than 4096, any value for `max_length` that is greater than 4096 will cause this error. Since this error isn't really a limitation of the max length but rather of how the KV caches are pre-allocated, I described this as "prompt length + generation length is greater than 4K" instead.
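To make that concrete, here is a toy sketch in Python of how the old check plays out with these numbers (purely illustrative, not the actual ONNX Runtime code):

```python
max_length = 131072          # default max_length for the Phi-3 mini 128K model
prompt_length = 2000         # anything under 4096 selects the short cos/sin cache

present_sequence_length = max_length                   # KV caches pre-allocated to max_length
cos_dims_0 = 4096 if prompt_length < 4096 else 131072  # first dim of the selected cos/sin cache

# Old check: the cos/sin cache must cover the whole pre-allocated KV cache length.
if cos_dims_0 < present_sequence_length:
    raise RuntimeError("cos_cache dimension 0 should be of max_sequence_length.")
```

With buffer sharing on and a short prompt, the condition trips for any `max_length` above 4096, which is why capping `max_length` at 4K also works around it.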
The check has been fixed in this PR. If you build ONNX Runtime from source and then build ONNX Runtime GenAI from source, the above errors should go away and you should be able to use `past_present_share_buffer = true`.
@kunal-vaishnavi thanks so much, this worked perfectly. For now I'll run with `past_present_share_buffer=False` until the next release :)
I've been testing out phi3-128k, but am running into issues using larger context windows (>4000).

With cuda-fp16, anything larger than 4096 gives me a memory allocation error, which is surprising because I can run 8k models easily.

And when I use cuda-int4-rtn-block-32, I get this error. Here's some code I used, and then the error I get back.
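For context, a minimal version of that kind of script (assuming the Python API purely for illustration; the model path and prompt are placeholders, and the exact generator calls differ a bit between onnxruntime-genai releases):

```python
import onnxruntime_genai as og

model = og.Model("path/to/phi-3-mini-128k-instruct/cuda-int4-rtn-block-32")
tokenizer = og.Tokenizer(model)

# A prompt longer than ~4K tokens is what triggers the behavior described above.
prompt = "<|user|>\n" + "..." + "<|end|>\n<|assistant|>\n"

params = og.GeneratorParams(model)
params.set_search_options(max_length=8192)
params.input_ids = tokenizer.encode(prompt)

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```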
Can you help me understand what that error is? Here is my nvidia config; I have a T4.
Thanks!