f2bo opened this issue 2 weeks ago
Thank you for opening the issue and sharing your experience.
`onnxruntime-genai` uses the key-value cache for running inference on the models. The key-value cache buffer is allocated ahead of time (when `past_present_share_buffer` is `true` in the `genai_config.json`), and the size of the allocated buffer is determined by `max_length` as defined by the user. The default value of `max_length` is equal to the context length of the model.
In essence, you're seeing this issue because onnxruntime-genai tries to allocate a buffer with shape `[batch_size, num_key_value_heads, max_length, head_size]` => `[1, 32, 128K, 96]` for the phi3-128k model. As you can imagine, this might be too big a buffer for your machine and hence results in a bad alloc. On the other hand, the phi3-4k model has a key-value cache of `[1, 32, 4K, 96]`, which can be allocated ahead of time on your machine.
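To put rough numbers on it (assuming the fp16 DirectML variant and Phi-3-mini's 32 transformer layers): each layer keeps a key and a value tensor of that shape, so the full cache is about 32 layers × 2 tensors × (1 × 32 × 131072 × 96) elements × 2 bytes ≈ 48 GB allocated up front, versus roughly 1.5 GB for the 4k model.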
As you have already found, the way to resolve this issue is to set `max_length` to something reasonable for your machine.
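For reference, both knobs mentioned above live in the model's `genai_config.json`. A trimmed sketch of the relevant fields (values illustrative; the real file carries many more entries):

```json
{
  "model": {
    "context_length": 131072
  },
  "search": {
    "max_length": 131072,
    "past_present_share_buffer": true
  }
}
```

Overriding `max_length` per request (e.g., via `GeneratorParams.SetSearchOption("max_length", ...)` in the C# API, or Semantic Kernel's `MaxTokens`) caps the pre-allocated cache without touching the file.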
Hi @baijumeswani. Thank you for your reply. Wouldn't it be possible to provide a more informative message rather than simply `bad allocation`? I was able to troubleshoot this issue because I already had some working code and could compare it with the failing Semantic Kernel code. Otherwise, I think it would have been difficult to identify the cause.
Thanks!
You're right. It probably makes sense to provide more meaningful and actionable error messages to the end user. I will address this.
**Describe the bug**
I've been using the Phi-3-mini-128k-instruct-onnx model successfully in an application with the `Microsoft.ML.OnnxRuntimeGenAI.DirectML` package for some time. Recently, I began exploring Semantic Kernel and its ONNX connector instead and was surprised to find that it crashed with a `bad allocation` exception when using the same model, given that Semantic Kernel also depends on the `OnnxRuntimeGenAI` package. In contrast, after replacing the model with the Phi-3-mini-4k-instruct-onnx model, it ran successfully.

After comparing what the Semantic Kernel connector was doing with my original code, I managed to narrow it down to having missed setting a value for the `max_length` option when configuring the connector. In SK, this is named `MaxTokens` in the `OnnxRuntimeGenAIPromptExecutionSettings` class (see the sketch below).

UPDATE: added stack trace
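For anyone landing here from Semantic Kernel, a sketch of the execution settings with the value supplied (assuming the `Microsoft.SemanticKernel.Connectors.Onnx` package; 2048 is an arbitrary example):

```csharp
using Microsoft.SemanticKernel.Connectors.Onnx;

var settings = new OnnxRuntimeGenAIPromptExecutionSettings
{
    // Maps to onnxruntime-genai's max_length search option; leaving this
    // unset is what led to the oversized KV-cache allocation described above.
    MaxTokens = 2048,
};
```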
**To Reproduce**
The issue can be reproduced with the `OnnxRuntimeGenAI` package directly using the following code and commenting out the line that sets `max_length`:
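The snippet itself didn't survive into this copy of the issue; below is a minimal sketch of an equivalent repro, assuming the 0.4-era `Microsoft.ML.OnnxRuntimeGenAI` C# API, with a placeholder model path and prompt and an arbitrary 2048 for `max_length`:

```csharp
using System;
using Microsoft.ML.OnnxRuntimeGenAI;

// Placeholder path to a local copy of Phi-3-mini-128k-instruct-onnx (DirectML build).
var modelPath = @"C:\models\Phi-3-mini-128k-instruct-onnx";

using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);
using var tokens = tokenizer.Encode("<|user|>\nHello<|end|>\n<|assistant|>");

using var generatorParams = new GeneratorParams(model);
// Commenting out this line leaves max_length at the model's 131072-token
// context length, so the up-front KV-cache allocation fails with "bad allocation".
generatorParams.SetSearchOption("max_length", 2048);
generatorParams.SetInputSequences(tokens);

using var generator = new Generator(model, generatorParams);
while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();
}
Console.WriteLine(tokenizer.Decode(generator.GetSequence(0)));
```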
**Expected behavior**
No exception. Alternatively, if the `max_length` parameter isn't optional, then a clearer exception than `bad allocation`.

**Desktop (please complete the following information):**