microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

OnnxRuntimeGenAIException: 'bad allocation' when option `max_length` is not set #980

Open f2bo opened 2 weeks ago

f2bo commented 2 weeks ago

Describe the bug

I've been using the Phi-3-mini-128k-instruct-onnx model successfully in an application with the Microsoft.ML.OnnxRuntimeGenAI.DirectML package for some time. Recently, I began exploring Semantic Kernel and its ONNX connector instead, and was surprised to find that it crashed with a bad allocation exception when using the same model, given that Semantic Kernel also depends on the OnnxRuntimeGenAI package. In contrast, when I replaced the model with Phi-3-mini-4k-instruct-onnx, it ran successfully.

After comparing what the Semantic Kernel connector was doing with my original code, I managed to narrow it down to the fact that I had not set a value for the max_length option when configuring the connector. In SK, this option is named MaxTokens in the OnnxRuntimeGenAIPromptExecutionSettings class.
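For reference, the connector setting looks roughly like this. This is only a minimal sketch, not taken from the failing code; the namespace and the value 2048 are assumptions, and MaxTokens maps to the underlying max_length search option.

using Microsoft.SemanticKernel.Connectors.Onnx;   // namespace assumed

// Sketch: give the ONNX connector an explicit MaxTokens so max_length is not
// left at the model's full context length.
var settings = new OnnxRuntimeGenAIPromptExecutionSettings
{
    MaxTokens = 2048   // illustrative value; pick one that fits your machine
};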

UPDATE: added stack trace

Microsoft.ML.OnnxRuntimeGenAI.OnnxRuntimeGenAIException
  HResult=0x80131500
  Message=bad allocation
  Source=Microsoft.ML.OnnxRuntimeGenAI
  StackTrace:
   at Microsoft.ML.OnnxRuntimeGenAI.Result.VerifySuccess(IntPtr nativeResult)
   at Microsoft.ML.OnnxRuntimeGenAI.Generator..ctor(Model model, GeneratorParams generatorParams)
   at Program.<<Main>$>d__0.MoveNext() in ...
   at Program.<Main>(String[] args)

To Reproduce

The issue can be reproduced with the OnnxRuntimeGenAI package directly using the following code and commenting out the line that sets max_length.

// download from https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx and set modelPath
string modelPath = "..... \\microsoft\\Phi-3-mini-128k-instruct-onnx\\cpu_and_mobile\\cpu-int4-rtn-block-32-acc-level-4";

using Model model = new Model(modelPath);
using Tokenizer tokenizer = new Tokenizer(model);

string prompt = "Tell me a joke";
var sequences = tokenizer.Encode($"<|user|>{prompt}<|end|><|assistant|>");

using GeneratorParams generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 8192);            // <==== COMMENT OUT THIS LINE TO TRIGGER THE EXCEPTION
generatorParams.SetInputSequences(sequences);

using var tokenizerStream = tokenizer.CreateStream();
using var generator = new Generator(model, generatorParams);
while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();
    Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
}

Expected behavior

No exception. Alternatively, if the max_length parameter isn't optional, then a clearer exception than bad allocation.


baijumeswani commented 2 weeks ago

Thank you for opening the issue and sharing your experience.

onnxruntime-genai uses a key-value cache for running inference on the models. The key-value cache buffer is allocated ahead of time (when past_present_share_buffer is set to true in genai_config.json), and the size of the allocated buffer is determined by max_length as defined by the user. The default value of max_length is equal to the context length of the model.

In essence, you're seeing the issue because onnxruntime-genai tries to allocate a buffer with shape [batch_size, num_key_value_heads, max_length, head_size] => [1, 32, 128K, 96] for the phi3-128k model. As you can imagine, this may be too large a buffer for your machine and hence results in a bad alloc.

On the other hand, the phi3-4k model has a key-value cache of shape [1, 32, 4K, 96], which can be allocated ahead of time on your machine.
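For a rough sense of scale, the per-layer shapes above translate into the following back-of-the-envelope totals. This is only an estimate; the layer count (32) and the fp16 element size (2 bytes) are assumptions about the Phi-3-mini ONNX models and are not stated in this thread.

// Back-of-the-envelope KV-cache size. The 32 layers and 2-byte (fp16)
// elements are assumptions; batch size, heads, and head size follow the
// shapes quoted above.
long numLayers = 32, batchSize = 1, numKvHeads = 32, headSize = 96, bytesPerElement = 2;

long CacheBytes(long maxLength) =>
    2 /* keys and values */ * numLayers * batchSize * numKvHeads * maxLength * headSize * bytesPerElement;

Console.WriteLine($"max_length = 131072 (128K): {CacheBytes(131072) / (1024.0 * 1024 * 1024):F1} GiB"); // ~48 GiB
Console.WriteLine($"max_length =   4096   (4K): {CacheBytes(4096) / (1024.0 * 1024 * 1024):F1} GiB");   // ~1.5 GiB
Console.WriteLine($"max_length =   8192   (8K): {CacheBytes(8192) / (1024.0 * 1024 * 1024):F1} GiB");   // ~3.0 GiB

Under these assumptions, the 128K preallocation alone is on the order of tens of gigabytes, while the 4K model (or an explicit max_length such as 8192) stays within a few gigabytes, which matches the behavior reported above.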

As you have already found, the way to resolve this issue is to set max_length to something reasonable for your machine.

f2bo commented 2 weeks ago

Hi @baijumeswani.

Thank you for your reply. Wouldn't it be possible to provide a more informative message rather than simply bad allocation? I was able to troubleshoot this issue because I already had some working code that I could compare with the failing Semantic Kernel code. Otherwise, I think it would have been difficult to identify the cause.

Thanks!

baijumeswani commented 2 weeks ago

You're right. It probably makes sense to provide more meaningful and actionable error messages to the end user. I will address this.