microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] Why does genai run 2x as fast as vanilla managed onnxruntime? #21847

Open elephantpanda opened 3 weeks ago

elephantpanda commented 3 weeks ago

Describe the issue

I am running phi3-mini-int4 using the usual onnxruntime C# API and it is 2x as slow as when I use the GenAI code. I am using the DirectML C# managed API and am testing it with sequence_length=1 each iteration, using bound inputs and outputs. Basically I am just calling this in a loop (not changing the input each time, for testing), but it is still not as fast as GenAI: session.RunWithBinding(runOptions, binding);

So in that sense I can say well done for making genai so fast. 🙂

On the other hand, I wonder if you can share the settings or source code for things like sessionOptions and so on. GenAI is good, but I really need to use the full capability of the onnxruntime API. Since I believe GenAI is built on top of onnxruntime, it would be nice to be able to see its source code so I can make my app, which uses the onnxruntime API, as fast as the GenAI code.

I am using the managed onnxruntime library from nuget 1.19.1 and it is using the DirectML.dll which was installed with genai.

Thanks for any help you can give.

To reproduce

Running a phi-3 model using the GenAI code and then trying to run the same model using the onnxruntime C# API.

Urgency

No response

Platform

Windows

OS Version

10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19.1

ONNX Runtime API

C#

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

tianleiwu commented 3 weeks ago

Source code of genai: https://github.com/microsoft/onnxruntime-genai.

For example, use I/O binding to bind the past and present KV cache to a fixed buffer. Otherwise, copying the KV cache will slow down generation significantly.
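
For reference, a rough C# sketch of that idea. The names numLayers, numHeads, headDim and maxSeqLen, and the past/present tensor names, are assumptions; check them against the actual model:

    // Sketch only: pre-allocate each layer's KV cache on the DML device once and bind the
    // past input and present output to the same buffer, so the cache is never copied off-device.
    // Requires: using Microsoft.ML.OnnxRuntime;
    var memInfo   = new OrtMemoryInfo("DML", OrtAllocatorType.DeviceAllocator, 0, OrtMemType.Default);
    var allocator = new OrtAllocator(session, memInfo);
    var binding   = session.CreateIoBinding();

    for (int layer = 0; layer < numLayers; layer++)   // numLayers, numHeads, headDim, maxSeqLen are assumed model parameters
    {
        var kvShape = new long[] { 1, numHeads, maxSeqLen, headDim };
        var key   = OrtValue.CreateAllocatedTensorValue(allocator, TensorElementType.Float16, kvShape);
        var value = OrtValue.CreateAllocatedTensorValue(allocator, TensorElementType.Float16, kvShape);

        // Tensor names below follow the usual phi-3 export convention; verify against your model.
        binding.BindInput($"past_key_values.{layer}.key",   key);
        binding.BindInput($"past_key_values.{layer}.value", value);
        binding.BindOutput($"present.{layer}.key",   key);    // present written in place over past
        binding.BindOutput($"present.{layer}.value", value);
    }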

RyanUnderhill commented 2 weeks ago

In case you look at the GenAI code, the GenAI library doesn't use I/O binding but it passes preallocated output OrtValues to the Session::Run() function. This has the same performance benefit, as it avoids copies and allocations. I'm not sure if this is convenient in the C# APIs.
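
For completeness, a rough sketch of that pattern with the C# OrtValue API, assuming names such as vocabSize and maxSteps and input names/values prepared elsewhere; the Run overload that takes preallocated output OrtValues avoids per-step allocations:

    // Requires: using Microsoft.ML.OnnxRuntime;
    // Allocate the logits output once, then reuse it on every Run() call.
    using var logits = OrtValue.CreateAllocatedTensorValue(
        allocator, TensorElementType.Float16, new long[] { 1, 1, vocabSize });   // vocabSize is an assumption

    string[]   outputNames  = { "logits" };
    OrtValue[] outputValues = { logits };

    for (int step = 0; step < maxSteps; step++)
    {
        // inputNames / inputValues are assumed to be set up elsewhere and updated per step.
        session.Run(runOptions, inputNames, inputValues, outputNames, outputValues);
        // ... read logits, pick the next token, update inputValues ...
    }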

elephantpanda commented 2 weeks ago

Thanks, I will try it. BTW I am using the .NET Standard 2.0 API for onnxruntime. I don't know if it would make a difference using a different version like .NET 6.0? (I assumed it wouldn't, since it's mostly just calling functions in the DLL?)

If I can give you some more information about why I want to use the onnxruntime API rather than the GenAI API: mainly, I would like to have more control over manipulating the inputs and outputs, e.g. the input tokens and the output probability vectors, which unfortunately are not accessible currently with the GenAI API (even though it's good for getting up and running fast, which is appreciated). In an ideal world it would be nice if these two libraries had more compatibility, such as using the same tensor format. Thanks.

These are my session options so far, which I tried to copy from the GenAI code. Apart from the execution provider, the other options don't seem to have much effect:

        var options = new SessionOptions();
        options.AppendExecutionProvider_DML();                                    // run on the DirectML execution provider
        options.GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL;   // enable all graph optimizations
        options.AddSessionConfigEntry("ep.dml.enable_graph_capture", "1");        // capture and replay the DML graph where supported
        options.AddSessionConfigEntry("ep.dml.disable_memory_arena", "1");        // disable the DML memory arena
        options.IntraOpNumThreads = 4;                                            // threads for ops that still run on the CPU
        session = new InferenceSession(modelPath + @"\model.onnx", options);
elephantpanda commented 2 weeks ago

I changed the bound output from:

    logits = OrtValue.CreateTensorValueFromMemory(new Float16[VSIZE * inputLength], new long[] { 1, inputLength, VSIZE });

to:

    logits = OrtValue.CreateAllocatedTensorValue(allocator, TensorElementType.Float16, new long[] { 1, inputLength, VSIZE });

using

    var meminfo = new OrtMemoryInfo("DML", OrtAllocatorType.DeviceAllocator, 0, OrtMemType.Default);
    allocator = new OrtAllocator(session, meminfo);

    binding.BindOutput("logits", logits);

But now when I call:

    binding.SynchronizeBoundOutputs()

it slows down again. So back to the drawing board... ☹️ Also, when I try to read the output it crashes...

yuslepukhin commented 2 weeks ago

You do not need IOBinding. With the new OrtValue based API you can achieve the same performance and avoid much of the garbage collection.

https://onnxruntime.ai/docs/tutorials/csharp/basic_csharp.html

elephantpanda commented 2 weeks ago

You do not need IOBinding. With the new OrtValue based API you can achieve the same performance and avoid much of the garbage collection.

https://onnxruntime.ai/docs/tutorials/csharp/basic_csharp.html

OK thanks. Well, that makes things easier. 😊

I'm still not sure why my onnxruntime code is slower than the GenAI code. I'll see if I can share my project. Or if there is already a pure C# onnxruntime API project that someone has made for an LLM, it would be nice to look at it. I think it's actually the model itself that is running faster using the GenAI code. There's probably some trick I missed somewhere. 🤔 Or perhaps it's just the managed .NET runtime that is missing some trick (like, does it support int4?). Or perhaps there's some setting I'm missing when passing back in the cached key values. I'll keep trying at it.

RyanUnderhill commented 2 weeks ago

If I can give you some more information about why I want to use the onnxruntime API rather than the GenAI API: mainly, I would like to have more control over manipulating the inputs and outputs, e.g. the input tokens and the output probability vectors, which unfortunately are not accessible currently with the GenAI API (even though it's good for getting up and running fast, which is appreciated). In an ideal world it would be nice if these two libraries had more compatibility, such as using the same tensor format. Thanks.

We (the GenAI team) have been trying to figure out what types of custom scoring people will be doing so we can keep the API simple; can you share more about what custom scoring you're doing? We have some proposed APIs to return the logits and let you append tokens during the generation loop, but with all of the different providers (CUDA/DirectML/etc.) it's tricky to optimize the data flow to avoid copies.

Some simple pseudocode of what you're doing, perhaps with an imaginary GenAI API, would be great, so we can see if we can make it possible.

elephantpanda commented 2 weeks ago

If I can give you some more information about why I want to use the onnxruntime API rather than the GenAI API: mainly, I would like to have more control over manipulating the inputs and outputs, e.g. the input tokens and the output probability vectors, which unfortunately are not accessible currently with the GenAI API (even though it's good for getting up and running fast, which is appreciated). In an ideal world it would be nice if these two libraries had more compatibility, such as using the same tensor format. Thanks.

We (the GenAI team) have been trying to figure out what types of custom scoring people will be doing so we can keep the API simple; can you share more about what custom scoring you're doing? We have some proposed APIs to return the logits and let you append tokens during the generation loop, but with all of the different providers (CUDA/DirectML/etc.) it's tricky to optimize the data flow to avoid copies.

Some simple pseudocode of what you're doing, perhaps with an imaginary GenAI API, would be great, so we can see if we can make it possible.

Hi, thanks for your reply. Here is an example. Well, one problem I'm having is that sometimes GenAI generates a premature END token, and I want to tell it to pick a different one. In other words, I want to change the probability of certain tokens at various steps, or just have my own custom function to select the token myself given the probabilities.

Also, just for experimentation purposes, to try out different algorithms such as doing my own implementations of beam search or trying out speculative decoding (using a smaller model to predict a few tokens in advance). It is nice to have hard-coded solutions, but I'd also like the flexibility to experiment. For making an app, especially in a game, it is important to be able to experiment, optimise and find different "tricks".

I would be quite happy if there was a function like GetLogitsAtPosition(n) which would return the 36072 probabilities.

Here is some pseudo code for a chat-like model:

Tokens[] A
LOOP
    inputString = GetInputFromUser()
    A += TokenizeInput(inputString)
    model.SetInput(A)
    LOOP
        logits = GenerateLogits()
        output = UseCustomFunctionToSelectToken(logits)
    A += output  // we want to add the output of the LLM also to the next input

So GenAi works great except for a few issues:

  1. Unexplained crash for certain specific sets of input tokens, with the error message not really helping.
  2. Not enough control over token selection.
  3. Not sure of the best way to add more tokens to the input for a chat-like scenario.
  4. No way to inspect or change the NamedTensor object returned by ProcessImages to see what is inside, for debugging and experimentation. (e.g. this creates an input of size about 2500 and it would be nice to experiment on smaller inputs)
  5. Phi-3-vision not yet working for DML, but the documentation page doesn't say it should, assuming that is up to date.
  6. No way to go back a few steps and generate the tokens again.
  7. Can crash, and the error messages are not very specific, just saying something went wrong with a message sent to the GPU. (Perhaps it should fail more gracefully and have more insightful error messages?)

So these are my main roadblocks. For balance, here are my points about why I would like to use GenAI over pure onnxruntime code:

  1. It appears to be about 2x faster (than any code I can write so far)
  2. It is much simpler to use without having to worry about caching, tokenization and other things.

Hope this helps 🙂

yufenglee commented 2 weeks ago

A big improvement from GenAI that is not mentioned above is that the past and present KV cache share the same buffer, i.e., only the KV for newly generated tokens needs to be appended to the existing buffer. This avoids copying the past KV cache.

For the issues with GenAI, we can discuss in detail in the GenAI repo. 1 and 7 look like the same issue; we can track them with https://github.com/microsoft/onnxruntime-genai/issues/833. For 2, could you please add more on it? For 3 and 6, we are working on it, i.e., adding support for interactive decoding. For 5, DML does work for the phi-3-vision model; what issues did you hit? For 4, NamedTensor is an opaque object. If you want to debug, you have to debug with C++.

elephantpanda commented 2 weeks ago

A big improvement from GenAI that is not mentioned above is that the past and present KV cache share the same buffer, i.e., only the KV for newly generated tokens needs to be appended to the existing buffer. This avoids copying the past KV cache.

Interesting, perhaps that is what is giving the big speed up? 🤔 Well, who knows.

For the issues with GenAI, we can discuss in detail in the GenAI repo. 1 and 7 look like the same issue; we can track them with microsoft/onnxruntime-genai#833. For 2, could you please add more on it? For 3 and 6, we are working on it, i.e., adding support for interactive decoding. For 5, DML does work for the phi-3-vision model; what issues did you hit? For 4, NamedTensor is an opaque object. If you want to debug, you have to debug with C++.

  2. As above, I would like to be able to use the logits for a position in my own custom function to select a token. E.g. sometimes the sampling might give me a token I don't want (for example an END token) and I want to choose another one. Or I just want to try out my own method of choosing a token that's not top-p, top-k or one of the pre-defined options. This also relates to this problem. The ability to experiment with different functions is important, I feel.
  3 & 6. That's good; more flexibility in going to different positions to re-generate tokens or add new tokens would be great.
  4. I would expect a NamedTensor object to at least give me the ability to see the names and shapes of the tensors within, even if read-only. Otherwise there doesn't seem to be any point in it existing as a separate entity.
  5. I have put the error for phi-3-vision here.

Thanks.

RyanUnderhill commented 2 weeks ago

This is great to know. So for your case, would these hypothetical APIs let you do what you want?

    OgaTensor generator.GetLogits(); // Return current logits
    generator.AppendToken(token_id); // Manually choose the next token (won't work on beam search as is)

Would OgaTensor always being in CPU memory be a problem or would you expect it to be in DML device memory? We could be more optimal in what tokens to give you if it was the 'TopK' of the logits for example, unless you're doing something really different in the scoring? This would let us do the TopK/TopP on the accelerator and give you the small amount of resulting data to manually pick your tokens from (or override ones you don't want).
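
To make the intent concrete, here is a hedged sketch of how the proposed calls might be used to veto a premature END token. GetLogits(), AppendToken(), the ToArray() accessor, MyCustomSampler and the token ids are all assumptions or proposals, not current GenAI API:

    // Hypothetical usage only, based on the proposed GetLogits()/AppendToken() above.
    int endTokenId  = 32000;   // assumption: the model's END/eos token id
    int minTokens   = 16;      // assumption: minimum length before END is allowed
    int tokensSoFar = 0;

    while (!generator.IsDone())
    {
        generator.ComputeLogits();
        OgaTensor logitsTensor = generator.GetLogits();     // proposed: logits for the current step
        float[] scores = logitsTensor.ToArray();            // assumed accessor, for illustration

        if (tokensSoFar < minTokens)
            scores[endTokenId] = float.NegativeInfinity;    // veto a premature END token

        int next = MyCustomSampler(scores);                 // user-supplied selection function
        generator.AppendToken(next);                        // proposed: manually choose the next token
        tokensSoFar++;
    }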

elephantpanda commented 2 weeks ago

This is great to know. So for your case, would these hypothetical APIs let you do what you want?

    OgaTensor generator.GetLogits(); // Return current logits
    generator.AppendToken(token_id); // Manually choose the next token (won't work on beam search as is)

Would OgaTensor always being in CPU memory be a problem or would you expect it to be in DML device memory? We could be more optimal in what tokens to give you if it was the 'TopK' of the logits for example, unless you're doing something really different in the scoring? This would let us do the TopK/TopP on the accelerator and give you the small amount of resulting data to manually pick your tokens from (or override ones you don't want).

I think that's about right. For me personally, I might prefer something like generator.GetProbabilities(), where it already computes the probabilities using the config file and does all the softmax etc., and then you could maybe override this with different configs: generator.GetProbabilities(options). I don't know if there's any advantage in getting the raw logits, but other people might have different opinions.

As for CPU, from my perspective that doesn't bother me, as it's only 32064 values, which is barely anything. That's just my opinion. And I'd most likely do the calculation on the CPU.

This would get the logits/probability for only one token, although for something like speculative decoding it requires getting the logits for more than one position in the output, so in an ideal world this would be supported too. E.g. generator.GetProbabilitiesForNextNTokensInOutput() might not be possible if the output length is 1(?). You can get a 2-4x speed up with speculative decoding (using a smaller assistant LLM to predict a few tokens ahead), but this is not a deal-breaker 🙂

P.S. As well as AppendToken(), you might as well have a RemoveLastToken(), as that might come in useful.

yufenglee commented 2 weeks ago

A big improvement from GenAI that is not mentioned above is that the past and present KV cache share the same buffer, i.e., only the KV for newly generated tokens needs to be appended to the existing buffer. This avoids copying the past KV cache.

Interesting, perhaps that is what is giving the big speed up? 🤔 Well, who knows.

For the issues with GenAI, we can discuss in detail in the GenAI repo. 1 and 7 look like the same issue; we can track them with microsoft/onnxruntime-genai#833. For 2, could you please add more on it? For 3 and 6, we are working on it, i.e., adding support for interactive decoding. For 5, DML does work for the phi-3-vision model; what issues did you hit? For 4, NamedTensor is an opaque object. If you want to debug, you have to debug with C++.

  2. As above, I would like to be able to use the logits for a position in my own custom function to select a token. E.g. sometimes the sampling might give me a token I don't want (for example an END token) and I want to choose another one. Or I just want to try out my own method of choosing a token that's not top-p, top-k or one of the pre-defined options. This also relates to this problem. The ability to experiment with different functions is important, I feel.
  3 & 6. That's good; more flexibility in going to different positions to re-generate tokens or add new tokens would be great.
  4. I would expect a NamedTensor object to at least give me the ability to see the names and shapes of the tensors within, even if read-only. Otherwise there doesn't seem to be any point in it existing as a separate entity.
  5. I have put the error for phi-3-vision here.

Thanks.

Yes, you can try disabling the past_present_share_buffer option and you will be able to see the difference.

elephantpanda commented 2 weeks ago

I tried it. Unfortunately it gives me an error if I disable it (is this expected?):

   "search": {
        "diversity_penalty": 0.0,
        "do_sample": true,
        "early_stopping": true,
        "length_penalty": 1.0,
        "max_length": 4096,
        "min_length": 0,
        "no_repeat_ngram_size": 0,
        "num_beams": 1,
        "num_return_sequences": 1,
        "past_present_share_buffer": false,
        "repetition_penalty": 1.0,
        "temperature": 1,
        "top_k": 0,
        "top_p": 1.0
    }

Here is the error:

OnnxRuntimeGenAIException: Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: D:\a\_work\1\s\onnxruntime\core\framework\execution_frame.cc:173 onnxruntime::IExecutionFrame::GetOrCreateNodeOutputMLValue shape && tensor.Shape() == *shape was false. OrtValue shape verification failed. Current shape:{1,32,11,96} Requested shape:{1,32,4096,96}

Microsoft.ML.OnnxRuntimeGenAI.Result.VerifySuccess (System.IntPtr nativeResult) (at D:/a/_work/1/onnxruntime-genai/src/csharp/Result.cs:26)
Microsoft.ML.OnnxRuntimeGenAI.Generator.ComputeLogits () (at D:/a/_work/1/onnxruntime-genai/src/csharp/Generator.cs:25)

(The context length is 4096 and my input string comes to 11 tokens) Do I have to pad the input?

yufenglee commented 2 weeks ago

I tried it. Unfortunately it gives me an error if I disable it (is this expected?):

   "search": {
        "diversity_penalty": 0.0,
        "do_sample": true,
        "early_stopping": true,
        "length_penalty": 1.0,
        "max_length": 4096,
        "min_length": 0,
        "no_repeat_ngram_size": 0,
        "num_beams": 1,
        "num_return_sequences": 1,
        "past_present_share_buffer": false,
        "repetition_penalty": 1.0,
        "temperature": 1,
        "top_k": 0,
        "top_p": 1.0
    }

Here is the error:

OnnxRuntimeGenAIException: Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: D:\a\_work\1\s\onnxruntime\core\framework\execution_frame.cc:173 onnxruntime::IExecutionFrame::GetOrCreateNodeOutputMLValue shape && tensor.Shape() == *shape was false. OrtValue shape verification failed. Current shape:{1,32,11,96} Requested shape:{1,32,4096,96}

Microsoft.ML.OnnxRuntimeGenAI.Result.VerifySuccess (System.IntPtr nativeResult) (at D:/a/_work/1/onnxruntime-genai/src/csharp/Result.cs:26)
Microsoft.ML.OnnxRuntimeGenAI.Generator.ComputeLogits () (at D:/a/_work/1/onnxruntime-genai/src/csharp/Generator.cs:25)

(The context length is 4096 and my input string comes to 11 tokens) Do I have to pad the input?

I see, you're using DML. past_present_share_buffer is required for the DML EP.

RyanUnderhill commented 2 weeks ago

I think that's about right. For me personally, I might prefer something like generator.GetProbabilities(), where it already computes the probabilities using the config file and does all the softmax etc., and then you could maybe override this with different configs: generator.GetProbabilities(options). I don't know if there's any advantage in getting the raw logits, but other people might have different opinions.

As for CPU, from my perspective that doesn't bother me, as it's only 32064 values, which is barely anything. That's just my opinion. And I'd most likely do the calculation on the CPU.

This would get the logits/probability for only one token, although for something like speculative decoding it requires getting the logits for more than one position in the output, so in an ideal world this would be supported too. E.g. generator.GetProbabilitiesForNextNTokensInOutput() might not be possible if the output length is 1(?). You can get a 2-4x speed up with speculative decoding (using a smaller assistant LLM to predict a few tokens ahead), but this is not a deal-breaker 🙂

P.S. As well as AppendToken(), you might as well have a RemoveLastToken(), as that might come in useful.

Returning the raw logits is the clearest for an API like this. Softmax is just one of the internal steps that might be used in processing the logits, and there are variations on it.

For speculative decoding, it sounds like you need 'GetLogits()' to be sized to match the number of tokens added. So when adding multiple speculated tokens, you'd get back the same count in the returned logits.

For 'RemoveLastToken()' we are planning on adding a 'Rewind()' function that lets you rewind the generation process by any number of tokens. This should cover what you need.

elephantpanda commented 2 weeks ago

Yes, that sounds like it covers everything 🙂. I can't think of any other things, but other people might have some ideas.

(Just to be clear, with speculative decoding it's getting the logits (or predicted token) from the output for several positions in a single iteration, a new token plus the past N tokens, rather than accumulating it over several iterations, then looking at the past N tokens and seeing which are predicted correctly and rejecting the others.) It's probably not a big deal at the moment since it would require a smaller model compatible with the phi-3 tokenizer, and I'm not sure there is one at the moment. It works best for highly predictable text, like code or speech recognition (like Whisper). I have tried this before with other models and could get up to a 2x speed up, sometimes more. So it's worth supporting, I think, if possible.
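
As a rough illustration of that accept/reject step, under the assumption of hypothetical DraftModel/MainModel/ArgMax helpers (this is just the greedy verification idea, not an existing API):

    // Sketch of greedy speculative-decoding verification (all helpers here are hypothetical).
    int[] draft = DraftModel.PredictTokens(context, 4);                 // small model guesses a few tokens ahead
    float[][] logits = MainModel.GetLogitsForPositions(context, draft); // one logits row per drafted position

    int accepted = 0;
    for (int i = 0; i < draft.Length; i++)
    {
        int best = ArgMax(logits[i]);     // ArgMax: index of the largest value, i.e. the big model's choice at position i
        if (best != draft[i]) break;      // first mismatch: reject this and all later drafted tokens
        accepted++;
    }
    // Keep the accepted prefix (plus one token from the main model), then repeat from there.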

There are also even more complicated versions of this using batches, which I just learned about today!

Another thing logits would be useful for is calculating the average "confidence score" of a sentence, by doing some average over the probabilities that were used to select each token.
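
For example, a minimal sketch of one way to compute such a score, assuming the per-token probabilities of the chosen tokens are available: the geometric mean of those probabilities, i.e. the exponential of the average log-probability.

    // Requires: using System; using System.Collections.Generic;
    // Sketch: average "confidence" of a generated sequence from the chosen tokens' probabilities.
    double AverageConfidence(IReadOnlyList<float> chosenProbs)
    {
        double sumLog = 0;
        foreach (var p in chosenProbs)
            sumLog += Math.Log(p);
        return Math.Exp(sumLog / chosenProbs.Count);   // geometric mean of the per-token probabilities
    }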