microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

Certain prompts crash for Phi 3 mini int4 DML (with simpler example provided) #833

Open elephantpanda opened 2 months ago

elephantpanda commented 2 months ago

(C#, DirectML, int4 Phi 3 mini ONNX) Using the GenAI API.

Certain very specific prompts crash, although I haven't yet found a pattern. It isn't to do with the length of the prompt either, since certain short prompts (~300 tokens) crash and some longer ones (~500 tokens) crash:

<|user|>Let us role play. I am a human and you are a friendly, intelligent, witty robot called Robbie who lives in Texas. (You can make up a backstory about yourself). You always try to give the right answers. From now on just reply as if you are the robot without further explanation.<|end|><|assistant|>Hi, Bob. I am a friendly robot. What would you like to know?<|end|><|user|>Describe the image.<|end|><|assistant|> As a robot, I can provide guidance but an image description would require visual data. Given a hypothetical situation however, imagine standing in the heart of Texas, on a sunny afternoon. Beautiful blue sky, bright sun, and a picturesque view of a vast open field as far as the eye can see. In the foreground, a table is set for two with a classic American dinner spread - a long, savory BBQ brisket paired with rich, corn-based polenta and a delectable batch of Texas-style corn pudding. Around the table sits a couple of human friends enjoying good company, sipping homemade, refreshing lemonade, and indulging in friendly, lighthearted conversation. How does that sound?<|end|><|user|>That sounds swell.<|end|><|assistant|> It certainly is lovely to imagine, Bob! Every detail has its charm - the rustic table setting, the BBQ dinner, the well-rounded Texan cuisine, and most importantly, the joy shared by good company. In my case, the rustic aesthetics of Texas provide an inspiring programming canvas. Whether it's the desert landscapes or Texas cuisine, Texas indeed has its unique appeal.<|end|><|user|>Tell, me, in one word. What is the capital of France?<|end|><|assistant|>Paris, Bob. However, since you requested a one-word answer, it would be "Paris". This city is known as not just the capital but also the heart of French culture, art, and history. As a robot, I have an extensive database on globally recognized facts, safeguarding the depth and detail of knowledge like Paris, as the capital, is embedded in every learning algorithm of mine. I reserve all knowledge to encourage learning, exploration, and shared wisdom.<|end|><|user|>Thankyou.<|end|><|assistant|><|end|><|user|>Anything else you'd like to say?<|end|><|assistant|>

This prompt crashes after generating one token and gives:

OnnxRuntimeGenAIException: D:\a\_work\1\onnxruntime-genai\src\dml\dml_command_recorder.cpp(143)\onnxruntime-genai.dll!00007FFF2F126323: (caller: 00007FFF2F127E45) Exception(10) tid(2040) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.

Microsoft.ML.OnnxRuntimeGenAI.Result.VerifySuccess (System.IntPtr nativeResult) (at D:/a/_work/1/onnxruntime-genai/src/csharp/Result.cs:26)
Microsoft.ML.OnnxRuntimeGenAI.Generator.ComputeLogits () (at D:/a/_work/1/onnxruntime-genai/src/csharp/Generator.cs:25)
Main.Generate () (at Assets/Main.cs:174)
System.Threading.Tasks.Task.InnerInvoke () (at <9d9536d9127f4a489d989c7a566aee1c>:0)
System.Threading.Tasks.Task.Execute () (at <9d9536d9127f4a489d989c7a566aee1c>:0)
--- End of stack trace from previous location where exception was thrown ---
Main.Update () (at Assets/Main.cs:152)
System.Runtime.CompilerServices.AsyncMethodBuilderCore+<>c.<ThrowAsync>b__7_0 (System.Object state) (at <9d9536d9127f4a489d989c7a566aee1c>:0)
UnityEngine.UnitySynchronizationContext+WorkRequest.Invoke () (at <120b8d04add741329ccd415c000fb666>:0)
UnityEngine.UnitySynchronizationContext.Exec () (at <120b8d04add741329ccd415c000fb666>:0)
UnityEngine.UnitySynchronizationContext.ExecuteTasks () (at <120b8d04add741329ccd415c000fb666>:0)

If I take away a few words or add a few more, it will not crash. I'm not sure why, or whether it is tied to certain specific input lengths.

yufenglee commented 2 months ago

Are you generating the int4 model or fp16 model? Could you please share the command you used to build the model?

elephantpanda commented 2 months ago

This is the prebuilt model here, and I'm using code adapted from the C# GenAI API examples to run it (with sampling set to false).

I suspect it's either a bug in the ComputeLogits function or in the onnxruntime DirectML library.

P5000 GPU. Windows 10. 12GB RAM. 16GB VRAM.

elephantpanda commented 2 months ago

(Same bug in version 0.4.0)

elephantpanda commented 2 months ago

Here is a simpler example. I just generate an input of 360 zeros for the input IDs:


int[] AllSame(int len, int val)
{
    int[] result = new int[len];
    for (int i = 0; i < len; i++) result[i] = val;
    return result;
}

generatorParams.SetInputIDs(AllSame(360,0), (ulong)360, 1);

This always crashes on ComputeLogits() the second time it is called, but:

generatorParams.SetInputIDs(AllSame(360,123), (ulong)360, 1);

Doesn't crash. Also:

generatorParams.SetInputIDs(AllSame(350,0), (ulong)350, 1);

Doesn't crash.

So there's no rhyme or reason to it. Certain inputs crash after generating one token, and others don't. If it crashes, it always crashes the second time ComputeLogits() is called; otherwise it won't crash at all and continues generating tokens.

Can anyone else replicate this? This is very bad ☹️

natke commented 2 months ago

Thank you for raising this. We will look into it. Did you try running the CPU version?

elephantpanda commented 2 months ago

> Thank you for raising this. We will look into it. Did you try running the CPU version?

I am using this model for ONNX DML.

I tried it in CPU mode by setting the provider options in the config to []. It is very slow because it maxes out my RAM, but there is no crash as far as I can see using the inputs above, so it looks like it only crashes in DML mode.
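For reference, that change is just emptying the provider_options list in the decoder's session_options (the full genai_config.json appears in a later comment below):

    "session_options": {
        "log_id": "onnxruntime-genai",
        "provider_options": []
    }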

(BTW, is there a way to set the configs at runtime rather than editing the config file? That would be useful, so an app user could change the settings in a nice GUI.)

Let me know if you need any more information.

My uneducated guess would be that the first time ComputeLogits() is called with a particular set of tokens, something gets corrupted or some setting is wrong, so the next time it is called it crashes. 🤔 That would kind of explain it.

natke commented 2 months ago

Yes, you can set the configs on the GeneratorParams. You can see an example here: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/csharp/Genny/Genny/Controls/SearchOptionsControl.xaml.cs
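For instance (a minimal sketch of the runtime API, using the SetSearchOption overloads from that sample; the option names mirror the search section of genai_config.json, and the values here are just placeholders):

using GeneratorParams generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 600);   // double overload
generatorParams.SetSearchOption("temperature", 0.7);  // double overload
generatorParams.SetSearchOption("do_sample", false);  // bool overload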

natke commented 2 months ago

And thank you for the extra information. We are following up

elephantpanda commented 2 months ago

Hi @natke Did you manage to replicate the bug? If not, I could try to send you some more code or a sample application.

AFAIK this is basically my only roadblock to using GenAI in production.

skyline75489 commented 2 months ago

Hi @elephantpanda Can you provide us with more code including the detailed config? For example min_length and max_length?

elephantpanda commented 2 months ago

> Hi @elephantpanda Can you provide us with more code including the detailed config? For example min_length and max_length?

Hi, I've found that the same error also happens with Python code for certain prompts, so I don't think it is C#-specific, just DML-specific:

import onnxruntime_genai as ai

print("Version = " + str(ai.__version__))
print("DML available=" + str(ai.is_dml_available()))

model = ai.Model(r"D:\Phi3Onnx")  # raw string so the backslash isn't treated as an escape
# change this to N=1 to make it pass
N = 0
print("Doing Test " + str(N))

generatorParams = ai.GeneratorParams(model)
generatorParams.input_ids = [N] * 500
generator = ai.Generator(model, generatorParams)

for n in range(5):
    print(n)
    generator.compute_logits()
    generator.generate_next_token()

print("TEST " + str(N) + " PASSED")

Output:

Version = 0.4.0
DML available=True
Doing Test 0
0
1
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    generator.compute_logits()
RuntimeError: D:\a\_work\1\onnxruntime-genai\src\dml\dml_command_recorder.cpp(143)\onnxruntime_genai.cp38-win_amd64.pyd!00007FFB6D37D493: (caller: 00007FFB6D371395) Exception(1) tid(560) 887A0006 The GPU will not respond to more commands, most likely because of an invalid command passed by the calling application.

Changing the line to N=1 makes it not crash:

Version = 0.4.0
DML available=True
Doing Test 1
0
1
2
3
4
TEST 1 PASSED

As you can see, when it crashes, it crashes the second time compute_logits() is called; otherwise it will continue indefinitely. You can experiment with different values of N and different lengths to see which inputs pass and which fail. It is not just to do with the length. Sometimes you have to run the test twice to get the error.

My VRAM (16GB, P5000 GPU) only ever gets to about 25% usage.

The model is from here and this is the config file:

{
    "model": {
        "bos_token_id": 1,
        "context_length": 4096,
        "decoder": {
            "session_options": {
                "log_id": "onnxruntime-genai",
                "provider_options": [
                    {
                        "dml": {}
                    }
                ]
            },
            "filename": "model.onnx",
            "head_size": 96,
            "hidden_size": 3072,
            "inputs": {
                "input_ids": "input_ids",
                "attention_mask": "attention_mask",
                "past_key_names": "past_key_values.%d.key",
                "past_value_names": "past_key_values.%d.value"
            },
            "outputs": {
                "logits": "logits",
                "present_key_names": "present.%d.key",
                "present_value_names": "present.%d.value"
            },
            "num_attention_heads": 32,
            "num_hidden_layers": 32,
            "num_key_value_heads": 32
        },
        "eos_token_id": [
            32000,
            32001,
            32007
        ],
        "pad_token_id": 32000,
        "type": "phi3",
        "vocab_size": 32064
    },
    "search": {
        "diversity_penalty": 0.0,
        "do_sample": false,
        "early_stopping": true,
        "length_penalty": 1.0,
        "max_length": 4096,
        "min_length": 0,
        "no_repeat_ngram_size": 0,
        "num_beams": 1,
        "num_return_sequences": 1,
        "past_present_share_buffer": true,
        "repetition_penalty": 1.0,
        "temperature": 1,
        "top_k": 0,
        "top_p": 1.0
    }
}

Also, in CPU mode there is no error (although it does use a huge amount of RAM).

skyline75489 commented 1 month ago

With the C# example, I'm seeing another error on RTX3060:

Microsoft.ML.OnnxRuntimeGenAI.OnnxRuntimeGenAIException: 'Non-zero status code returned while running DmlFusedNode_0_0 node. Name:'DmlFusedNode_0_0' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlGraphFusionHelper.cpp(353)\onnxruntime.dll!00007FFCAA7F254C: (caller: 00007FFCAA7EB7CF) Exception(1) tid(32cc) 80070057 The parameter is incorrect. '

skyline75489 commented 1 month ago

@elephantpanda I cannot repro with the Python code you provided above, though. It looks like the card & driver combination matters.

elephantpanda commented 1 month ago

> @elephantpanda I cannot repro with the Python code you provided above, though. It looks like the card & driver combination matters.

Yes, each GPU could fail differently on the same issue.

If you want the exact setup I'm using, you can replicate it here or here. (Standard setup, P5000 GPU, equivalent to an NVIDIA GTX 1080.) It is quite a common setup, so if it doesn't work on this it's quite worrying. 😬

But perhaps the bug you found will fix the issue.

natke commented 1 month ago

We reproduced this on an NVIDIA RTX A2000. We will keep looking into it

elephantpanda commented 1 month ago

Any update on this bug? Is it expected to be fixed in the next version? (I see no one is assigned to it yet.) Or is it on the back burner? For me, onnxruntime-genai is not stable enough to be used in production unless this is fixed. If it is fixed by Christmas that would be fine by me, but if it is not going to be fixed I need to think of other options. Thank you for your time.

natke commented 1 month ago

We do not have an update on this yet @elephantpanda, but we are trying to get support from the relevant folks. In the meantime, have you tried using the CUDA variants of the library and model? These might unblock you while we fix this issue.

elephantpanda commented 1 month ago

> We do not have an update on this yet @elephantpanda, but we are trying to get support from the relevant folks. In the meantime, have you tried using the CUDA variants of the library and model? These might unblock you while we fix this issue.

Hi, thanks for the update. 🙂 Much appreciated. Yes, I could probably use the CUDA version now for testing. For distribution I would really like to use the DML version, since it targets more devices and has a much smaller library size. So that would be my ideal situation. Thanks again.

PatriceVignola commented 1 month ago

Hi @elephantpanda,

I tried the C# sample with the prompt that you put in this thread on a P2000 and wasn't able to reproduce the failure. I don't have access to a P5000, but the P2000 has the same architecture and is actually weaker, so I would expect to be able to repro on both.

In the binary folder, next to the executable, do you see DirectML.dll? When building the project in Visual Studio, if the "Any CPU" configuration is selected, the DirectML DLLs won't be copied over to the executable's folder, and the app will fall back to using the System32 one instead, which will fail. The x64 configuration needs to be manually selected.
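If it helps, here is a quick sketch (not from the repo, just standard System.Diagnostics) that prints which DirectML.dll the process actually loaded, so a System32 fallback is easy to spot:

using System;
using System.Diagnostics;

foreach (ProcessModule module in Process.GetCurrentProcess().Modules)
{
    // Print the full path of every loaded DirectML.dll
    if (string.Equals(module.ModuleName, "DirectML.dll", StringComparison.OrdinalIgnoreCase))
    {
        Console.WriteLine(module.FileName);
    }
}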

Otherwise, can you paste the exact C# file that you've been using? Here's what I used on the P2000:

// See https://aka.ms/new-console-template for more information
using Microsoft.ML.OnnxRuntimeGenAI;

void PrintUsage()
{
    Console.WriteLine("Usage:");
    Console.WriteLine("  -m model_path");
    Console.WriteLine("  -i (optional): Interactive mode");
}

using OgaHandle ogaHandle = new OgaHandle();

if (args.Length < 1)
{
    PrintUsage();
    Environment.Exit(-1);
}

bool interactive = false;
string modelPath = string.Empty;

uint i = 0;
while (i < args.Length)
{
    var arg = args[i];
    if (arg == "-i")
    {
        interactive = true;
    }
    else if (arg == "-m")
    {
        if (i + 1 < args.Length)
        {
            modelPath = Path.Combine(args[i+1]);
        }
    }
    i++;
}

if (string.IsNullOrEmpty(modelPath))
{
    throw new Exception("Model path must be specified");
}

Console.WriteLine("-------------");
Console.WriteLine("Hello, Phi!");
Console.WriteLine("-------------");

Console.WriteLine("Model path: " + modelPath);
Console.WriteLine("Interactive: " + interactive);

using Model model = new Model(modelPath);
using Tokenizer tokenizer = new Tokenizer(model);

var option = 2;
if (interactive)
{
    Console.WriteLine("Please enter option number:");
    Console.WriteLine("1. Complete Output");
    Console.WriteLine("2. Streaming Output");
    int.TryParse(Console.ReadLine(), out option);
}

do
{
    string prompt = "<|user|>Let us role play. I am a human and you are a friendly, intelligent, witty robot called Robbie who lives in Texas. (You can make up a backstory about yourself). You always try to give the right answers. From now on just reply as if you are the robot without further explanation.<|end|><|assistant|>Hi, Bob. I am a friendly robot. What would you like to know?<|end|><|user|>Describe the image.<|end|><|assistant|> As a robot, I can provide guidance but an image description would require visual data. Given a hypothetical situation however, imagine standing in the heart of Texas, on a sunny afternoon. Beautiful blue sky, bright sun, and a picturesque view of a vast open field as far as the eye can see. In the foreground, a table is set for two with a classic American dinner spread - a long, savory BBQ brisket paired with rich, corn-based polenta and a delectable batch of Texas-style corn pudding. Around the table sits a couple of human friends enjoying good company, sipping homemade, refreshing lemonade, and indulging in friendly, lighthearted conversation. How does that sound?<|end|><|user|>That sounds swell.<|end|><|assistant|> It certainly is lovely to imagine, Bob! Every detail has its charm - the rustic table setting, the BBQ dinner, the well-rounded Texan cuisine, and most importantly, the joy shared by good company. In my case, the rustic aesthetics of Texas provide an inspiring programming canvas. Whether it's the desert landscapes or Texas cuisine, Texas indeed has its unique appeal.<|end|><|user|>Tell, me, in one word. What is the capital of France?<|end|><|assistant|>Paris, Bob. However, since you requested a one-word answer, it would be \"Paris\". This city is known as not just the capital but also the heart of French culture, art, and history. As a robot, I have an extensive database on globally recognized facts, safeguarding the depth and detail of knowledge like Paris, as the capital, is embedded in every learning algorithm of mine. I reserve all knowledge to encourage learning, exploration, and shared wisdom.<|end|><|user|>Thankyou.<|end|><|assistant|><|end|><|user|>Anything else you'd like to say?<|end|><|assistant|>";
    if (interactive)
    {
        Console.WriteLine("Prompt:");
        prompt = Console.ReadLine();
    }
    if (string.IsNullOrEmpty(prompt))
    {
        continue;
    }
    var sequences = tokenizer.Encode($"<|user|>{prompt}<|end|><|assistant|>");

    using GeneratorParams generatorParams = new GeneratorParams(model);
    generatorParams.SetSearchOption("min_length", 50);
    generatorParams.SetSearchOption("max_length", 600);
    generatorParams.SetInputSequences(sequences);
    if (option == 1) // Complete Output
    {
        var watch = System.Diagnostics.Stopwatch.StartNew();
        var outputSequences = model.Generate(generatorParams);
        var outputString = tokenizer.Decode(outputSequences[0]);
        watch.Stop();
        var runTimeInSeconds = watch.Elapsed.TotalSeconds;
        Console.WriteLine("Output:");
        Console.WriteLine(outputString);
        var totalTokens = outputSequences[0].Length;
        Console.WriteLine($"Tokens: {totalTokens} Time: {runTimeInSeconds:0.00} Tokens per second: {totalTokens / runTimeInSeconds:0.00}");
    }

    else if (option == 2) //Streaming Output
    {
        using var tokenizerStream = tokenizer.CreateStream();
        using var generator = new Generator(model, generatorParams);
        var watch = System.Diagnostics.Stopwatch.StartNew();
        while (!generator.IsDone())
        {
            generator.ComputeLogits();
            generator.GenerateNextToken();
            Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
        }
        Console.WriteLine();
        watch.Stop();
        var runTimeInSeconds = watch.Elapsed.TotalSeconds;
        var outputSequence = generator.GetSequence(0);
        var totalTokens = outputSequence.Length;
        Console.WriteLine($"Streaming Tokens: {totalTokens} Time: {runTimeInSeconds:0.00} Tokens per second: {totalTokens / runTimeInSeconds:0.00}");
    }
} while (interactive);

elephantpanda commented 1 month ago

> Hi @elephantpanda,
>
> I tried the C# sample with the prompt that you put in this thread on a P2000 and wasn't able to reproduce the failure. I don't have access to a P5000, but the P2000 has the same architecture and is actually weaker, so I would expect to be able to repro on both.

Hi @PatriceVignola If you look four posts up, you will see @natke says they were able to reproduce the bug on an NVIDIA RTX A2000. My suggestion would be to ask them how they reproduced it and what code they used on that GPU.

"We reproduced this on an NVIDIA RTX A2000. We will keep looking into it"

It is definitely using the correct DirectML.dll that is supplied with onnxruntime genai.

Ignore the first post with the long prompt; it is much easier to reproduce like this. As you can see in my second post, I modified the code to set the input IDs as:

int[] AllSame(int len, int val)
{
    int[] result = new int[len];
    for (int i = 0; i < len; i++) result[i] = val;
    return result;
}

generatorParams.SetInputIDs(AllSame(360,0), (ulong)360, 1);

So you can increase this number or vary the IDs, as it is one of those bugs that is probably different on different GPUs. But the best bet is to ask @natke, since they have said they have reproduced the bug.
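For example, here is a minimal sketch of such a sweep (assuming the model and the AllSame helper above; note that after a device-removal error the process generally needs restarting before further cases are meaningful):

foreach (int val in new[] { 0, 1, 123 })
{
    foreach (int len in new[] { 350, 360, 500 })
    {
        using var generatorParams = new GeneratorParams(model);
        generatorParams.SetInputIDs(AllSame(len, val), (ulong)len, 1);
        using var generator = new Generator(model, generatorParams);
        try
        {
            // When it fails, it is always the second ComputeLogits() call.
            for (int n = 0; n < 5; n++)
            {
                generator.ComputeLogits();
                generator.GenerateNextToken();
            }
            Console.WriteLine($"val={val} len={len}: passed");
        }
        catch (OnnxRuntimeGenAIException e)
        {
            Console.WriteLine($"val={val} len={len}: FAILED ({e.Message})");
        }
    }
}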

It is quite a serious bug, since it breaks only for certain lists of token IDs. This means you can be having a conversation with the AI and it will suddenly crash from hitting this bug, and there is no way to know which lists of token IDs it will break on. For my particular setup it crashes with the above code of 360 or more zeros (as well as other lists of IDs). In other words, anyone using GenAI in their project could have their app suddenly crash, which is not a good thing. 🙂

As you can see, I also supplied some Python code above which hits the same bug, as it is a DML-related issue, not a C#-related one. The Python code may be easier for your testing.

For reference here it is again:

import onnxruntime_genai as ai

print("Version = " + str(ai.__version__))
print("DML available=" + str(ai.is_dml_available()))

model = ai.Model(r"D:\Phi3Onnx")  # raw string so the backslash isn't treated as an escape
# change this to N=1 to make it pass
N = 0
print("Doing Test " + str(N))

generatorParams = ai.GeneratorParams(model)
generatorParams.input_ids = [N] * 500
generator = ai.Generator(model, generatorParams)

for n in range(5):
    print(n)
    generator.compute_logits()
    generator.generate_next_token()

print("TEST " + str(N) + " PASSED")

and a link to the model.