microsoft / semantic-kernel


.Net Slow response time #4139

Closed Alerinos closed 10 months ago

Alerinos commented 10 months ago

I upgraded from beta-8 to rc-3. After the upgrade, content generation time increased roughly 10x. On top of that, every time I execute the prompt function I get a memory leak (RAM usage keeps growing). [screenshot: memory usage]

My code:

        var variables = new KernelArguments
        {
            ["Topic"] = topic,
        };

        var getIntentFunction = _context.Kernel
            .CreateFunctionFromPrompt(Prompt, requestSettings, "GetIntent");

        var result = await _context.Kernel
            .InvokeAsync(getIntentFunction, variables);

I tested with GPT-3.5 Turbo.

Alerinos commented 10 months ago

I have the impression that the longer the input, the longer the response time. This wasn't the case before, and I don't have this problem when using another library either. My input can even reach 2k tokens.

GPT-4 Turbo
Time: 00:00:28.1610721 Token: 172 Prompt: 176
Time: 00:00:52.4367247 Token: 787 Prompt: 271
Time: 00:00:30.0494350 Token: 680 Prompt: 1053
Time: 00:00:47.9093970 Token: 696 Prompt: 1734
Time: 00:00:49.1772074 Token: 900 Prompt: 2431
matthewbolanos commented 10 months ago

Yes, the longer the input, the longer the response time because it takes time for the model to process tokens. That being said... you shouldn't be seeing a memory leak.

crickman commented 10 months ago

@Alerinos - I will proceed with the memory-leak analysis and assume some values for aiSettings and Prompt, using both the RC3 NuGet package and the current repo state.

Could you please also share if you are targeting OpenAI hosted models or an Azure OpenAI deployment?
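
For reference, the builder configuration differs between the two. A minimal sketch, where the model name, deployment name, endpoint, and key are placeholders:

    // OpenAI hosted models (api.openai.com)
    var openAiKernel = Kernel.CreateBuilder()
        .AddOpenAIChatCompletion(modelId: "gpt-3.5-turbo", apiKey: "<openai-api-key>")
        .Build();

    // Azure OpenAI deployment
    var azureKernel = Kernel.CreateBuilder()
        .AddAzureOpenAIChatCompletion(
            deploymentName: "<deployment-name>",
            endpoint: "https://<resource>.openai.azure.com/",
            apiKey: "<azure-api-key>")
        .Build();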

crickman commented 10 months ago

I've run in a couple of modes:

  1. Kernel with function calling: no completion endpoint or prompt-based functions

     [screenshot: memory profiler snapshot]

  2. Kernel with completion: calling the OpenAI endpoint with prompt-based functions

     [screenshot: memory profiler snapshot]

Both of these cases create multiple kernels and call multiple functions, and they appear to reach a steady state with respect to memory management.

If you have specific steps that lead to a memory leak, I'd be happy to adjust my approach.

One thing I might try is to initialize your function only once and then reuse it multiple times:

   var getIntentFunction = _context.Kernel.CreateFunctionFromPrompt(Prompt, requestSettings, "GetIntent");
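For example (a rough sketch; topics stands in for whatever drives your repeated calls):

    // Reuse the single getIntentFunction instance created above for every
    // invocation; only the arguments change per call.
    foreach (var topic in topics)
    {
        var result = await _context.Kernel.InvokeAsync(
            getIntentFunction,
            new KernelArguments { ["Topic"] = topic });

        Console.WriteLine(result);
    }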
matthewbolanos commented 10 months ago

@crickman, can you share the code you used for your test to see if @Alerinos, can run it on his side to validate that there aren't memory leaks?

crickman commented 10 months ago

Sure, I started by working in the repo to evaluate the current state: I tweaked KernelSyntaxExample 9 (Example09_FunctionTypes, shown below) and ran it in the performance profiler.

The profiler is able to show dead objects and where they are pinned.
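
As a lighter-weight alternative to the profiler, one could also log the managed-heap size around each batch of invocations. A sketch only; GC.GetTotalMemory with a forced collection is coarse, but it is enough to spot unbounded growth if the heap does not return to roughly the same size between batches:

    // Log managed-heap size after a forced, blocking collection.
    static void LogManagedMemory(string label)
    {
        long bytes = GC.GetTotalMemory(forceFullCollection: true);
        Console.WriteLine($"{label}: {bytes / (1024 * 1024)} MB managed heap");
    }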

I suspect the original memory profile is affected by additional memory pressure from:

  1. Repeated function creation (expensive)
  2. Other implementation-specific details

I am curious whether the OP's endpoint is OpenAI or Azure OpenAI.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;
using RepoUtils;

public static class Example09_FunctionTypes
{
    public static async Task RunAsync()
    {
        Console.WriteLine($"Process ID: {Environment.ProcessId}");

        // Pick one:

        // core kernel...no services
        await RunStressAsync(150, StressEmptyAsync);

        // chat completion only
        await RunStressAsync(30, StressCompletionAsync);

        // function calling only
        await RunStressAsync(150, StressFunctionCallingAsync);
    }

    private static async Task RunStressAsync(int iterationCount, Func<Task> payloadAsync)
    {
        Console.ReadLine(); // Pause for snapshot

        var count = 0;
        var shouldExit = false;

        while (!shouldExit)
        {
            Console.WriteLine($"#### {count + 1}");

            await payloadAsync.Invoke();
            await Task.Delay(300);

            ++count;

            shouldExit = (count >= iterationCount);
        }

        ReleaseMemory();
        ReleaseMemory();
        ReleaseMemory();

        Console.ReadLine(); // Pause for snapshot
    }

    private static Task StressEmptyAsync()
    {
        var kernel = Kernel.CreateBuilder().Build();

        Console.WriteLine(kernel.Plugins.Count);

        return Task.CompletedTask;
    }

    private static async Task StressCompletionAsync()
    {
        var kernel =
            Kernel
                .CreateBuilder()
                .AddOpenAIChatCompletion(TestConfiguration.OpenAI.ChatModelId, TestConfiguration.OpenAI.ApiKey)
                .Build();

        string folder = RepoFiles.SamplePluginsPath();
        kernel.ImportPluginFromPromptDirectory(Path.Combine(folder, "SummarizePlugin"));

        var plugin = new LocalExamplePlugin();

        await Task.WhenAll(
            plugin.Type04Async(kernel, $"wut be prime: {Random.Shared.Next()}"),
            plugin.Type04Async(kernel, $"is {Random.Shared.Next()} a prime number"),
            plugin.Type04Async(kernel, $"I'm going on a trip to Paris next month on the 3rd for {Random.Shared.Next(3, 8)} days."),
            plugin.Type04Async(kernel, $"do you know {Random.Shared.Next(2, 4)} \"fun facts\" about honey bees"),
            plugin.Type04Async(kernel, $"What slang was popular in 19{Random.Shared.Next(5, 9)}7"));
    }

    private static async Task StressFunctionCallingAsync()
    {
        var kernel = Kernel.CreateBuilder().Build();

        var plugin = kernel.ImportPluginFromType<LocalExamplePlugin>("test");

        for (int i = 0; i < 100; ++i)
        {
            await kernel.InvokeAsync(plugin["type01"]);
            await kernel.InvokeAsync(kernel.Plugins["test"]["type01"]);

            await kernel.InvokeAsync(plugin["type02"]);
            await kernel.InvokeAsync(kernel.Plugins["test"]["type02"]);

            await kernel.InvokeAsync(plugin["type03"]);
            await kernel.InvokeAsync(kernel.Plugins["test"]["type03"]);

            await kernel.InvokeAsync(plugin["type05"]);
            await kernel.InvokeAsync(kernel.Plugins["test"]["type05"]);

            await kernel.InvokeAsync(plugin["type06"]);
            await kernel.InvokeAsync(kernel.Plugins["test"]["type06"]);

            await kernel.InvokeAsync(plugin["type07"]);
            await kernel.InvokeAsync(kernel.Plugins["test"]["type07"]);

            await kernel.InvokeAsync(plugin["type08"]);
            await kernel.InvokeAsync(kernel.Plugins["test"]["type08"]);

            await kernel.InvokeAsync(plugin["type09"]);
            await kernel.InvokeAsync(kernel.Plugins["test"]["type09"]);

            await kernel.InvokeAsync(plugin["type10"]);
            await kernel.InvokeAsync(kernel.Plugins["test"]["type11"]);
        }
    }

    private static void ReleaseMemory()
    {
        Task.Delay(1000).Wait();

        // Force a blocking, compacting collection of each generation so the
        // post-run snapshot reflects only objects that are still rooted.
        for (int generation = 0; generation <= 2; generation++)
        {
            Console.WriteLine($"GC: {generation}");
            GC.Collect(generation, GCCollectionMode.Forced, blocking: true, compacting: true);
        }
    }
}

public class LocalExamplePlugin
{
    [KernelFunction]
    public void Type01()
    {
        Console.WriteLine("Running function type 1");
    }

    [KernelFunction]
    public string Type02()
    {
        Console.WriteLine("Running function type 2");
        return "";
    }

    [KernelFunction]
    public async Task<string> Type03Async()
    {
        await Task.Delay(0);
        Console.WriteLine("Running function type 3");
        return "";
    }

    [KernelFunction]
    public async Task<string> Type04Async(Kernel kernel, string input)
    {
        var summary = await kernel.InvokeAsync(kernel.Plugins["SummarizePlugin"]["Summarize"], new() { ["input"] = input });
        Console.WriteLine($"Running function type 4 [{summary}]");
        return "";
    }

    [KernelFunction]
    public void Type05(string x)
    {
        Console.WriteLine("Running function type 5");
    }

    [KernelFunction]
    public string Type06(string x)
    {
        Console.WriteLine("Running function type 6");
        return "";
    }

    [KernelFunction]
    public async Task<string> Type07Async(string x)
    {
        await Task.Delay(0);
        Console.WriteLine("Running function type 07");
        return "";
    }

    [KernelFunction]
    public async Task Type08Async(string x)
    {
        await Task.Delay(0);
        Console.WriteLine("Running function type 08");
    }

    [KernelFunction]
    public async Task Type09Async()
    {
        await Task.Delay(0);
        Console.WriteLine("Running function type 09");
    }

    [KernelFunction]
    public FunctionResult Type10()
    {
        Console.WriteLine("Running function type 10");
        return new FunctionResult(KernelFunctionFactory.CreateFromMethod(() => { }));
    }

    [KernelFunction]
    public async Task<FunctionResult> Type11Async()
    {
        await Task.Delay(0);
        Console.WriteLine("Running function type 10");
        return new FunctionResult(KernelFunctionFactory.CreateFromMethod(() => { }));
    }
}
Alerinos commented 10 months ago

@crickman

Thank you for engaging with this. I am using the OpenAI API with the GPT-3.5 Turbo and GPT-4 models. My system has a procedure consisting of multiple queries to OpenAI:

  1. It generates the initial content.
  2. It waits for the result, then generates additional content four more times.
  3. Based on the combined result, it generates additional content two more times.

In total, I make up to 10 queries to the API before I get the final correct result. Unfortunately, I can't do it in one query because the AI is not able to handle that many things at once.

I just ran a test and started 10 processes (10 x 10 = 100 queries to the API). RAM usage jumped from a starting 170 MB to 500 MB.

Each time a process starts, I create my own KernelBuilder context. Maybe a Dispose would be useful?

crickman commented 10 months ago

Thanks for the information. One clarification: I'm not suggesting that you perform only one query / chat completion. I'm suggesting that each prompt function be initialized only once (as a singleton).
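
To illustrate that suggestion, a rough sketch only (the prompt templates, settings, and pipeline shape are placeholders for the multi-step flow described above); the kernel and each prompt function are created once and then reused for every query:

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.SemanticKernel;

    public class ContentPipeline
    {
        // Placeholder prompt templates; substitute the real ones.
        private const string DraftPrompt = "Write a draft about {{$Topic}}.";
        private const string CombinePrompt = "Combine these drafts into one article:\n{{$Drafts}}";

        private readonly Kernel _kernel;
        private readonly KernelFunction _generateDraft;
        private readonly KernelFunction _combineDrafts;

        public ContentPipeline(Kernel kernel, PromptExecutionSettings settings)
        {
            _kernel = kernel;

            // Created once (per application lifetime) and reused for every request.
            _generateDraft = _kernel.CreateFunctionFromPrompt(DraftPrompt, settings, "GenerateDraft");
            _combineDrafts = _kernel.CreateFunctionFromPrompt(CombinePrompt, settings, "CombineDrafts");
        }

        public async Task<string> RunAsync(string topic)
        {
            var drafts = new List<string>();

            // Several intermediate generations for the same topic.
            for (int i = 0; i < 4; i++)
            {
                var draft = await _kernel.InvokeAsync(_generateDraft, new() { ["Topic"] = topic });
                drafts.Add(draft.ToString());
            }

            // Combine the intermediate results into the final output.
            var final = await _kernel.InvokeAsync(
                _combineDrafts,
                new() { ["Drafts"] = string.Join("\n---\n", drafts) });

            return final.ToString();
        }
    }

In an ASP.NET host, the same effect can be achieved by registering the pipeline (or the functions themselves) as singletons in dependency injection.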