microsoft / semantic-kernel

Integrate cutting-edge LLM technology quickly and easily into your apps
https://aka.ms/semantic-kernel

.Net: Delay after a semantic function has been called #6441

Closed azure-claudiu closed 3 months ago

azure-claudiu commented 3 months ago

When calling Semantic Kernel with InvokePromptAsync and ToolCallBehavior.AutoInvokeKernelFunctions, the debug traces show that the correct function is called and how long it took to produce its result, but the library does not return to the calling code immediately: the total time traced is much higher. See below for the duration of the function call (3.755 s) versus the total time it took to return to the caller (22.634 s).

How can this time be reduced? Is this time spent in the Semantic Kernel library, or waiting on other calls? The observability traces do not show what is happening in between. This creates the perception that the app/chatbot is slow when using Semantic Kernel. A chat completion call directly to the LLM is very fast in comparison.

To Reproduce
Steps to reproduce the behavior:

  1. Register at least 2 plugins (with one function each).
  2. Craft a prompt that would end up calling one of the functions.
  3. Call await kernel.InvokePromptAsync(prompt, new KernelArguments(new OpenAIPromptExecutionSettings { ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions })) (a minimal sketch follows this list).
  4. Observe from debug traces the correct function gets called, and its duration
  5. Observe from debug traces that more time passes before the call returns to the caller
  6. Observe from debug traces the total duration is much larger than the time reported for the function call
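
For reference, here is a minimal sketch of such a setup, assuming an Azure OpenAI chat completion connector; the plugin class, function, deployment name, endpoint, and key below are placeholders, not the reporter's actual code:

using System;
using System.ComponentModel;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// Build a kernel with an Azure OpenAI chat completion service and a couple of plugins.
var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion("my-deployment", "https://my-resource.openai.azure.com/", "my-api-key"); // placeholder values
builder.Plugins.AddFromType<TextPlugin>();   // a second plugin would be registered the same way
var kernel = builder.Build();

// Step 3: invoke a prompt with automatic function calling enabled.
var settings = new OpenAIPromptExecutionSettings { ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions };
var result = await kernel.InvokePromptAsync("Write a few sentences about observability.", new KernelArguments(settings));
Console.WriteLine(result);

// Placeholder plugin standing in for the reporter's real plugins.
public sealed class TextPlugin
{
    [KernelFunction, Description("Generates a few sentences about a topic.")]
    public string GenFewSentences(string topic) => $"A few sentences about {topic}.";
}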

Expected behavior
A time delay does not occur, or a trace explains what is being waited on.

Example output
Observe below the difference between the function call duration (3.755 s) and the total duration (22.634 s).

GenFewSentences: Information: Function GenFewSentences succeeded.
GenFewSentences: Trace: Function result: [REDACTED].
GenFewSentences: Information: Function completed. Duration: 3.755162s
Microsoft.SemanticKernel.Connectors.OpenAI.AzureOpenAIChatCompletionService: Information: Prompt tokens: 404. Completion tokens: 51. Total tokens: 455.
Microsoft.SemanticKernel.KernelFunctionFactory: Information: Prompt tokens: 404. Completion tokens: 51.
InvokePromptAsync_061e7ab889354c0ca2da5604ff6b3b97: Information: Function InvokePromptAsync_061e7ab889354c0ca2da5604ff6b3b97 succeeded.
InvokePromptAsync_061e7ab889354c0ca2da5604ff6b3b97: Trace: Function result: [REDACTED].
InvokePromptAsync_061e7ab889354c0ca2da5604ff6b3b97: Information: Function completed. Duration: 22.6347055s

Platform

Additional context
No planner has been used. Automatic invocation of kernel functions was requested as part of the InvokePromptAsync call.

dmytrostruk commented 3 months ago

@azure-claudiu Thanks for reporting this issue! 22.634 seconds is definitely a lot, but I cannot reproduce this issue locally using a console application.

May I ask you to run Debug > Performance Profiler with Instrumentation enabled in Visual Studio? https://learn.microsoft.com/en-us/visualstudio/profiling/profiling-feature-tour?view=vs-2022#instrumentation

This should output a breakdown for each C# function, its execution time, and the number of calls.

With this view, you should be able to see where exactly there is a delay in execution.

stephentoub commented 3 months ago

Observe below the difference between the function call duration (3.755 s) and the total duration (22.634 s).

After GenFewSentences is invoked, it's going to make a new request back to the service with the full existing chat history plus the result of GenFewSentences. Isn't the time after GenFewSentences just that secondary roundtrip to the model, including its full generation of all tokens in the response?

dmytrostruk commented 3 months ago

Isn't the time after GenFewSentences just that secondary roundtrip to the model, including its full generation of all tokens in the response?

That's true, the second roundtrip may be a problem. @azure-claudiu If you don't want to send a second request to the LLM and instead want to return the function result immediately, you can terminate the process with auto function invocation filters, as in this example: https://github.com/microsoft/semantic-kernel/blob/7cb651a809a34ed8cbb35d7695db182382024ec7/dotnet/samples/Concepts/Filtering/AutoFunctionInvocationFiltering.cs#L79

This will return the function result as-is, which should improve execution time. But the result won't be formatted or rephrased by the AI, so it depends on whether a plain function result works in your scenario.
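
For illustration, a minimal sketch of such a filter, assuming the IAutoFunctionInvocationFilter interface shown in the linked sample (the class name and the registration line are placeholders):

using System;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;

public sealed class TerminateAfterFirstFunctionFilter : IAutoFunctionInvocationFilter
{
    public async Task OnAutoFunctionInvocationAsync(AutoFunctionInvocationContext context, Func<AutoFunctionInvocationContext, Task> next)
    {
        // Run the requested function as usual.
        await next(context);

        // Skip any remaining function calls and the final request to the LLM;
        // the result of the last executed function is returned to the caller as-is.
        context.Terminate = true;
    }
}

// Registration when building the kernel, for example:
// builder.Services.AddSingleton<IAutoFunctionInvocationFilter, TerminateAfterFirstFunctionFilter>();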

azure-claudiu commented 3 months ago

I see; I wasn't aware of a second request. I tried the IAutoFunctionInvocationFilter approach and it seems to work! Now the kernel invoke returns the result generated by the function (1) immediately and (2) unchanged. This also explains why the output of the function was previously not returned verbatim but slightly modified (rephrased), which was undesirable for our use case.

The difference between the function call duration and the total duration is now smaller, and it's only due to the time spent before calling the function.

Thanks for clarifying this behavior and providing the solution!

azure-claudiu commented 3 months ago

I have a question though. By setting context.Terminate = true I am preventing a secondary call to the LLM (which I want) but I'm also preventing a secondary call to one of my functions (which I don't want). Is there a way to just disable this non-plugin call to the LLM?

azure-claudiu commented 3 months ago

From the documentation of the Terminate property:

As soon as it's set to true, the remaining functions won't be executed and last request to LLM won't be performed.
Automatic function invocation process will be terminated and result of last executed function will be returned to the caller.

I guess I'd like the remaining functions to be executed, but the last request to the LLM to be skipped.

stephentoub commented 3 months ago

but I'm also preventing a secondary call to one of my functions (which I don't want)

What secondary call are you referring to? Can you elaborate on the scenario?

With function calling, the client gives the chat history to the model along with a list of functions the model is able to request be invoked if it deems that valuable. The model/service then sends back a response, which is either the final response or an intermediate request for a function to be invoked. In the latter case, SK will invoke the function and send back the results. At which point the model may send back a final response or may ask for another function to be invoked. And so on. After invoking a function and before sending back the results, the client has no idea whether the model will request another function to be invoked or not.

azure-claudiu commented 3 months ago

Certainly. As a generic example, let's assume I have 2 functions: GenerateHaiku and ValidateHaiku.

I noticed this flow when using the prompt:
"Generate a haiku" -> LLM -> GenerateHaiku -> LLM -> back to caller
That second call to the LLM doesn't return the output of the GenerateHaiku function as-is, but processed (changed, rephrased), and it also adds extra time. Since I don't want that behavior, I can use context.Terminate = true and the output of GenerateHaiku is returned immediately and unchanged to the caller.

But if I wanted to use the prompt:
"Generate a haiku and validate it" -> LLM -> GenerateHaiku -> LLM -> ValidateHaiku -> LLM -> back to caller
The last call to the LLM is again not desired because of the extra processing it might do to the output of the ValidateHaiku function, and the extra time it takes. If I use context.Terminate = true when filtering the GenerateHaiku call, I will prevent the call to ValidateHaiku from taking place.

Would there be a way to avoid the processing of that last LLM call? I noticed I can't tell from the auto invocation filter that we're dealing with the last function call (i.e. context.FunctionCount == 1 always).

I can provide logs of the actual "processing" done after the function call, if needed. I would just have to redact them a bit.

stephentoub commented 3 months ago

Would there be a way to avoid the processing of that last LLM call? I noticed I can't tell from the auto invocation filter that we're dealing with the last function call (i.e. context.FunctionCount == 1 always).

We don't know if it's the last. The model can always ask for additional functions to be invoked as it sees fit. Only you have the context to know whether you consider it to be done or not. So, for example, if you want to always end after ValidateHaiku is called, you can check which function is being invoked and terminate if it's ValidateHaiku.
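
For illustration, a sketch of a filter along those lines, using the hypothetical ValidateHaiku name from the example above (not an actual SK sample):

using System;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;

public sealed class TerminateAfterValidateHaikuFilter : IAutoFunctionInvocationFilter
{
    public async Task OnAutoFunctionInvocationAsync(AutoFunctionInvocationContext context, Func<AutoFunctionInvocationContext, Task> next)
    {
        // Invoke the requested function (GenerateHaiku, ValidateHaiku, ...).
        await next(context);

        // Only terminate once the function you consider "final" has run;
        // earlier results still go back to the model so it can request the next function.
        if (context.Function.Name == "ValidateHaiku")
        {
            context.Terminate = true;
        }
    }
}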

There's just a loop here like the following pseudo-code:

while (true)
{
    var response = SendRequest(history);

    if (response.HasFunctionCallRequests)
    {
        var results = PerformFunctionCalls(response);
        UpdateHistory(history, response, results);
        continue;
    }

    return response;
}

SK has that loop built in for you to use optionally, but if you don't use it, you'd just need the same loop in your own code. Whatever condition you'd use to decide to break out of the loop is what you check for in the filter to set Terminate.
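
For reference, a rough runnable version of that loop written against SK's chat completion API. This is a sketch under assumptions: it presumes a recent SK version that exposes the FunctionCallContent helpers, uses ToolCallBehavior.EnableKernelFunctions so functions are advertised to the model but not auto-invoked, and assumes kernel is an already-built kernel with the relevant plugins registered:

using System;
using System.Linq;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// `kernel` is assumed to be an already-built kernel with the haiku plugins registered.
var chatService = kernel.GetRequiredService<IChatCompletionService>();

// EnableKernelFunctions advertises the functions to the model but leaves invocation to us.
var settings = new OpenAIPromptExecutionSettings { ToolCallBehavior = ToolCallBehavior.EnableKernelFunctions };

var history = new ChatHistory();
history.AddUserMessage("Generate a haiku and validate it.");

while (true)
{
    var response = await chatService.GetChatMessageContentAsync(history, settings, kernel);

    // Did the model ask for one or more functions to be invoked?
    var functionCalls = FunctionCallContent.GetFunctionCalls(response).ToArray();
    if (functionCalls.Length == 0)
    {
        // No more function calls: this is the model's final answer, and the place
        // where your own code decides it is done.
        Console.WriteLine(response);
        break;
    }

    // Record the model's request, invoke each requested function, and feed the results back.
    history.Add(response);
    foreach (var functionCall in functionCalls)
    {
        var functionResult = await functionCall.InvokeAsync(kernel);
        history.Add(functionResult.ToChatMessage());
    }
}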

azure-claudiu commented 3 months ago

I think we can close this issue. I will experiment more with function composability, but the ability to return immediately after a function call solves my immediate problem. Thank you!