microsoft / semantic-kernel

Integrate cutting-edge LLM technology quickly and easily into your apps
https://aka.ms/semantic-kernel
MIT License
21.27k stars 3.12k forks source link

.Net: HttpClients are hold as captive dependencies / singletons #4400

Closed gingters closed 8 months ago

gingters commented 8 months ago

This is a follow-up to #790

Describe the bug

Semantic Kernel captures instances of the HttpClient class in singletons.

See https://github.com/microsoft/semantic-kernel/blob/main/dotnet/src/Connectors/Connectors.OpenAI/OpenAIServiceCollectionExtensions.cs#L55-L59

This will become a massive problem when used in long-running services. The reason is, that the HttpClient holds an internal instance of a HttpMessageHandler. The HttpMessageHandler is not designed to be used for a prolonged amount of time. In .NET in the HttpClientFactory instances of HttpMessageHandles are pooled and re-used for transient(!) usages of HttpClient instances - with a default maximum lifetime of only 2 minutes for the inner handler.

The reason is, that a TCP connection to a server under some circumstances be held open forever. Even if the HTTP transport is not working anymore.

If, for some reason, the DNS information for the configured endpoint changes, or the connection goes bad (i.e. terminates at a proxy and something changes behind the proxy), the HttpMessageHandler instance is not able to recover from that. As such, the instance of the HttpClient that uses this inner handler is rendered unusable for the rest of its lifetime. As a singleton this is for the lifetime of the process.

There is no way to recover from such a problem other than restarting the process.

To Reproduce Use a LoadBalancer in your network (i.e. Big5). "Hide" at least two Azure Open AI services (or the Open AI API) behind this loadbalancer. Make one of these instance active and the other inactive.

Set up any project with Semantic Kernel and an Open AI connector and point it to the loadbalancer/proxy instead of the direct endpoint.

While the project runs, switch the LB to use the configured second instance and switch the first one to inactive. All existing connections to the LB will still be routed to the now inactive first instance and these TCP connections are now broken until a new TCP connection is established (which will be balanced to the second instance).

The SK project will not be able to recover from that and cannot execute calls to the LLM anymore. You will have to stop and restart the process hosting the Kernel to recover from that.

Expected behavior For each call, a new instance of an HttpClient should be retrieved from an IHttpClientFactory instead of capturing an HttpClient in a singleton. In the scenario above, this will be able to automatically recover from such an error at max after 2 minutes when the old broken inner HttpMessageHandler instances are discarded and new Handlers are created.

Platform

dmytrostruk commented 8 months ago

Hi @gingters ! Is it possible to reproduce an issue when you setup PooledConnectionLifetime property? https://learn.microsoft.com/en-us/dotnet/fundamentals/networking/http/httpclient-guidelines#dns-behavior

It looks like it's still recommended to use singleton HttpClient instance with PooledConnectionLifetime property, which should solve both the port exhaustion and DNS changes problems: https://learn.microsoft.com/en-us/dotnet/fundamentals/networking/http/httpclient-guidelines#recommended-use

image

Please let me know if I'm missing something. Thanks a lot!

gingters commented 8 months ago

I ran into that issue with the Chat Copilot app, which uses the default way of registration, in docker, when my notebook switched networks (from wired to wifi, and back). Seems from the point of view of the message handler the tcp connection still seems active (for whatever reason), but the outgoing requests didn't made it through the other network. Also not after 2 minutes. It always ended up in timeouts.

Just to confirm it wasn't an issue with my docker/container environment I shelled in the container and started a new instance of the api on another port, and that other instance could connect to the Open AI api without a problem.

When developing and debugging locally, that a) usually seldom happens and b) isn't a big deal there.

But I thought a connection hiccup when running SK in production (i.e. some changes to a docker network where the backend service runs) etc. and the connections not recovering at all might become a problem, so I reported this.

dmytrostruk commented 8 months ago

@gingters Thank you for reporting this issue!

Maybe it's worth to report similar issue in Chat Copilot repo to see, if something can be improved to handle such scenario, for example change a way how HttpClient is registered.

Tagging @stephentoub to double-check if there is something that can be improved for HttpClient usage on Semantic Kernel side.

Thanks again!

SergeyMenshykh commented 8 months ago

Hi @gingters, today we can't use IHttpClientFactory because the library SK uses to access both OpenAI and AzureOpenAI services does not support it. So, for now, SK consumer code has a few options:

These options should be sufficient for now until either the library supports IHttpClientFactory or .NET fixes HttpClient so that "it just works" regardless of how long you keep its instance. For more details, see the Consider updating HttpClientFactory... issue.

SergeyMenshykh commented 8 months ago

I’m closing the issue since there’s not much we can do at the moment except those options listed above.