microsoft / kernel-memory

RAG architecture: index and query any data using LLM and natural language, track sources, show citations, asynchronous memory patterns.
https://microsoft.github.io/kernel-memory
MIT License

Update AzureOpenAIConfig validation to cater for reverse proxy endpoint #875

Closed alexmg closed 1 week ago

alexmg commented 1 month ago

Motivation and Context (Why the change? What's the scenario?)

The intent of the change in this PR is to allow Azure OpenAI requests to be made over HTTP to a reverse proxy that is running within the same Docker Compose orchestration for local development.

We have a reverse proxy in place for Azure OpenAI that routes requests to different service instances based on available quota. When deployed to Azure, all services can use HTTPS and Kernel Memory can use the reverse proxy for Azure OpenAI requests.

Visual Studio with Docker Compose orchestration is used for local development. The reverse proxy project is started and exposes both HTTP and HTTPS ports. A local dev certificate is used automatically by the tooling and the reverse proxy can be called over HTTPS from the host machine using localhost.

The problem is that the Kernel Memory container is also started as part of the Docker Compose orchestration and uses the public Docker image with a settings file mounted from the host. When Kernel Memory is configured to call the reverse proxy container over HTTPS for Azure OpenAI the requests fail because it does not trust the dev certificate and cannot establish the SSL connection. This is because the dev certificate created by ASP.NET Core is for localhost and not the hostname of the container running within the orchestration.

Installing certificates into the containers within the Docker Compose orchestration without modifying the container image is difficult. I also don't want to build another container image with trust for a self-signed certificate because the same container should be used for development and production. We instead use a Kernel Memory settings file mounted into the Kernel Memory container that is designed for running in the local Docker Compose orchestration. This settings file configures the AzureOpenAIConfig.Endpoint setting with a URL containing the name of the container running the reverse proxy, but the connection cannot be established due to the certificate trust issues mentioned above.
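For context, the mounted settings file looks roughly like this (a sketch only; the section and key names follow the usual KernelMemory appsettings layout, and openai-proxy and the port are made-up placeholders for our reverse proxy container):

{
  "KernelMemory": {
    "Services": {
      "AzureOpenAIText": {
        "Auth": "ApiKey",
        "APIKey": "<api key>",
        // container hostname of the reverse proxy; over HTTPS the connection fails
        // because the dev certificate (CN=localhost) is not trusted by this container
        "Endpoint": "https://openai-proxy:8081/"
      }
    }
  }
}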

Other services do not have the same restrictions applied, making them far easier to run in containers for local development. When running SQL Server in a local container the connection string can have TrustServerCertificate set to True to avoid trust issues with the server certificate. Redis can be run in a container with a connection string that allows access to the container without SSL being enabled. RabbitMQ can even have SSL explicitly disabled using the RabbitMQConfig.SslEnabled property.

Even though the real Azure OpenAI service is only exposed over HTTPS, the Azure.AI.OpenAI library allows for HTTP requests, which I would imagine is to help with local development and testing scenarios such as this. Traffic is not being directed at localhost or 127.0.0.1 because the services are running in containers, but conceptually the scenario is the same because the connections are between containers running within an orchestration on a developer machine.

High level description (Approach, Design)

Update the validation in the AzureOpenAIConfig class:

- First ensure the Endpoint value is a valid Uri and throw a ConfigurationException when invalid
- If the host domain ends with .openai.azure.com and requests are going directly to the Azure OpenAI endpoints, ensure that the scheme is https
- If the host domain is not the public Azure OpenAI endpoint, ensure that the scheme is either http or https
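For illustration, the validation could look roughly like this (a sketch only, not the exact code in this PR; ConfigurationException is the exception type mentioned above and the method name is illustrative):

// Illustrative sketch of the proposed Endpoint validation (not the actual PR code)
public void Validate()
{
    if (!Uri.TryCreate(this.Endpoint, UriKind.Absolute, out Uri? endpoint))
    {
        throw new ConfigurationException("Azure OpenAI: the endpoint is not a valid URI");
    }

    bool isAzureOpenAIHost = endpoint.Host.EndsWith(".openai.azure.com", StringComparison.OrdinalIgnoreCase);

    // Requests going directly to the public Azure OpenAI endpoints must use HTTPS
    if (isAzureOpenAIHost && endpoint.Scheme != Uri.UriSchemeHttps)
    {
        throw new ConfigurationException("Azure OpenAI: the endpoint scheme must be HTTPS");
    }

    // Any other host (e.g. a local reverse proxy) may use HTTP or HTTPS
    if (!isAzureOpenAIHost && endpoint.Scheme != Uri.UriSchemeHttp && endpoint.Scheme != Uri.UriSchemeHttps)
    {
        throw new ConfigurationException("Azure OpenAI: the endpoint scheme must be HTTP or HTTPS");
    }
}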

dluc commented 4 weeks ago

Even when using proxies, sending traffic over HTTP without encryption is not secure. Removing the HTTPS validation on the AzureOpenAIConfig class isn’t something we can merge, as it would compromise our security baseline.

Why not configure the proxy to use HTTPS instead? This would allow you to maintain the necessary security without compromising the routing solution.

The only potential exception could be traffic on 127.0.0.1, but since it appears that traffic is going beyond the host, it’s especially important to keep this protection in place.

alexmg commented 3 weeks ago

Hi @dluc. My apologies, I was in a rush to open the PR and did not fully explain the scenario, which is something I should have done for a change such as this. Hopefully, the details in the updated description better describe the problem and a workable solution can be found. The situation is like using 127.0.0.1 except the services are running within containers on the local machine, and not directly on the host.

dluc commented 3 weeks ago

Hi @dluc. My apologies, I was in a rush to open the PR and did not fully explain the scenario, which is something I should have done for a change such as this. Hopefully, the details in the updated description better describe the problem and a workable solution can be found. The situation is like using 127.0.0.1 except the services are running within containers on the local machine, and not directly on the host.

Hi @alexmg, thanks for the clarification, and no worries about the rush; the additional context is useful.

I’d suggest adjusting the code to allow "http" only when using 127.0.0.1. If the proxy is bound to any other IP or hostname, enforcing SSL would be best. While it adds minimal computational overhead, the added security is well worth it.

Even though your proxies are local, the code itself can’t verify that. For example, consider a proxy on 192.168.1.10, running on the same host. You know it’s safe because you’re familiar with the network topology. But someone else might set up the same IP on a different machine, creating a potential risk where LLM traffic could be intercepted or manipulated without any indication in the code.

alexmg commented 3 weeks ago

I’d suggest adjusting the code to allow "http" only when using 127.0.0.1. If the proxy is bound to any other IP or hostname, enforcing SSL would be best. While it adds minimal computational overhead, the added security is well worth it.

This will not help when running Kernel Memory and its dependent services in containers for local development. The containers communicate with each other using a hostname visible only within the private network. Limiting "http" to loopback would only be useful when running all services on the host machine. Running all services in a bridge network in Docker on the host machine has the benefit of network isolation.

I don't think there is a reliable way to know that the container is running within Docker Compose launched from Visual Studio. When adding containerized projects to the orchestration I modify the docker-compose.override.yml file to include an environment variable called RUNNING_IN_DOCKER_COMPOSE, but this is not something that is included automatically by Visual Studio. The official dotnet images have the DOTNET_RUNNING_IN_CONTAINER environment variable but this only lets you know you are running in a container and not under what circumstances. The person configuring the service knows where it will be running and should be able to configure things accordingly.

I'm not worried about the additional computational overhead of SSL, it's the management of certificates within all the containers that need to communicate with each other that is difficult. To facilitate the SSL connections self-signed certificates would need to be generated, and the containers forced to trust those certificates, with the fake CA being added to the operating system. With the many different base images and Linux distributions available getting all these certificates in place with the right DNS names would be challenging to say the least. Even with all that in place the SSL connections would not truly be trustworthy, and the traffic between containers would never have left the private network.

I am still confused why the configuration for this one service has these concerns and restrictions while the other services do not. Azure Blob, Azure Queues, MongoDB, Postgresql, Redis, and SQL Server all have connection strings that can contain any desired configuration including hostname. Ollama, OpenAI, Qdrant, S3 all have endpoints that can contain any hostname. These are all friendly to local development with containers because they allow the person configuring the service to make decisions based on the information they have. Placing similar restrictions on these services would only make Kernel Memory harder to work with and less attractive to developers. Running containers for local development with Docker Compose is already a popular option, and with .NET Aspire gaining traction the trend is likely to grow even more.

Even though your proxies are local, the code itself can’t verify that. For example, consider a proxy on 192.168.1.10, running on the same host. You know it’s safe because you’re familiar with the network topology. But someone else might set up the same IP on a different machine, creating a potential risk where LLM traffic could be intercepted or manipulated without any indication in the code.

This is not a problem when running the containers in Docker Compose because they can only see other containers in the same bridge network and resolve them by hostname. There are no other machines running in the network other than the ones you explicitly start there as containers. The network isolation provided by the bridge network is beneficial and I feel more comfortable with HTTP communications taking place there than on the host loopback.

dluc commented 3 weeks ago

I’d suggest adjusting the code to allow "http" only when using 127.0.0.1. If the proxy is bound to any other IP or hostname, enforcing SSL would be best. While it adds minimal computational overhead, the added security is well worth it.

This will not help when running Kernel Memory and its dependent services in containers for local development. The containers communicate with each other using a hostname visible only within the private network. Limiting "http" to loopback would only be useful when running all services on the host machine. Running all services in a bridge network in Docker on the host machine has the benefit of network isolation.

I don't think there is a reliable way to know that the container is running within Docker Compose launched from Visual Studio. If the container is built from a project in the solution a RUNNING_IN_DOCKER_COMPOSE environment variable is added, but this is not the case if the service points to an existing image in a container registry. The official dotnet images have the DOTNET_RUNNING_IN_CONTAINER environment variable but this only lets you know you are running in a container and not under what circumstances. The person configuring the service knows where it will be running and should be able to configure things accordingly.

I'm not worried about the additional computational overhead of SSL, it's the management of certificates within all the containers that need to communicate with each other that is difficult. To facilitate the SSL connections self-signed certificates would need to be generated, and the containers forced to trust those certificates, with the fake CA being added to the operating system. With the many different base images and Linux distributions available getting all these certificates in place with the right DNS names would be challenging to say the least. Even with all that in place the SSL connections would not truly be trustworthy, and the traffic between containers would never have left the private network.

I am still confused why the configuration for this one service has these concerns and restrictions while the other services do not. Azure Blob, Azure Queues, MongoDB, Postgresql, Redis, and SQL Server all have connection strings that can contain any desired configuration including hostname. Ollama, OpenAI, Qdrant, S3 all have endpoints that can contain any hostname. These are all friendly to local development with containers because they allow the person configuring the service to make decisions based on the information they have. Placing similar restrictions on these services would only make Kernel Memory harder to work with and less attractive to developers. Running containers for local development with Docker Compose is already a popular option, and with .NET Aspire gaining traction the trend is likely to grow even more.

Even though your proxies are local, the code itself can’t verify that. For example, consider a proxy on 192.168.1.10, running on the same host. You know it’s safe because you’re familiar with the network topology. But someone else might set up the same IP on a different machine, creating a potential risk where LLM traffic could be intercepted or manipulated without any indication in the code.

This is not a problem when running the containers in Docker Compose because they can only see other containers in the same bridge network and resolve them by hostname. There are no other machines running in the network other than the ones you explicitly start there as containers. The network isolation provided by the bridge network is beneficial and I feel more comfortable with HTTP communications taking place there than on the host loopback.

👍 Thanks for the details about the Docker network. I'll take some time to evaluate the options.

dluc commented 2 weeks ago

@alexmg if we fixed this problem, would that be sufficient?

When Kernel Memory is configured to call the reverse proxy container over HTTPS for Azure OpenAI the requests fail because it does not trust the dev certificate and cannot establish the SSL connection. This is because the dev certificate created by ASP.NET Core is for localhost and not the hostname of the container running within the orchestration.

To address this, we could add a configuration option to define a list of trusted hostnames for which certificate validation would be bypassed. This would enable secure, encrypted traffic while allowing connections to hosts with untrusted SSL certificates, based on a controlled hostname list. Optionally, we could implement cert pinning to further restrict which invalid certificates are allowed.

The config could look something like this:

{
  "KernelMemoery": {
    // ...
    "Network": {
      "TrustedHostsWithoutSSL": [
        "localhost",
        "my-proxy.local:8080"
        // ...
      ],
      "UnsignedCertsTrustedThumbprints": [
        "ABCDEF1234567890ABCDEF1234567890ABCDEF12",
        "ABCDEF1234567890ABCDEF1234567890ABCDEF34"
        // ...
      ]
    }
    // ...
  }
}

By the way, all these can be managed via env vars, so you could also set the values in your Docker config files:

KernelMemory__Network__TrustedHostsWithoutSSL__0=localhost
KernelMemory__Network__TrustedHostsWithoutSSL__1=my-proxy.local:8080
KernelMemory__Network__UnsignedCertsTrustedThumbprints__0=ABCDEF1234567890ABCDEF1234567890ABCDEF12
KernelMemory__Network__UnsignedCertsTrustedThumbprints__1=ABCDEF1234567890ABCDEF1234567890ABCDEF34

alexmg commented 2 weeks ago

@dluc Yes, I believe this would solve the certificate trust issue and allow HTTPS traffic to move between the containers within the orchestration.

The ASP.NET Core dev certificate that is mounted into the container has a CN of localhost so I believe that would be the most common value. I'm sure having the ability to provide additional host names would be beneficial to others too. My understanding is that the CN in a certificate should not have protocol, port, or path.

I also like the idea of pinning to certificate thumbprints but agree this should be optional because the dev certificate on each developer machine will have a different thumbprint. It's nice to have an F5 experience out of the box that doesn't require additional configuration.

It's possible to run pre-built container images with the dev certificate and this should work for that too.

https://learn.microsoft.com/en-us/aspnet/core/security/docker-https?view=aspnetcore-8.0#running-pre-built-container-images-with-https

dluc commented 2 weeks ago

The ASP.NET Core dev certificate that is mounted into the container has a CN of localhost so I believe that would be the most common value. I'm sure having the ability to provide additional host names would be beneficial to others too. My understanding is that the CN in a certificate should not have protocol, port, or path

I would not use CN for validation, but rather the cert thumbprint. It's very difficult, practically impossible as far as I know, to create a cert with a given thumbprint.

You can run this command to find the value we would put in the config:

dotnet dev-certs https --check

A valid certificate was found: E12B2726F0DA59D112807735CFB6F5BC2E4CADBC - CN=localhost - Valid from 2024-10-10 14:12:42Z to 2025-10-10 14:12:42Z - IsHttpsDevelopmentCertificate: true - IsExportable: true

Config value: E12B2726F0DA59D112807735CFB6F5BC2E4CADBC

dluc commented 2 weeks ago

Here's some sample code if you want to try updating the PR and testing with your certs:


// namespaces needed by this snippet
using System.Net.Http;
using System.Net.Security;

// thumbprint of the cert in your docker image
// TODO: this will be a list, for flexibility
string trustedThumbprint = "ABCDEF1234567890ABCDEF1234567890ABCDEF12";

var handler = new HttpClientHandler
{
    ServerCertificateCustomValidationCallback = (request, certificate, chain, sslPolicyErrors) =>
    {
        // Allow invalid root CA (ignores certificate errors related to untrusted roots)
        if (sslPolicyErrors == SslPolicyErrors.None || sslPolicyErrors == SslPolicyErrors.RemoteCertificateChainErrors)
        {
            string remoteCertThumbprint = certificate?.GetCertHashString()?.ToUpperInvariant();
            // TODO: check against a list 
            return string.Equals(remoteCertThumbprint, trustedThumbprint, StringComparison.OrdinalIgnoreCase);
        }

        // reject other SSL errors
        return false;
    }
};

using var httpClient = new HttpClient(handler);
try
{
    // note: using "https"
    var response = await httpClient.GetAsync("https://your-proxy");
    // ...
}
catch (HttpRequestException ex)
{
    // ...
}

alexmg commented 1 week ago

Hi @dluc.

I have updated the branch with the ServerCertificateCustomValidationCallback implementation and had to adjust the logic around checking the SslPolicyErrors values. Because the ASP.NET Core development certificate is self-signed and the request between containers is not to localhost, both the RemoteCertificateNameMismatch and RemoteCertificateChainErrors flags are set on the SslPolicyErrors value provided to the callback.
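For illustration, the adjusted check looks roughly like this (a sketch only, not the exact code in the branch; trustedThumbprints stands in for the new TrustedCertificateThumbprints option):

using System.Collections.Generic;
using System.Linq;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

// Illustrative sketch of the adjusted SslPolicyErrors handling (not the exact PR code)
static bool ValidateServerCertificate(
    X509Certificate2? certificate,
    SslPolicyErrors sslPolicyErrors,
    IReadOnlyCollection<string> trustedThumbprints)
{
    // No SSL errors at all: the certificate is already trusted
    if (sslPolicyErrors == SslPolicyErrors.None) { return true; }

    // The self-signed dev cert is issued for localhost but called via a container
    // hostname, so both name-mismatch and chain errors must be tolerated here
    const SslPolicyErrors tolerated =
        SslPolicyErrors.RemoteCertificateNameMismatch | SslPolicyErrors.RemoteCertificateChainErrors;

    // Reject any other error, e.g. RemoteCertificateNotAvailable
    if ((sslPolicyErrors & ~tolerated) != SslPolicyErrors.None) { return false; }

    // Accept only certificates whose thumbprint has been explicitly configured
    string? thumbprint = certificate?.GetCertHashString();
    return thumbprint != null
           && trustedThumbprints.Contains(thumbprint, StringComparer.OrdinalIgnoreCase);
}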

I renamed the option to TrustedCertificateThumbprints because it is both the certificate chain and name that can be mismatched when using the dev certificates. The dev certificate has a CN of localhost but the call to another container would be to an internal host name like proxy.

The HttpClient with the custom HttpClientHandler is only created if one is not already provided and the configuration has values configured in TrustedCertificateThumbprints.
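The wiring is roughly the following (again a sketch with placeholder names rather than the exact PR code, building on the callback sketched above; returning null means the SDK keeps using its default client):

// Sketch of the conditional HttpClient creation (names are placeholders, not the exact PR code)
static HttpClient? BuildHttpClientOrNull(HttpClient? existingHttpClient, IReadOnlyCollection<string> trustedThumbprints)
{
    // Reuse a caller-supplied client if one was provided
    if (existingHttpClient != null) { return existingHttpClient; }

    // Without configured thumbprints there is nothing to customize
    if (trustedThumbprints.Count == 0) { return null; }

    var handler = new HttpClientHandler
    {
        // Delegate certificate decisions to the thumbprint-based check sketched earlier
        ServerCertificateCustomValidationCallback = (request, cert, chain, errors) =>
            ValidateServerCertificate(cert, errors, trustedThumbprints)
    };
    return new HttpClient(handler);
}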

I noted there was an opportunity to utilize the IHttpClientFactory to extend the functionality to other services like Anthropic, but decided to leave the options on the AzureOpenAIConfig class for now. I guess they could be moved to a more central location if others are proxying additional services and would benefit from the same feature.