spring-projects / spring-ai

An Application Framework for AI Engineering
https://docs.spring.io/spring-ai/reference/1.0-SNAPSHOT/index.html
Apache License 2.0

HTTP Client configuration for models and vector stores #512

Open ThomasVitale opened 3 months ago

ThomasVitale commented 3 months ago

Enhancement Description

Each model integration is composed of two parts: an *Api class that calls the model provider over HTTP, and a *Client class that encapsulates the LLM-specific aspects.

Each *Client class is highly customizable thanks to well-designed interfaces, making it possible to override many different options. It would be nice to provide similar flexibility for each *Api class as well. In particular, it would be useful to be able to configure options related to the HTTP client.

Examples of aspects that would need to be configured:

Furthermore, there might be additional needs for configuring resilience patterns:

More settings that are currently part of the model connection configuration (and that still relate to the HTTP interaction) would also need to be customisable in enterprise production use cases (e.g. multi-user or even multi-tenant applications). For example, when using OpenAI, the following could need changing per request/session.

All the above is focused on the HTTP interactions with model providers, but the same would be useful for vector stores.

Possible Solutions

Drawing from the nice abstractions designed to customize the model integrations and ultimately implementing the ModelOptions interface, it could be an idea to define a dedicated abstraction to pass HTTP client customizations to an *Api class (something like HttpClientConfig), which might also be exposed via configuration properties (under spring.ai.<model>.client.*).
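
As a rough illustration of that idea, a configuration carrier bound to such properties could look like the sketch below. HttpClientConfig and the individual property names are hypothetical; only the spring.ai.<model>.client.* prefix comes from the proposal above.

```java
// Hypothetical sketch: neither this class nor these properties exist in Spring AI today.
import java.time.Duration;
import org.springframework.boot.context.properties.ConfigurationProperties;

// Could be bound from spring.ai.openai.client.* (or the equivalent prefix for another model).
@ConfigurationProperties("spring.ai.openai.client")
public record HttpClientConfig(
        Duration connectTimeout,    // e.g. 5s
        Duration readTimeout,       // e.g. 60s; long completions need generous values
        boolean logRequests,        // whether to register a request-logging interceptor
        boolean logResponses,       // whether to register a response-logging interceptor
        String sslBundle            // name of a Spring Boot SSL bundle, for custom CAs
) {
}
```

An *Api class could then translate these values into RestClient/WebClient settings during auto-configuration, so the same properties cover both blocking and streaming interactions.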

For the more specific resilience configurations (like retries and fallbacks), an annotation-driven approach might be more suitable. Resilience4j might provide a way to achieve this, since I don't think Spring supports the MicroProfile Fault Tolerance spec.
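
For reference, a minimal sketch of what the annotation-driven approach could look like with the Resilience4j Spring Boot starter (resilience4j-spring-boot3 on the classpath). The retry instance name, the fallback, and the ChatModel wiring are illustrative only, and Spring AI class/package names have shifted between milestones.

```java
import io.github.resilience4j.retry.annotation.Retry;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.stereotype.Service;

@Service
class ResilientChatService {

    private final ChatModel chatModel;

    ResilientChatService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    // "openai" must match a resilience4j.retry.instances.openai entry in the application properties (made-up name).
    @Retry(name = "openai", fallbackMethod = "fallback")
    ChatResponse ask(Prompt prompt) {
        return chatModel.call(prompt);
    }

    // The fallback receives the original arguments plus the failure cause.
    ChatResponse fallback(Prompt prompt, Throwable cause) {
        // Kept trivial for the sketch; a real fallback could return a cached or canned response instead.
        throw new IllegalStateException("Model call failed after retries", cause);
    }
}
```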

A partial alternative solution would be for developers to define a custom RestClient.Builder or WebClient.Builder and pass it to each *Api class, but that would require a lot of extra configuration and reduce the convenience of the autoconfiguration. It would also tie a generic concern like "enable logs" or "use a custom CA" to the specific client being used, resulting in duplication when both blocking and streaming interactions are used in the same application.
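
To make that trade-off concrete, the manual workaround looks roughly like this. It assumes an OpenAiApi constructor that accepts a RestClient.Builder (the exact signature may differ between Spring AI versions), and it would have to be repeated with a WebClient.Builder whenever streaming is used.

```java
import org.springframework.ai.openai.api.OpenAiApi;
import org.springframework.web.client.RestClient;

class ManualOpenAiApiSetup {

    // Builds an OpenAiApi on top of a RestClient.Builder with a (very naive) logging interceptor.
    static OpenAiApi openAiApi(String apiKey) {
        RestClient.Builder builder = RestClient.builder()
                .requestInterceptor((request, body, execution) -> {
                    System.out.println(request.getMethod() + " " + request.getURI());
                    return execution.execute(request, body);
                });
        // Constructor shape assumed from current snapshots; check the version you are on.
        return new OpenAiApi("https://api.openai.com", apiKey, builder);
    }
}
```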

I'm available to contribute and help solve this issue.

Related Issues

thingersoft commented 3 months ago

Hello, that's more or less the same strategy I thought of using for a generic approach to the timeout problem. I think it's a crucial aspect to take care of before moving towards a 1.0 release, since we are talking about common requirements for non-streaming consumers. Also, when a read timeout occurs you lose the response forever, and for larger commercial models that means money.

I'd be available to contribute too, but so far I've had little luck getting feedback from the project owners.

markpollack commented 3 weeks ago

There is a lot to unpack here, so let's start small and work our way to more features.

At the lowest level, we are using either our own hand-written client to talk to the model (OpenAiApi is a perfect example) or a provider-supplied library. If a user is operating at the hand-written client level, there are a few things that can be done.

  1. We can add some trace or debug level logging that can be enabled in the typical spring boot manner.
  2. A user can also register a RestClientCustomizer bean, roughly as sketched below.
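
A minimal sketch, assuming the goal is request logging (the interceptor body is a placeholder):

```java
import org.springframework.boot.web.client.RestClientCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class HttpClientLoggingConfig {

    // Applied to every auto-configured RestClient.Builder in the application.
    @Bean
    RestClientCustomizer loggingCustomizer() {
        return builder -> builder.requestInterceptor((request, body, execution) -> {
            System.out.println("Calling " + request.getURI());
            return execution.execute(request, body);
        });
    }
}
```

The catch, discussed further down in the thread, is that such a customizer applies to every auto-configured RestClient.Builder in the application, not just the one talking to the model provider.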

For other models, for example Azure OpenAI or Google Vertex AI, we are using client libraries provided by Microsoft and Google, so we can't use the approach above.

We can, however, work at a higher level, the ChatClient level. I first thought we could introduce a logging advisor to the code base, but the advisor doesn't yet have access to the final prompt, only the parts that go into making it. So instead we should update the ChatModel implementations to do the logging at the appropriate places in those classes. This issue discusses that.

Potentially we can still have a logging advisor, but it would serve a different purpose, and is likely still a useful addition.

On the separate topic of retry: this could potentially move out of the *Api classes and into an advisor. The issue there is that retry would only kick in when using ChatClient, not when using the *Api classes directly. I suspect the right strategy is to put retry in at the lowest level when we can, and also provide a retry advisor for when we don't control the underlying library that communicates with the AI model.
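
For the lowest-level variant, a rough sketch with Spring Retry; the policy values and the ChatModel wiring are illustrative only (package names have moved between Spring AI milestones), and a retry advisor would wrap the call in essentially the same way.

```java
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.retry.support.RetryTemplate;

class RetryingChatCaller {

    // Illustrative policy: 3 attempts, exponential backoff starting at 1s, capped at 10s.
    private final RetryTemplate retryTemplate = RetryTemplate.builder()
            .maxAttempts(3)
            .exponentialBackoff(1_000, 2.0, 10_000)
            .build();

    private final ChatModel chatModel;

    RetryingChatCaller(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    ChatResponse call(Prompt prompt) {
        // Each attempt re-sends the full prompt, which is the cost concern raised earlier in the thread.
        return retryTemplate.execute(ctx -> chatModel.call(prompt));
    }
}
```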

Thoughts?

piotrooo commented 3 weeks ago

I like the idea of creating advisors for logging purposes :+1:

However, when thinking about retry logic...

Currently, we handle two ways of calling models:

  1. Using an HTTP client *Api - RestClient or WebClient
  2. Using an SDK - such as Azure OpenAIClient or Google GenerativeModel

I imagine the retry logic should be the same across all models. Tying it to the *Api classes doesn't allow us to reuse it in the SDK scenarios. Additionally, we should consider models that don't use a ChatClient, such as transcription or speech models.

Therefore, I suggest introducing a new retry layer — or even more broadly, a resilience layer (starting with retry support but with the potential to add new features in the future):

[diagram: the proposed resilience/retry layer]

There could also be several other layers for customizing the HTTP client and so on, as @ThomasVitale mentioned.
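
To make the proposal a bit more concrete, a rough sketch of what such a layer could look like, kept deliberately agnostic of how the model is reached; the class name and the RetryTemplate policy are made up.

```java
import java.util.function.Supplier;
import org.springframework.retry.support.RetryTemplate;

// A resilience layer that wraps any model call behind the same retry policy,
// whether the call goes through a RestClient-based *Api class or a vendor SDK.
class ResilienceLayer {

    // Illustrative policy: 3 attempts with a fixed 2s backoff.
    private final RetryTemplate retryTemplate = RetryTemplate.builder()
            .maxAttempts(3)
            .fixedBackoff(2_000)
            .build();

    <T> T execute(Supplier<T> modelCall) {
        return retryTemplate.execute(ctx -> modelCall.get());
    }
}
```

The same instance could then wrap chat, transcription, or speech calls, regardless of the underlying client.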

ThomasVitale commented 3 weeks ago

@markpollack @piotrooo thank you both for sharing your thoughts!

I see two types of logs that can be useful in an application using Spring AI. My original intent with this issue was to cover the first type.

  1. HTTP Requests/Responses. Logging of the headers and/or body of the HTTP interactions with an LLM provider. For example, this is useful when troubleshooting the underlying format of a request/response and spotting JSON conversion errors or incompatibilities with updated provider APIs.

    • For all the *Api classes provided by Spring AI, I think there should be a way to customise the underlying RestClient or WebClient with a logging interceptor (and similarly configure timeouts and SslBundles). The workaround shown here is good enough for experiments, but it cannot really be used in a real-world application because the RestClientCustomizer/WebClientCustomizer would be shared across the whole application.
    • For all the integrations where third-party libraries are used (such as Vertex AI or Azure OpenAI), I expect those libraries to provide their own options for logging requests/responses (as well as timeouts and TLS). That's not something Spring AI can solve (unless perhaps surfacing some auto-configuration properties, should that capability exist in those libraries).
  2. Prompt/Completion. Logging of the content of a prompt or a completion. For example, this is very important for prompt design/evaluation and observability. I would not recommend implementing such functionality via explicit log statements in the ChatClient API (or the underlying ChatModel). Instead, I would frame this feature in the broader context of introducing observability for Spring AI. Using the Micrometer Observation API, it's possible to instrument the ChatModel classes once and configure logs, metrics, and traces through the Micrometer machinery. It's critical to include prompt/completion content in the observability solution because it's necessary for any evaluation/prompt-design integration. I have a draft solution I'll share soon; I need to polish a few things first. I wouldn't introduce a LoggingAdvisor at the moment. I think we need first the observability foundation at the ChatModel level before addressing further observability needs at the ChatClient level (using Advisors to offer observability for these higher-level workflows/chains, which typically consist of multiple LLM requests and function calls).
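
A very small sketch of the Observation-based instrumentation idea; the observation name and key-values are placeholders, not the conventions the draft will propose.

```java
import java.util.function.Supplier;

import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;

class ChatModelObservations {

    private final ObservationRegistry registry;

    ChatModelObservations(ObservationRegistry registry) {
        this.registry = registry;
    }

    // Wraps a model call so that logs, metrics, and traces can all be derived from one instrumentation point.
    <T> T observe(String modelName, Supplier<T> modelCall) {
        return Observation.createNotStarted("spring.ai.chat.model", registry)
                .lowCardinalityKeyValue("model", modelName)
                .observe(modelCall);
    }
}
```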

What do you think?

piotrooo commented 3 weeks ago

> That's not something Spring AI can solve (unless perhaps surfacing some auto-configuration properties, should that capability exist in those libraries).

I thought about introducing customizers for the SDK clients, but I'm not fully convinced by that approach. Still, it's probably how I'd want to customize, e.g., the Azure OpenAIClient (and others).

> I think we need first the observability foundation at the ChatModel level before addressing further observability needs at the ChatClient level (using Advisors to offer observability for these higher-level workflows/chains, which typically consist of multiple LLM requests and function calls).

Right now, ChatClient is going to be a Swiss army knife: observability, retries, and, of course, sending requests to the model :grimacing:. But for now, I don't have a better idea.