[Observability] Initial Observability for Chat Models

In the GenAI ecosystem, many solutions have been introduced that are centered around the concepts of observability and evaluation of Large Language Models. Unlike more traditional applications, observability becomes even more a concern to be addressed already as part of the local development workflow, because it enables refining prompts and tuning the model integration to fit the case at hand. As such, I think it would be really great to start introducing some basic observability features in Spring AI before cutting the 1.0.0 GA release.

Background

From a Spring AI perspective, I believe it's useful to categorise the available GenAI observability products as follows:

Products relying on OpenTelemetry to standardize the instrumentation, collection, and visualisation of telemetry data (traces, metrics, logs, and more). Examples: OpenLit, OpenLLMetry/TraceLoop, Arize Phoenix, LangTrace.
Products relying on proprietary/custom strategies for instrumentation, collection, and visualisation of telemetry data. Examples: LangSmith, LangFuse, Lunary.

Considering the consolidation around OpenTelemetry (the industry standard for observability) and the many integration possibilities with the rest of the ecosystem, I would recommend drawing inspiration from the first group of products to understand better the types of features needed for observability, with a possible goal to integrate them with Spring AI at some point. There is also work in progress to agree on semantic conventions for GenAI applications in the OpenTelemetry project. Those conventions are still under active development, but they are being already experimentally adopted in products like OpenLit and OpenLLMetry/TraceLoop. A similar attempt of agreeing on standard naming conventions is also brought forward within the OpenInference group.

Context

What does it mean observing an LLM-powered applications? There are many use cases, going from more general observability to specific evaluation scenarios. For starters, I want to focus on the fundamentals, as also highlighted in this blog post from OpenTelemetry. In particular, what kind of telemetry do we need?

LLM Request

Provider (mistral)
Operation (chat.completion)
Model (open-mistral-7b)
Temperature (0.7)
Top P (0)
Prompt content (sensitive data, so to be handled optionally and separately)

LLM Response

Model (open-mistral-7b-0118999)
Finish reason (stop)
Used prompt tokens (420)
Used completion tokens (42)
Completion content (sensitive data, so to be handled optionally and separately)

Proposal

My suggestion for addressing observability in Spring AI is to split the problems into smaller tasks.

Instrumenting the code in Spring AI to allow collecting telemetry data for model integrations, vector stores, and higher-level AI workflows.
Configuring the export of telemetry data to comply with specific conventions and allow integrations with different platforms.

In particular, I propose to start focusing on the chat models to refine and validate the solution, before extending the scope to other types of models, vector stores, and higher-level AI workflows.

Micrometer allows to instrument the code once and export telemetry both via OpenZipkin and OpenTelemetry, and offers good APIs to plug in different semantic conventions to customise the exported telemetry. So it aligns with how the rest of the Spring ecosystem is instrumented, but it will also allow to integrate with all those LLM Observability solutions that follow the OpenTelemetry standard. Furthermore, interested vendors can always implement their custom exporters and hook them into the Micrometer-based instrumentations, even though they rely on proprietary data formats and protocols.

We can split this task further into two activities:

Micrometer core libraries to define AI-related observations, non-Spring specific. I'm suggesting it here in the context of Spring AI, but it would be nice to have a shared Micrometer foundation re-usable across the Java ecosystem.
Instrumentation for Spring AI-specific areas, built on top of the common Micrometer foundation. This also includes auto-configuration.

Micrometer AI Foundation

For the Micrometer foundation, I suggest to start with the entities shown in the diagram. In particular:

The ModelObservationContext will hold all the data related to chat requests and responses that we are interested in possibly use in the telemetry data.
A default implementation of a ModelObservationConvention that will determine how to use the contextual data to build low cardinality and high cardinality key-value pairs, that will ultimately end up being exported as metrics and traces.

Additionally, there can also be two default implementations of ObservationFilter to optionally include prompt and completion content in the telemetry. This is where we can fulfil the feature request for logging prompts and completions, as I mentioned in https://github.com/spring-projects/spring-ai/issues/512#issuecomment-2185096414.

The default ModelObservationConvention implementation can be replaced with a different one to adopt different semantic conventions, for example to adopt the standard OpenTelemetry conventions or to integrate better with solutions like OpenLit or OpenLLMetry. I would keep that part out of scope since it's an area under active development and things will change very often. I imagine having an external library providing those experimental conventions (I'm actually drafting something in that direction).

Untitled-2024-06-25-2327

Chat Models Instrumentation

Building on top of that foundation, we need to instrument the ChatModel implementations in Spring AI. We can't currently rely purely on the information available through the contract of the ChatModel interface (input: Prompt, output: ChatResponse), meaning we cannot leverage any interface/abstract class/AOP. Most of the contextual data needed for observing an LLM request is available in the model-specific API which is part of the specific implementation and not available through the interface/abstract class (this might be a point to consider for the future, if there's a way to expand the current abstractions).

We could of course go a level lower and instrument based on the provider-specific request and response objects, which can be orchestrated via a parent abstract class (like the doChatCompletion() method currently implemented in the models that support functions). On the one hand, it might allow some centralisation of the instrumentation trigger (the observe() wrapper method from Micrometer), but it would also mean having to extract data from the provider-specific integration twice: one time for the observation context and one time for the Spring AI abstractions, overall resulting in more code and less efficient implementation. When evaluating pros and cons, I tend towards the first option I mentioned.

As shown in the diagram, we populate a ModelObservationContext from the input and output of the call() method in OpenAiChatModel ( Prompt and ChatResponse), but also the OpenAI-specific and internal ChatCompletionRequest object.

Untitled-2024-06-25-2327-2

Finally, we can introduce auto-configuration to pass an optional ObservationRegistry to OpenAiChatModel that will enable the instrumentation. And we can also define optional beans controlled via configuration properties, for example the ObservationFilters for including prompt and completion content into the telemetry.

Discussion

I'm looking forward to hearing from you with some feedback and thoughts regarding this proposal. In particular, @markpollack and @tzolov, what do you think? Also, the Micrometer team might have some inputs about this and perhaps suggestions on how to get a base foundation to be used across the Java ecosystem as well. @marcingrzejszczak

I have opened a WIP pull request to present a possible implementation of some of the ideas I shared above. For now, I focused mostly on traces and chat models (OpenAI as an example). If you'd like to give it try, I prepared a simple app to manually verify the observations exported via OpenTelemetry.

I've been doing more experiments and you can see some of them in this other project, but it's very messy and outdated, so don't spend too much time on it :) There I tried the semantic convention customisation and integration with OpenLit and OpenLLMetry. I also started a discussion within the OpenLit project around some ideas that will make it more straightforward to have Spring AI integrated with that solution, all centered around the OpenTelemetry standardisation (https://github.com/openlit/openlit/issues/300 and https://github.com/openlit/openlit/issues/299).