Open ThomasVitale opened 1 week ago
Hey @ThomasVitale ,
This is exactly what I've been exploring for the past week or so. It's great and promising stuff.
I did a spike: https://github.com/tzolov/spring-ai/tree/observability-support to explore what the end solution might look like.
I wanted to grasp the boundaries, challenges, and limitations, and to find out how well we can instrument the entire execution flows and so on.
So in this branch I've implemented (basic) observation instrumentation for `ChatModel`, `EmbeddingModel`, `VectorStore`, `ChatClient`, and Advisors.
I also used the ai-observability-demo app to test the E2E solution.
For example below are the Tempo traces for a RAG + ChatMemory pipeline:
```java
var response = chatClient.prompt()
        .user("How does Carina work?")
        .advisors(new QuestionAnswerAdvisor(vectorStore, SearchRequest.defaults()))
        .advisors(new PromptChatMemoryAdvisor(chatMemory))
        .call()
        .chatResponse();
```
@ThomasVitale I agree with most of the observations and conclusions you make above, but I will contact you to start a f2f discussion to clarify a few technical/design problems I came across. Most importantly, we need to discuss the conventions and how to structure the observations across the projects.
I will try to reach out to you on LinkedIn.
In the GenAI ecosystem, many solutions have been introduced that are centered around the observability and evaluation of Large Language Models. Unlike in more traditional applications, observability becomes a concern to address already as part of the local development workflow, because it enables refining prompts and tuning the model integration to fit the case at hand. As such, I think it would be really great to introduce some basic observability features in Spring AI before cutting the 1.0.0 GA release.
Background
From a Spring AI perspective, I believe it's useful to categorise the available GenAI observability products as follows:
Considering the consolidation around OpenTelemetry (the industry standard for observability) and the many integration possibilities with the rest of the ecosystem, I would recommend drawing inspiration from the first group of products to better understand the types of features needed for observability, with a possible goal of integrating them with Spring AI at some point. There is also work in progress to agree on semantic conventions for GenAI applications in the OpenTelemetry project. Those conventions are still under active development, but they are already being adopted experimentally in products like OpenLit and OpenLLMetry/TraceLoop. A similar attempt at agreeing on standard naming conventions is also being brought forward within the OpenInference group.
Context
What does it mean to observe an LLM-powered application? There are many use cases, ranging from general observability to specific evaluation scenarios. For starters, I want to focus on the fundamentals, as also highlighted in this blog post from OpenTelemetry. In particular, what kind of telemetry do we need?
LLM Request
LLM Response
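To make this concrete, here is an illustrative sketch (as a plain Java map, just for readability) of the kind of attributes the experimental OpenTelemetry GenAI semantic conventions attach to an LLM request/response span. The attribute names shown are the experimental upstream ones and may well change; this is an example, not a stable contract.

```java
import java.util.Map;

// Illustrative only: example span attributes for a chat request/response,
// loosely following the experimental OpenTelemetry GenAI semantic
// conventions. Names and values are subject to change upstream.
public class GenAiSpanAttributes {

    public static Map<String, Object> exampleChatSpanAttributes() {
        return Map.of(
                // request side
                "gen_ai.system", "openai",
                "gen_ai.request.model", "gpt-4o",
                "gen_ai.request.temperature", 0.7,
                "gen_ai.request.max_tokens", 1024,
                // response side
                "gen_ai.response.model", "gpt-4o-2024-05-13",
                "gen_ai.response.finish_reasons", "stop",
                // token usage, typically also exported as metrics
                "gen_ai.usage.prompt_tokens", 42,
                "gen_ai.usage.completion_tokens", 128);
    }

    public static void main(String[] args) {
        exampleChatSpanAttributes().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Prompt and completion *content* is deliberately absent here: it is high-cardinality and potentially sensitive, which is why it should be opt-in.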
Proposal
My suggestion for addressing observability in Spring AI is to split the problem into smaller tasks.
In particular, I propose to start focusing on the chat models to refine and validate the solution, before extending the scope to other types of models, vector stores, and higher-level AI workflows.
Micrometer makes it possible to instrument the code once and export telemetry via both OpenZipkin and OpenTelemetry, and it offers good APIs for plugging in different semantic conventions to customise the exported telemetry. It aligns with how the rest of the Spring ecosystem is instrumented, but it will also allow integration with all those LLM observability solutions that follow the OpenTelemetry standard. Furthermore, interested vendors can always implement their own custom exporters and hook them into the Micrometer-based instrumentation, even if they rely on proprietary data formats and protocols.
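The "instrument once, export everywhere" idea can be sketched in plain Java. The types below are simplified stand-ins, not the real `io.micrometer` API: the instrumented code is wrapped in a single `observe()` call, and every registered handler (e.g. one producing metrics, one producing traces) sees the same start/stop events.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Plain-Java sketch of the Micrometer Observation idea (simplified
// stand-in types, not the actual Micrometer API): one observe() call,
// many handlers consuming the same lifecycle events.
public class ObserveOnceSketch {

    interface Handler {
        void onStart(String name);
        void onStop(String name, long durationNanos);
    }

    static class Registry {
        final List<Handler> handlers = new ArrayList<>();

        <T> T observe(String name, Supplier<T> body) {
            handlers.forEach(h -> h.onStart(name));
            long t0 = System.nanoTime();
            try {
                return body.get();
            } finally {
                long elapsed = System.nanoTime() - t0;
                handlers.forEach(h -> h.onStop(name, elapsed));
            }
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        // a "metrics" handler and a "tracing" handler share one instrumentation
        registry.handlers.add(new Handler() {
            public void onStart(String name) { System.out.println("metrics: start " + name); }
            public void onStop(String name, long d) { System.out.println("metrics: stop " + name); }
        });
        registry.handlers.add(new Handler() {
            public void onStart(String name) { System.out.println("trace: start " + name); }
            public void onStop(String name, long d) { System.out.println("trace: stop " + name); }
        });
        String result = registry.observe("spring.ai.chat", () -> "model response");
        System.out.println(result);
    }
}
```

In the real Micrometer API this role is played by `ObservationRegistry` and its `ObservationHandler`s; the point is only that the model code needs exactly one instrumentation site.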
We can split this task further into two activities:
Micrometer AI Foundation
For the Micrometer foundation, I suggest starting with the entities shown in the diagram. In particular:

A `ModelObservationContext` will hold all the data related to chat requests and responses that we may want to use in the telemetry. A `ModelObservationConvention` will determine how to use that contextual data to build low-cardinality and high-cardinality key-value pairs, which will ultimately be exported as metrics and traces. Additionally, there can be two default implementations of `ObservationFilter` to optionally include prompt and completion content in the telemetry. This is where we can fulfil the feature request for logging prompts and completions, as I mentioned in https://github.com/spring-projects/spring-ai/issues/512#issuecomment-2185096414.

The default `ModelObservationConvention` implementation can be replaced with a different one to adopt other semantic conventions, for example the standard OpenTelemetry conventions, or to integrate better with solutions like OpenLit or OpenLLMetry. I would keep that part out of scope, since it's an area under active development and things will change very often. I imagine an external library providing those experimental conventions (I'm actually drafting something in that direction).

Chat Models Instrumentation
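Before diving into a specific model, the foundation entities just described can be sketched in plain Java. These are simplified, hypothetical shapes, not the actual Spring AI or Micrometer types: the context carries the request/response data, and the convention decides which of it becomes low-cardinality tags (safe as metric dimensions) and which becomes high-cardinality tags (traces only).

```java
import java.util.Map;

// Hypothetical, simplified shapes of the proposed foundation entities
// (not the actual Spring AI or Micrometer types).
public class ModelObservationSketch {

    record ModelObservationContext(String provider, String model,
                                   String prompt, String completion) {}

    interface ModelObservationConvention {
        Map<String, String> lowCardinalityKeyValues(ModelObservationContext ctx);
        Map<String, String> highCardinalityKeyValues(ModelObservationContext ctx);
    }

    static class DefaultConvention implements ModelObservationConvention {
        @Override
        public Map<String, String> lowCardinalityKeyValues(ModelObservationContext ctx) {
            // bounded value sets: safe to use as metric dimensions
            return Map.of("ai.provider", ctx.provider(), "ai.model", ctx.model());
        }

        @Override
        public Map<String, String> highCardinalityKeyValues(ModelObservationContext ctx) {
            // unbounded values: traces only, and only if the user opts in
            // (in the proposal, that opt-in lives in an ObservationFilter)
            return Map.of("ai.prompt", ctx.prompt(), "ai.completion", ctx.completion());
        }
    }

    public static void main(String[] args) {
        var ctx = new ModelObservationContext("openai", "gpt-4o",
                "How does Carina work?", "Carina is ...");
        var convention = new DefaultConvention();
        System.out.println(convention.lowCardinalityKeyValues(ctx));
        System.out.println(convention.highCardinalityKeyValues(ctx));
    }
}
```

Swapping `DefaultConvention` for another implementation is what would let a user adopt, say, the experimental OpenTelemetry GenAI conventions without touching the instrumented code.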
Building on top of that foundation, we need to instrument the `ChatModel` implementations in Spring AI. We can't currently rely purely on the information available through the contract of the `ChatModel` interface (input: `Prompt`, output: `ChatResponse`), meaning we cannot leverage any interface/abstract class/AOP. Most of the contextual data needed for observing an LLM request is available in the model-specific API, which is part of the specific implementation and not available through the interface/abstract class (this might be a point to consider for the future, if there's a way to expand the current abstractions).

We could of course go a level lower and instrument based on the provider-specific request and response objects, which can be orchestrated via a parent abstract class (like the `doChatCompletion()` method currently implemented in the models that support functions). On the one hand, it might allow some centralisation of the instrumentation trigger (the `observe()` wrapper method from Micrometer), but it would also mean having to extract data from the provider-specific integration twice: once for the observation context and once for the Spring AI abstractions, resulting overall in more code and a less efficient implementation. When evaluating pros and cons, I tend towards the first option I mentioned.

As shown in the diagram, we populate a `ModelObservationContext` from the input and output of the `call()` method in `OpenAiChatModel` (`Prompt` and `ChatResponse`), but also from the OpenAI-specific, internal `ChatCompletionRequest` object.

Finally, we can introduce auto-configuration to pass an optional `ObservationRegistry` to `OpenAiChatModel`, which will enable the instrumentation. And we can also define optional beans controlled via configuration properties, for example the `ObservationFilter`s for including prompt and completion content in the telemetry.

Discussion
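As a concrete starting point for that discussion, here is a rough sketch of the instrumentation point inside an OpenAI-style chat model, as described above. Every type here (`Prompt`, `ChatResponse`, `ChatCompletionRequest`, the context) is a simplified stand-in for discussion purposes, not the actual implementation.

```java
// Hedged sketch of the proposed instrumentation point inside a chat
// model. All types are simplified stand-ins, not the actual Spring AI
// classes; the real version would wrap this in Micrometer's observe().
public class InstrumentedCallSketch {

    record Prompt(String text) {}
    record ChatResponse(String text) {}
    // provider-specific request, only visible inside the implementation
    record ChatCompletionRequest(String model, String payload) {}

    static class ModelObservationContext {
        Prompt prompt;
        ChatCompletionRequest request;
        ChatResponse response;
    }

    static ChatResponse call(Prompt prompt) {
        var context = new ModelObservationContext();
        context.prompt = prompt;
        // the provider-specific request is where most contextual data lives,
        // which is why instrumenting through the ChatModel interface alone
        // is not enough
        context.request = new ChatCompletionRequest("gpt-4o", prompt.text());
        // stand-in for the actual provider call; in the proposal, this whole
        // section runs inside an Observation tied to an ObservationRegistry
        ChatResponse response = new ChatResponse("echo: " + prompt.text());
        context.response = response;
        return response;
    }

    public static void main(String[] args) {
        System.out.println(call(new Prompt("How does Carina work?")).text());
    }
}
```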
I'm looking forward to hearing from you with some feedback and thoughts regarding this proposal. In particular, @markpollack and @tzolov, what do you think? Also, the Micrometer team might have some input about this, and perhaps suggestions on how to get a base foundation to be used across the Java ecosystem as well. @marcingrzejszczak
I have opened a WIP pull request to present a possible implementation of some of the ideas I shared above. For now, I focused mostly on traces and chat models (OpenAI as an example). If you'd like to give it a try, I prepared a simple app to manually verify the observations exported via OpenTelemetry.
I've been doing more experiments, and you can see some of them in this other project, but it's very messy and outdated, so don't spend too much time on it :) There I tried the semantic convention customisation and integration with OpenLit and OpenLLMetry. I also started a discussion within the OpenLit project around some ideas that would make it more straightforward to have Spring AI integrated with that solution, all centered around the OpenTelemetry standardisation (https://github.com/openlit/openlit/issues/300 and https://github.com/openlit/openlit/issues/299).
Example: Micrometer + OpenTelemetry + Grafana (default conventions)
Example: Micrometer + OpenTelemetry + OpenLit (custom conventions)