run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[RFC]: Tracing Ontology (WIP) #6688

Closed jon-chuang closed 7 months ago

jon-chuang commented 1 year ago

Feature Description

LlamaIndex can be thought of as an orchestrator and prompt management system across various subtasks.

Here is a proposed ontology of a retrieval pipeline:

  1. session: a session is a multi-round interaction consisting of multiple runs
  2. run: a run is a single round trip from client back to client (e.g. REST endpoint, jupyter notebook cell run). It can consist of multiple tasks.
    • the reasons for this become clearer when we consider that we want run-level _triggers_ such as traceback/debug string upon completion. For runs, we will allow completion triggers to depend on subtasks. However, we will _not_ be providing triggers at the task level whose conditions depend on subtasks, as this is likely _too computationally prohibitive_.
    • by getting rid of `StreamingResponse`, we may be able to side-step the above problem and provide a per-task completion handler. However, I am still not sure if per-task completion triggers are desirable.
  3. task (currently called Event): A task is a basic unit of work. It can consist of multiple subtasks, which are themselves tasks. A task can occur at any level of granularity; the granularity is defined by its task_type (currently EventType), and each task can instantiate its own callback handler. Examples:
    • embed
    • llm_predict
    • retrieve
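
For illustration, here is a rough sketch of how this hierarchy could be modeled; the class and field names are placeholders, not existing LlamaIndex types:

# Rough sketch of the proposed session/run/task hierarchy; names are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    task_type: str                                         # e.g. "embed", "llm_predict", "retrieve"
    labels: Dict[str, str] = field(default_factory=dict)
    measurements: Dict[str, str] = field(default_factory=dict)
    subtasks: List["Task"] = field(default_factory=list)   # subtasks are themselves tasks

@dataclass
class Run:
    run_id: int                                            # one round trip from client back to client
    tasks: List[Task] = field(default_factory=list)

@dataclass
class Session:
    session_id: int                                        # one multi-round interaction
    runs: List[Run] = field(default_factory=list)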

Example trace:

session_id=0
  run_id=0
    task=query_transformation
      task=llm_predict
        task_duration=420ms
        tokens_used=1283
    task=embed
      task_duration=150ms
    task=retrieve
      task_duration=54ms
    task=llm_predict
  run_id=1
    ...

Flattened, this is:

{trace: 'session_id=0,run_id=0,task=embed', labels: 'embedding_model=openai[text-embedding-ada-002]', measurements: 'task_duration=150ms'}
{trace: 'session_id=0,run_id=0,task=retrieve', labels: 'index=vector_index[weaviate[localhost:3003]]', measurements: 'task_duration=330ms'}
{trace: 'session_id=0,run_id=0,task=llm_predict', labels: 'llm_model=openai[text-davinci-003]', measurements: 'task_duration=450ms'}

Additional concepts:

  1. label: a label is an identifier for a task. Labels are stored in event payloads with a fallback to defaults in session/run/task-level CallbackHandlers. Examples:
    • llm_model: openai[text-davinci-003], custom[ggml-int4-q4]
    • embedding_model: openai[text-embedding-ada-002]
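
A minimal sketch of the fallback behaviour, using a hypothetical flatten helper that emits records in the format shown above (payload labels win; missing labels fall back to handler-level defaults):

# Hypothetical sketch: resolving labels from the event payload with a fallback
# to defaults held by a session/run/task-level handler.
from typing import Dict

HANDLER_DEFAULTS = {"llm_model": "openai[text-davinci-003]"}  # e.g. set once at the session level

def resolve_labels(payload_labels: Dict[str, str]) -> Dict[str, str]:
    # Payload labels take precedence; anything missing falls back to the defaults.
    return {**HANDLER_DEFAULTS, **payload_labels}

def flatten(session_id: int, run_id: int, task: str,
            payload_labels: Dict[str, str], measurements: Dict[str, str]) -> dict:
    return {
        "trace": f"session_id={session_id},run_id={run_id},task={task}",
        "labels": resolve_labels(payload_labels),
        "measurements": measurements,
    }

record = flatten(0, 0, "llm_predict", {}, {"task_duration": "450ms"})
# -> labels resolve to {'llm_model': 'openai[text-davinci-003]'} since the payload had none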

Example Consumers

  1. Prometheus
  2. MLFlow
  3. WandB

Example Aggregations

Here we use SQL. One should use their imagination for how these may be expressed in other query languages.

# Find the average task duration for each task
select task, avg(task_duration) as task_ms from llm_metrics where session_id=0 group by task;
task        | task_ms
----------------------
embed       | 155
retrieve    | 32
llm_predict | 425

# Find the total cost for each session
# One can define a specialized callback handler to precompute these
# But this way (dump in storage then analyze) works too
select 
  session_id,
  sum(
    case
      when task='embed' and labels['embedding_model'] = 'openai[text-embedding-ada-002]' 
        then float(measurements['token_count']) * 0.00002
      when task='llm_predict' and labels['llm_model'] = 'openai[text-davinci-003]' 
        then float(measurements['token_count']) * 0.0002
    end
  ) as dollar_cost
from 
  llm_metrics
group by
  session_id;

References

Questions

  1. Instant (trigger) vs. Eventual (collection):
    • The implementation of callbacks seems to assume instantaneous actions. However, to my understanding, many use cases like tracing and logging only need eventual actions.
  2. Equivalence / Conversion of tracing <-> logging?
    • For instance, generically convert spans (event.start <-> event.end) into duration (e.g. prometheus/MLFlow metrics)
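
As a sketch of question 2, pairing event.start/event.end into a duration that a metrics backend can ingest could look roughly like this (hook names are hypothetical):

# Rough sketch for question 2: converting a span (event.start <-> event.end)
# into a duration measurement; hook names are hypothetical.
import time
from typing import Dict

class SpanToDurationHandler:
    def __init__(self) -> None:
        self._starts: Dict[str, float] = {}

    def on_event_start(self, event_id: str) -> None:
        self._starts[event_id] = time.monotonic()

    def on_event_end(self, event_id: str, task: str) -> None:
        duration_ms = (time.monotonic() - self._starts.pop(event_id)) * 1000
        # Hand off to whichever metrics consumer is configured (Prometheus, MLFlow, ...).
        print(f"task={task} task_duration={duration_ms:.0f}ms")
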
jon-chuang commented 1 year ago

TODO:

jon-chuang commented 1 year ago

@logan-markewich After some thought, langchain's approach seemed a little off to me.

Rather than spawning and passing around entire new callback handlers and managers, it seemed saner to me to use the above session/run/task ontology, pass session/run/task <-> subtask metadata around, and have a centralized callback service. I will update as progress is made.
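
To sketch the idea (none of these names exist in the codebase; it only shows registering handlers once and dispatching session/run/task metadata through a single service):

# Very rough sketch of a centralized callback service; purely illustrative.
from typing import Dict, List, Protocol

class TraceHandler(Protocol):
    def on_task_end(self, session_id: int, run_id: int, task: str,
                    labels: Dict[str, str], measurements: Dict[str, str]) -> None: ...

class CallbackService:
    def __init__(self) -> None:
        self._handlers: List[TraceHandler] = []

    def register(self, handler: TraceHandler) -> None:
        self._handlers.append(handler)

    def task_end(self, session_id: int, run_id: int, task: str,
                 labels: Dict[str, str], measurements: Dict[str, str]) -> None:
        # A single dispatch point, instead of passing new handler objects through every call.
        for handler in self._handlers:
            handler.on_task_end(session_id, run_id, task, labels, measurements)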

logan-markewich commented 1 year ago

@jon-chuang I totally agree re: langchain's approach. Centralizing a service for this, and creating proper sessions/runs/tasks to handle async/parallel tracing, sounds great to me.

It seems like this would be a possible sequence of PRs to support this:

  1. Refactor/Replace current callback system with proposed ontology
  2. Add some kind of centralized db/query interface for stored trace information
  3. Add additional callback integrations (MLFlow, Prometheus)

Is this a correct understanding of the planned changes?

jon-chuang commented 1 year ago

Yes, that is approximately correct. However, 3 should come first, in order to put pressure on the callback system as it is iteratively refactored so that we stay aligned with the important use cases.

logan-markewich commented 1 year ago

Hmm, while that is true, if 3 is implemented now, doesn't that mean it will have to be refactored later? Just trying to think of the most efficient approach really, as it seems like a decent amount of work overall šŸ˜… unless you don't see the overall interface for the callback handlers changing much.

jon-chuang commented 1 year ago

I have the following plan:

  1. Prometheus handler which uses the new interface
  2. Wrap the new interface in the current interface (cannot handle async)
  3. Build an e2e Prometheus + Grafana dashboard with useful metrics. Identify problems.
  4. Decide if the new interface makes sense; else go back to 1, iterating on the internal interface.
  5. If the new interface makes sense, figure out the best migration pathway towards it. Trade-offs: backward compatibility, performance, UX.
  6. Implement the new interface, migrate existing callback handlers.
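
As a very rough illustration of step 1, a handler built on prometheus_client might look something like this (the metric names and the on_task_end hook are assumptions, not an existing interface):

# Illustrative sketch only: metric names and the on_task_end hook are assumptions.
from prometheus_client import Counter, Histogram, start_http_server

TASK_DURATION = Histogram("llama_task_duration_seconds", "Task duration in seconds", ["task"])
TOKENS_USED = Counter("llama_tokens_used_total", "Tokens consumed", ["task", "llm_model"])

class PrometheusHandler:
    def on_task_end(self, session_id, run_id, task, labels, measurements):
        if "task_duration_s" in measurements:
            TASK_DURATION.labels(task=task).observe(float(measurements["task_duration_s"]))
        if "token_count" in measurements:
            TOKENS_USED.labels(task=task, llm_model=labels.get("llm_model", "unknown")).inc(
                float(measurements["token_count"])
            )

start_http_server(9090)  # expose /metrics for Prometheus to scrape
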
logan-markewich commented 1 year ago

Sure! At a high level, this sounds good to me šŸ‘šŸ‘

As always, let me know if there's any way to help! Always happy to test and review PRs of course. šŸ’Ŗ

cartermp commented 1 year ago

Have y'all looked into emitting OpenTelemetry traces for this? I'm working on defining semantic conventions for tracing data here, to be proposed to the OTel project soon: https://github.com/cartermp/semantic-conventions/blob/cartermp/ai/docs/ai/llm-spans.md

Critically, being OTel compliant allows any trace emitted by LlamaIndex to be automatically correlated with the rest of an application. That's critical for production use cases, because there's often a rather complex pipeline for RAG or assembling a dynamic prompt, not to mention that it is often spread across different services or connected to other critical services.

Metrics are a decent enough signal type for really basic information, but for any actual application observability (when LLMs are in prod) you need good tracing data that connects to the rest of the application. Otherwise it's extremely difficult to determine whether a poor user experience is directly related to an LLM call or is influenced by other complicating factors. You can't really pre-aggregate that information into metrics either, so traces are pretty much the only good option.

logan-markewich commented 1 year ago

@cartermp I did look into it initially -- but it seemed less helpful, at least for the stage we are at right now.

The traces need some external running client to consume them, from my understanding.

If you have an idea for integrating this properly with llama-index though, would love to see it in a PR ā¤ļø

cartermp commented 1 year ago

@logan-markewich yeah, that's correct. It's designed to export to another tool that stores the data and offers analysis. That's a critical workflow for production use cases - your app that generates telemetry can't also be in the business of storing it because the volume of data would get out of control.

It's maybe helpful to think about two different models:

  1. Custom tracing system that can export/transmute as opentelemetry traces
  2. OTel-focused tracing model on the inside

The benefit of the latter is that you get incredible customizability and pluggability, although it's harder to do; the downside is that if the way you need to model operations internally is difficult to map onto OTel concepts, it becomes too hard. The first option is what a lot of tools offer instead. It's usually pretty easy to turn the "final trace product" into OTLP over gRPC or HTTP/proto/json.
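
To make option 1 concrete, here's a rough sketch of replaying internal task data as OpenTelemetry spans; the exporter choice and attribute names are only illustrative:

# Rough sketch of option 1: re-emitting internal task data as OpenTelemetry spans.
# Exporter choice and attribute names are illustrative only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))  # swap for an OTLP exporter in prod
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llama_index.tracing")

with tracer.start_as_current_span("run") as run_span:
    run_span.set_attribute("session_id", 0)
    with tracer.start_as_current_span("llm_predict") as task_span:
        task_span.set_attribute("llm_model", "openai[text-davinci-003]")
        task_span.set_attribute("token_count", 1283)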

janaka commented 11 months ago

Any further thoughts on adding OTel support? For tracing it is the standard format, and it opens things up to the entire ecosystem of providers that visualize traces.

And, as @cartermp mentioned, it plugs into the application's context, whether a monolith or a distributed system.

Also worth pointing out the auto-instrumentation feature, which allows a code base to be instrumented with generic traces per dependency (e.g. SQLite or FastAPI) without changing any code. Of course, instrumenting business logic will need code changes.

dosubot[bot] commented 8 months ago

Hi, @jon-chuang,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, the issue proposes a tracing ontology for a retrieval pipeline, discussing concepts like session, run, and task, and raising questions about implementation of callbacks, equivalence or conversion of tracing and logging, and plans for prototype collection in MLFlow and Prometheus. There is also discussion about refactoring the current callback system, adding a centralized database/query interface for stored trace information, and integrating OpenTelemetry traces for better application observability.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you for your understanding and cooperation.