run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Callback events are not enclosed within a trace #7290

Closed axiomofjoy closed 1 year ago

axiomofjoy commented 1 year ago

Bug Description

While building a callback handler and testing it with an OpenAI chat agent, I noticed that the on_event_start and on_event_end callback hooks are called without a preceding call to start_trace and without a subsequent call to end_trace. Since I implemented my flush logic inside end_trace, the accumulated trace data is never flushed at the end of the trace.

(Screenshot attached in the original issue, dated 2023-08-16.)
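
For context, the flush-on-end_trace pattern described above looks roughly like the sketch below. This is a minimal, hypothetical handler rather than the actual implementation: events are buffered in the on_event_* hooks and only exported when end_trace fires, so if end_trace never arrives, the buffer is never flushed.

from typing import Any, Dict, List, Optional

from llama_index.callbacks.base_handler import BaseCallbackHandler
from llama_index.callbacks.schema import CBEventType

class BufferingCallbackHandler(BaseCallbackHandler):
    """Hypothetical handler: buffer events during a trace, flush in end_trace."""

    def __init__(self) -> None:
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])
        self._buffer: List[Dict[str, Any]] = []

    def on_event_start(self, event_type: CBEventType, payload=None, event_id: str = "", **kwargs: Any) -> str:
        self._buffer.append({"event": event_type.value, "phase": "start", "payload": payload})
        return event_id

    def on_event_end(self, event_type: CBEventType, payload=None, event_id: str = "", **kwargs: Any) -> None:
        self._buffer.append({"event": event_type.value, "phase": "end", "payload": payload})

    def start_trace(self, trace_id: Optional[str] = None) -> None:
        self._buffer.clear()  # a new trace starts with an empty buffer

    def end_trace(self, trace_id: Optional[str] = None, trace_map: Optional[Dict[str, List[str]]] = None) -> None:
        # Stand-in for the real export step; if end_trace never fires,
        # nothing accumulated here is ever flushed.
        print(f"flushing {len(self._buffer)} events for trace {trace_id}")
        self._buffer.clear()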

Version

latest main (3506143d)

Steps to Reproduce

Run the following script:

import sys
import logging
from gcsfs import GCSFileSystem
from llama_index import ServiceContext, StorageContext, load_index_from_storage
from llama_index.callbacks import CallbackManager
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.graph_stores.simple import SimpleGraphStore
from llama_index.llms import ChatMessage, MessageRole, OpenAI
from llama_index.callbacks.base_handler import BaseCallbackHandler
from llama_index.callbacks.schema import CBEventType
from typing import Any, Dict, Optional, List

logger = logging.getLogger(__name__)

logging.basicConfig(level=logging.DEBUG, stream=sys.stdout)

class DebugCallbackHandler(BaseCallbackHandler):
    def on_event_start(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        **kwargs: Any,
    ) -> str:
        logger.info(f"on_event_start: {event_type.value}")

    def on_event_end(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        **kwargs: Any,
    ) -> None:
        logger.info(f"on_event_end: {event_type.value}")

    def start_trace(self, trace_id: Optional[str] = None) -> None:
        logger.info(f"start_trace: {trace_id}")

    def end_trace(
        self,
        trace_id: Optional[str] = None,
        trace_map: Optional[Dict[str, List[str]]] = None,
    ) -> None:
        logger.info(f"end_trace: {trace_id}")

file_system = GCSFileSystem(project="public-assets-275721")
index_path = (
    "arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/"
)
storage_context = StorageContext.from_defaults(
    fs=file_system,
    persist_dir=index_path,
    graph_store=SimpleGraphStore(),  # prevents unauthorized request to GCS
)
callback_handler = DebugCallbackHandler(
    event_ends_to_ignore=[], event_starts_to_ignore=[]
)
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),
    embed_model=OpenAIEmbedding(model="text-embedding-ada-002"),
    callback_manager=CallbackManager(handlers=[callback_handler]),
)
index = load_index_from_storage(
    storage_context,
    service_context=service_context,
)
agent = index.as_chat_engine(
    chat_mode="openai",
    callback_manager=CallbackManager(handlers=[callback_handler]),
)

response = agent.chat(
    "Can you explain what that means?",
    chat_history=[
        ChatMessage(role=MessageRole.USER, content="What is Arize?"),
        ChatMessage(
            role=MessageRole.ASSISTANT,
            content="Arize is a ML observability platform.",
        ),
    ],
)
print(response)
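
For anyone without access to the GCS bucket above, a roughly equivalent local reproduction is sketched below. The ./data directory and the prompt are placeholders, not part of the original report; it reuses the DebugCallbackHandler defined above.

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.callbacks import CallbackManager
from llama_index.llms import OpenAI

callback_manager = CallbackManager(
    handlers=[DebugCallbackHandler(event_starts_to_ignore=[], event_ends_to_ignore=[])]
)
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),
    callback_manager=callback_manager,
)

# Build a small index from local files instead of loading one from GCS.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

agent = index.as_chat_engine(chat_mode="openai", callback_manager=callback_manager)
print(agent.chat("Summarize the documents."))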

Relevant Logs/Tracebacks

DEBUG:asyncio:Using selector: KqueueSelector
DEBUG:google.auth._default:Checking None for explicit credentials as part of auth process...
DEBUG:google.auth._default:Checking Cloud SDK credentials as part of auth process...
DEBUG:gcsfs.credentials:Connection with method "google_default" failed
Traceback (most recent call last):
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/gcsfs/credentials.py", line 232, in connect
    self.connect(method=meth)
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/gcsfs/credentials.py", line 249, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/gcsfs/credentials.py", line 88, in _connect_google_default
    raise ValueError(msg.format(self.project, project))
ValueError: User-provided project 'public-assets-275721' does not match the google default project 'primal-oxide-268801'. Either

  1. Accept the google-default project by not passing a `project` to GCSFileSystem
  2. Configure the default project to match the user-provided project (gcloud config set project)
  3. Use an authorization method other than 'google_default' by providing 'token=...'

DEBUG:gcsfs.credentials:Connection with method "cache" failed
Traceback (most recent call last):
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/gcsfs/credentials.py", line 232, in connect
    self.connect(method=meth)
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/gcsfs/credentials.py", line 249, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/gcsfs/credentials.py", line 104, in _connect_cache
    raise ValueError("No cached tokens")
ValueError: No cached tokens
DEBUG:google.auth.transport.requests:Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): metadata.google.internal:80
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x172985270>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
DEBUG:google.auth.transport.requests:Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
DEBUG:urllib3.connectionpool:Starting new HTTP connection (2): metadata.google.internal:80
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x172985570>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
DEBUG:google.auth.transport.requests:Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
DEBUG:urllib3.connectionpool:Starting new HTTP connection (3): metadata.google.internal:80
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x172985930>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
DEBUG:google.auth.transport.requests:Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
DEBUG:urllib3.connectionpool:Starting new HTTP connection (4): metadata.google.internal:80
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 4 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x172985e70>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
DEBUG:google.auth.transport.requests:Making request: GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true
DEBUG:urllib3.connectionpool:Starting new HTTP connection (5): metadata.google.internal:80
WARNING:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x1729863b0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
DEBUG:gcsfs.credentials:Connection with method "cloud" failed
Traceback (most recent call last):
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 115, in refresh
    self._retrieve_info(request)
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 88, in _retrieve_info
    info = _metadata.get_service_account_info(
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/google/auth/compute_engine/_metadata.py", line 287, in get_service_account_info
    return get(request, path, params={"recursive": "true"})
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/google/auth/compute_engine/_metadata.py", line 216, in get
    raise exceptions.TransportError(
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/gcsfs/credentials.py", line 97, in _connect_cloud
    self.credentials.refresh(req)
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/google/auth/compute_engine/credentials.py", line 121, in refresh
    six.raise_from(new_exc, caught_exc)
  File "<string>", line 3, in raise_from
google.auth.exceptions.RefreshError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/gcsfs/credentials.py", line 232, in connect
    self.connect(method=meth)
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/gcsfs/credentials.py", line 249, in connect
    self.__getattribute__("_connect_" + method)()
  File "/Users/xandersong/miniforge3/envs/llmapps/lib/python3.10/site-packages/gcsfs/credentials.py", line 99, in _connect_cloud
    raise ValueError("Invalid gcloud credentials") from error
ValueError: Invalid gcloud credentials
DEBUG:gcsfs.credentials:Connected with method anon
DEBUG:llama_index.storage.kvstore.simple_kvstore:Loading llama_index.storage.kvstore.simple_kvstore from arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/docstore.json.
DEBUG:gcsfs:GET: b/{}/o/{}, ('arize-assets', 'phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/docstore.json'), None
DEBUG:fsspec:<File-like object GCSFileSystem, arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/docstore.json> read: 0 - 1678221
DEBUG:gcsfs:GET: https://storage.googleapis.com/download/storage/v1/b/arize-assets/o/phoenix%2Fdatasets%2Funstructured%2Fllm%2Fllama-index%2Farize-docs%2Findex%2Fdocstore.json?alt=media, (), {'Range': 'bytes=0-1678220'}
DEBUG:llama_index.storage.kvstore.simple_kvstore:Loading llama_index.storage.kvstore.simple_kvstore from arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/index_store.json.
DEBUG:gcsfs:GET: b/{}/o/{}, ('arize-assets', 'phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/index_store.json'), None
DEBUG:fsspec:<File-like object GCSFileSystem, arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/index_store.json> read: 0 - 107263
DEBUG:gcsfs:GET: https://storage.googleapis.com/download/storage/v1/b/arize-assets/o/phoenix%2Fdatasets%2Funstructured%2Fllm%2Fllama-index%2Farize-docs%2Findex%2Findex_store.json?alt=media, (), {'Range': 'bytes=0-107262'}
DEBUG:gcsfs:GET: b/{}/o/{}, ('arize-assets', 'phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/vector_store.json'), None
DEBUG:llama_index.vector_stores.simple:Loading llama_index.vector_stores.simple from arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/vector_store.json.
DEBUG:gcsfs:GET: b/{}/o/{}, ('arize-assets', 'phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/vector_store.json'), None
DEBUG:fsspec:<File-like object GCSFileSystem, arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/vector_store.json> read: 0 - 43966360
DEBUG:gcsfs:GET: https://storage.googleapis.com/download/storage/v1/b/arize-assets/o/phoenix%2Fdatasets%2Funstructured%2Fllm%2Fllama-index%2Farize-docs%2Findex%2Fvector_store.json?alt=media, (), {'Range': 'bytes=0-43966359'}
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/xandersong/Library/Caches/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.
INFO:llama_index.indices.loading:Loading all indices.
INFO:__main__:start_trace: index_construction
INFO:__main__:end_trace: index_construction
INFO:__main__:on_event_start: llm
DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/chat/completions
DEBUG:openai:api_version=None data='{"messages": [{"role": "user", "content": "What is Arize?"}, {"role": "assistant", "content": "Arize is a ML observability platform."}, {"role": "user", "content": "Can you explain what that means?"}], "stream": false, "model": "gpt-3.5-turbo", "temperature": 0, "max_tokens": null, "functions": [{"name": "query_engine_tool", "description": "Useful for running a natural language query\\nagainst a knowledge base and get back a natural language response.\\n", "parameters": {"title": "DefaultToolFnSchema", "description": "Default tool function Schema.", "type": "object", "properties": {"input": {"title": "Input", "type": "string"}}, "required": ["input"]}}], "function_call": "auto"}' message='Post details'
DEBUG:urllib3.util.retry:Converted retries value: 2 -> Retry(total=2, connect=None, read=None, redirect=None, status=None)
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.openai.com:443
DEBUG:urllib3.connectionpool:https://api.openai.com:443 "POST /v1/chat/completions HTTP/1.1" 200 None
DEBUG:openai:message='OpenAI API response' path=https://api.openai.com/v1/chat/completions processing_ms=4666 request_id=5136dd250bd363ef02e6331851a3eaeb response_code=200
INFO:__main__:on_event_end: llm
Certainly! ML observability refers to the ability to understand and monitor the behavior and performance of machine learning models in production. It involves tracking and analyzing various metrics, such as model accuracy, latency, and resource utilization, to ensure that the models are performing as expected and to identify any issues or anomalies.

Arize is a platform that provides ML observability tools and capabilities. It helps data scientists, machine learning engineers, and AI practitioners gain insights into their models' performance and behavior in real-world scenarios. Arize allows users to monitor and analyze key metrics, visualize model performance, and detect any drift or degradation in model performance over time.

By using Arize, organizations can ensure that their machine learning models are delivering accurate and reliable results, identify and troubleshoot any issues that may arise, and ultimately improve the overall performance and effectiveness of their ML systems.
dosubot[bot] commented 1 year ago

🤖

Hello,

From your description, it seems like you're expecting the start_trace and end_trace methods to be called for each event. However, in the LlamaIndex codebase, these methods are not designed to be called for each event. Instead, they are used to start and end a trace, which is a collection of events.

The start_trace method is called when a trace is launched, and the end_trace method is called when a trace is exited. These methods are not tied to individual events, but rather to a collection of events that make up a trace. If you've implemented your flush logic inside of end_trace, it would only be called when a trace is ended, not when an individual event ends.

The on_event_start and on_event_end methods are the ones that are called for each event. If you want to perform some action at the start or end of each event, you should implement that logic inside these methods.

As for the ValueError related to mismatched project names in GCSFileSystem and issues with the Google Compute Engine metadata service, it would be helpful if you could provide more details about the error message and the context in which it occurs. This would help in diagnosing the issue and providing a solution.

I hope this clarifies the usage of start_trace, end_trace, on_event_start, and on_event_end in the LlamaIndex codebase. If you have any more questions, feel free to ask.

Best, Dosu


logan-markewich commented 1 year ago

@axiomofjoy so the reason is that the chat engines/agents don't currently open a trace around their calls, so only the individual callback events fire.

I guess the solution here is to add trace events to the chat engines/agents

I can try and tackle that tomorrow morning 👍🏻 Shouldn't be too bad
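
Until that lands, a possible interim workaround is to open the trace explicitly around the chat call. This is a sketch that reuses callback_handler and index from the repro script above, and assumes CallbackManager.as_trace drives start_trace/end_trace on the registered handlers here the same way it does on the query path:

callback_manager = CallbackManager(handlers=[callback_handler])
agent = index.as_chat_engine(chat_mode="openai", callback_manager=callback_manager)

# Explicitly delimit the trace so the handler's start_trace/end_trace fire
# (and any end_trace flush logic runs) around the chat call.
with callback_manager.as_trace("chat"):
    response = agent.chat("Can you explain what that means?")
print(response)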

axiomofjoy commented 1 year ago

Sounds good, thanks @logan-markewich!

sourabhdesai commented 1 year ago

@axiomofjoy created a PR to tackle this!

https://github.com/jerryjliu/llama_index/pull/7304

axiomofjoy commented 1 year ago

@sourabhdesai Nice, excited to try it out!