run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: How to add callback to VectorStoreIndex.from_documents #15088

Open blacksmithop opened 1 month ago

blacksmithop commented 1 month ago

Question

How do I add a custom callback to VectorStoreIndex.from_documents? I wish to track the embedding progress. I understand that CBEventType offers an EMBEDDING event type, but I do not understand how to track it.

dosubot[bot] commented 1 month ago

To add a custom callback to VectorStoreIndex.from_documents to track the embedding progress using the EMBEDDING event type from CBEventType, you can follow this example:

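  1. Create a Custom Embedding Class: Implement a custom embedding class that uses the callback manager to track the embedding process.
from llama_index.core.callbacks.schema import CBEventType, EventPayload
from llama_index.core.embeddings import BaseEmbedding

class CustomEmbedding(BaseEmbedding):
    def __init__(self, callback_manager, **kwargs):
        # BaseEmbedding is a pydantic model, so fields are set through super().__init__
        super().__init__(callback_manager=callback_manager, **kwargs)

    def _get_text_embedding(self, text):
        # Your embedding logic here
        raise NotImplementedError

    def _get_query_embedding(self, query):
        # Reuse the text embedding logic for queries
        return self._get_text_embedding(query)

    async def _aget_query_embedding(self, query):
        return self._get_query_embedding(query)

    def get_text_embedding(self, text):
        # Wrap the embedding call in an EMBEDDING event so handlers see its start and end
        with self.callback_manager.event(
            CBEventType.EMBEDDING, payload={EventPayload.SERIALIZED: self.to_dict()}
        ) as event:
            text_embedding = self._get_text_embedding(text)
            event.on_end(
                payload={
                    EventPayload.CHUNKS: [text],
                    EventPayload.EMBEDDINGS: [text_embedding],
                },
            )
        return text_embedding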
  2. Use the Custom Embedding Class with VectorStoreIndex.from_documents: Initialize the custom embedding class and pass it to the VectorStoreIndex.from_documents method.
from llama_index.core import VectorStoreIndex

# Assuming you have a callback manager instance
callback_manager = ...

# Create an instance of your custom embedding class
custom_embedding = CustomEmbedding(callback_manager)

# Use the custom embedding instance with VectorStoreIndex.from_documents
documents = [...]  # Your list of documents
index = VectorStoreIndex.from_documents(documents, embed_model=custom_embedding)

This setup will allow you to track the embedding progress using the EMBEDDING event type [1][2][3].
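
Another route, sketched here under stated assumptions, is to leave the embedding model alone and count chunks inside a callback handler, since the embedding base class appears to emit EMBEDDING events on its own. The sketch is plain Python so that it runs without LlamaIndex installed: `EMBEDDING` and `CHUNKS` are local stand-ins for `CBEventType.EMBEDDING` and `EventPayload.CHUNKS`, `EmbeddingProgressHandler` is a hypothetical name, and the event data is made up. In real use the class would presumably subclass LlamaIndex's `BaseCallbackHandler`, whose method names it mirrors, and be registered through a `CallbackManager`.

```python
from typing import Any, Dict, Optional

# Local stand-ins (assumptions) for CBEventType.EMBEDDING and EventPayload.CHUNKS.
EMBEDDING = "embedding"
CHUNKS = "chunks"


class EmbeddingProgressHandler:
    """Hypothetical handler that counts chunks reported by embedding events."""

    def __init__(self) -> None:
        self.embedded_chunks = 0

    def on_event_start(
        self,
        event_type: str,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        parent_id: str = "",
        **kwargs: Any,
    ) -> str:
        # Nothing to record when an event starts; hand the id back unchanged.
        return event_id

    def on_event_end(
        self,
        event_type: str,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        **kwargs: Any,
    ) -> None:
        # Only embedding events carry the chunks we want to count.
        if event_type == EMBEDDING and payload:
            self.embedded_chunks += len(payload.get(CHUNKS, []))

    def start_trace(self, trace_id: Optional[str] = None) -> None:
        pass

    def end_trace(self, trace_id: Optional[str] = None, trace_map: Any = None) -> None:
        pass


# Simulate the end events an indexing run might emit (made-up data).
handler = EmbeddingProgressHandler()
handler.on_event_end("chunking", payload={CHUNKS: ["a", "b", "c"]})  # ignored
handler.on_event_end(EMBEDDING, payload={CHUNKS: ["a", "b"]})
handler.on_event_end(EMBEDDING, payload={CHUNKS: ["c"]})
print(handler.embedded_chunks)
```

Keeping the counter on the handler means the caller can hold a reference to it and read `embedded_chunks` at any point during indexing.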

blacksmithop commented 1 month ago

Let's say I wish to pass a reference to this class, and it should be incremented or updated based on how many documents have been embedded. Also, please let me know about other options or events for tracking from_documents progress. @dosu

dosubot[bot] commented 1 month ago

To pass a reference to a class that tracks and updates the number of documents embedded, you can introduce a tracker class and update the from_documents method to accept and use this tracker. Additionally, you can use the show_progress flag to track the progress of from_documents.

Here's an example of how you can achieve this:

  1. Define a Tracker Class:
class DocumentTracker:
    def __init__(self):
        self.count = 0

    def update(self, num_docs: int):
        self.count += num_docs

    def get_count(self) -> int:
        return self.count
  2. Modify the from_documents Method to Accept a Tracker Instance:
@classmethod
def from_documents(
    cls: Type[IndexType],
    documents: Sequence[Document],
    show_progress: bool = False,
    callback_manager: Optional[CallbackManager] = None,
    transformations: Optional[List[TransformComponent]] = None,
    tracker: Optional[DocumentTracker] = None,
    **kwargs: Any,
) -> IndexType:
    """Build a Vectara index from a sequence of documents."""
    nodes = [
        TextNode(text=document.get_content(), metadata=document.metadata)  # type: ignore
        for document in documents
    ]

    if tracker:
        tracker.update(len(documents))

    return cls(
        nodes=nodes,
        show_progress=show_progress,
        **kwargs,
    )
  3. Use the Tracker When Calling from_documents:
tracker = DocumentTracker()
index = VectaraIndex.from_documents(documents, tracker=tracker)
print(f"Number of documents embedded: {tracker.get_count()}")
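
Note that the tracker in the steps above is updated once with len(documents) before any embedding runs, so it reports the input size rather than progress. A minimal sketch of per-batch updates follows, in plain Python. The names `embed_in_batches` and `fake_embed` are hypothetical, the `total` and `fraction_done` additions are not part of the original tracker, and the embedding function is a placeholder rather than a real model.

```python
from typing import Callable, List, Sequence


class DocumentTracker:
    """Tracker shared by reference; updated as batches finish embedding."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.count = 0

    def update(self, num_docs: int) -> None:
        self.count += num_docs

    def fraction_done(self) -> float:
        # Treat an empty job as complete to avoid dividing by zero.
        return self.count / self.total if self.total else 1.0


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    tracker: DocumentTracker,
    batch_size: int = 2,
) -> List[List[float]]:
    """Embed texts batch by batch, updating the tracker after each batch."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_batch(batch))
        tracker.update(len(batch))  # progress moves only after real work
    return embeddings


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    # Placeholder "model": one-dimensional vector holding the text length.
    return [[float(len(text))] for text in batch]


texts = ["alpha", "beta", "gamma", "delta", "epsilon"]
tracker = DocumentTracker(total=len(texts))
vectors = embed_in_batches(texts, fake_embed, tracker)
print(tracker.count, tracker.fraction_done())
```

Because the caller owns the tracker instance, another thread or a UI loop could poll `fraction_done()` while embedding is still running.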

Other Options or Events for Tracking Progress

The show_progress flag can be used to show progress bars using libraries like tqdm. You can integrate this into the from_documents method as follows:

from tqdm import tqdm

@classmethod
def from_documents(
    cls: Type[IndexType],
    documents: Sequence[Document],
    show_progress: bool = False,
    callback_manager: Optional[CallbackManager] = None,
    transformations: Optional[List[TransformComponent]] = None,
    tracker: Optional[DocumentTracker] = None,
    **kwargs: Any,
) -> IndexType:
    """Build a Vectara index from a sequence of documents."""
    if show_progress:
        documents = tqdm(documents, desc="Embedding documents")

    nodes = [
        TextNode(text=document.get_content(), metadata=document.metadata)  # type: ignore
        for document in documents
    ]

    if tracker:
        tracker.update(len(documents))

    return cls(
        nodes=nodes,
        show_progress=show_progress,
        **kwargs,
    )

Additionally, LlamaIndex provides a callback system to help debug, track, and trace the inner workings of the library. You can use the CallbackManager to register multiple callbacks that track events such as CHUNKING, NODE_PARSING, and EMBEDDING.

You can implement your own callback to track and trace these events or use existing callbacks such as TokenCountingHandler, LlamaDebugHandler, WandbCallbackHandler, AimCallback, OpenInferenceCallbackHandler, and OpenAIFineTuningHandler [1][2][3].
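
To illustrate the idea of one manager fanning events out to several callbacks, here is a toy sketch in plain Python. `ToyCallbackManager`, `EventTally`, and `EventLog` are hypothetical stand-ins, not LlamaIndex classes, and the event names are plain strings rather than CBEventType members.

```python
from collections import Counter
from contextlib import contextmanager
from typing import Any, Iterator, List, Optional


class EventTally:
    """Counts how many events of each type have finished."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_event_end(self, event_type: str, payload: Optional[Any] = None) -> None:
        self.counts[event_type] += 1


class EventLog:
    """Records the order in which events finished."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def on_event_end(self, event_type: str, payload: Optional[Any] = None) -> None:
        self.events.append(event_type)


class ToyCallbackManager:
    """Toy stand-in for a callback manager: notifies every handler."""

    def __init__(self, handlers: List[Any]) -> None:
        self.handlers = list(handlers)

    @contextmanager
    def event(self, event_type: str, payload: Optional[Any] = None) -> Iterator[None]:
        yield  # the wrapped work runs here
        for handler in self.handlers:
            handler.on_event_end(event_type, payload)


tally, log = EventTally(), EventLog()
manager = ToyCallbackManager([tally, log])

with manager.event("node_parsing"):
    pass  # parsing work would go here
for _ in range(3):
    with manager.event("embedding"):
        pass  # one embedding batch would go here

print(dict(tally.counts), log.events)
```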