[Question]: How to add callback to VectorStoreIndex.from_documents

blacksmithop commented 1 month ago

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

How do I add a custom callback to VectorStoreIndex.from_documents? I wish to track the embedding progress. I understand that CBEventType offers an EMBEDDING event type, but I do not understand how to go about tracking it.

dosubot[bot] commented 1 month ago

To add a custom callback to VectorStoreIndex.from_documents to track the embedding progress using the EMBEDDING event type from CBEventType, you can follow this example:

Create a Custom Embedding Class: Implement a custom embedding class that uses the callback manager to track the embedding process.

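from typing import Any, List

from llama_index.core.base.embeddings.base import BaseEmbedding
from llama_index.core.callbacks import CallbackManager, CBEventType, EventPayload


class CustomEmbedding(BaseEmbedding):
    def __init__(self, callback_manager: CallbackManager, **kwargs: Any) -> None:
        # BaseEmbedding is a pydantic model, so fields are set via super().__init__
        super().__init__(callback_manager=callback_manager, **kwargs)

    def _get_text_embedding(self, text: str) -> List[float]:
        # Your embedding logic here
        raise NotImplementedError

    def _get_query_embedding(self, query: str) -> List[float]:
        # Required by BaseEmbedding; reuse the text embedding logic
        return self._get_text_embedding(query)

    async def _aget_query_embedding(self, query: str) -> List[float]:
        # Required by BaseEmbedding; no separate async path in this example
        return self._get_query_embedding(query)

    def get_text_embedding(self, text: str) -> List[float]:
        # Wrap the embedding call in an EMBEDDING event so handlers are notified
        with self.callback_manager.event(
            CBEventType.EMBEDDING, payload={EventPayload.SERIALIZED: self.to_dict()}
        ) as event:
            text_embedding = self._get_text_embedding(text)
            event.on_end(
                payload={
                    EventPayload.CHUNKS: [text],
                    EventPayload.EMBEDDINGS: [text_embedding],
                },
            )
        return text_embedding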

Use the Custom Embedding Class with VectorStoreIndex.from_documents: Initialize the custom embedding class and pass it to the VectorStoreIndex.from_documents method.

from llama_index.core import VectorStoreIndex

# Assuming you have a callback manager instance
callback_manager = ...

# Create an instance of your custom embedding class
custom_embedding = CustomEmbedding(callback_manager)

# Pass the custom embedding instance as the embed_model keyword argument
documents = [...]  # Your list of documents
index = VectorStoreIndex.from_documents(documents, embed_model=custom_embedding)

This setup will allow you to track the embedding progress using the EMBEDDING event type ^[1]^[2]^[3].

To continue talking to Dosu, mention @dosu.

blacksmithop commented 1 month ago

Let's say I wish to pass a reference to this class, and it should be incremented or updated based on how many documents have been embedded. Also, please let me know what other options or events exist for tracking from_documents progress. @dosu

dosubot[bot] commented 1 month ago

To pass a reference to an object that tracks how many documents have been processed, you can introduce a tracker class and override from_documents in your own index subclass so that it accepts and updates this tracker. Additionally, you can use the show_progress flag to display progress while from_documents runs.

Here's an example of how you can achieve this:

Define a Tracker Class:

class DocumentTracker:
    def __init__(self):
        self.count = 0

    def update(self, num_docs: int):
        self.count += num_docs

    def get_count(self) -> int:
        return self.count
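
As a rough illustration of how a shared tracker reference behaves, the sketch below updates one DocumentTracker instance from inside a batch loop. The embed_in_batches helper, its batch_size parameter, and the placeholder vectors are hypothetical stand-ins for real embedding calls; they are not part of LlamaIndex.

```python
from typing import List, Sequence


class DocumentTracker:
    """Same tracker as above, repeated so the example is self-contained."""

    def __init__(self) -> None:
        self.count = 0

    def update(self, num_docs: int) -> None:
        self.count += num_docs

    def get_count(self) -> int:
        return self.count


def embed_in_batches(
    texts: Sequence[str], tracker: DocumentTracker, batch_size: int = 2
) -> List[List[float]]:
    """Hypothetical helper: 'embeds' texts batch by batch and updates the tracker."""
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        # Placeholder embedding: a one-dimensional vector holding the text length
        vectors.extend([[float(len(text))] for text in batch])
        # The caller holds the same tracker object, so it sees this update
        tracker.update(len(batch))
    return vectors


tracker = DocumentTracker()
vectors = embed_in_batches(["a", "bb", "ccc"], tracker)
print(f"Processed so far: {tracker.get_count()}")
```

Because the tracker is updated after each batch rather than once at the end, another thread or a UI callback holding the same reference could poll get_count() while the loop is still running.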

Modify the from_documents Method to Accept a Tracker Instance:

@classmethod
def from_documents(
    cls: Type[IndexType],
    documents: Sequence[Document],
    show_progress: bool = False,
    callback_manager: Optional[CallbackManager] = None,
    transformations: Optional[List[TransformComponent]] = None,
    tracker: Optional[DocumentTracker] = None,
    **kwargs: Any,
) -> IndexType:
    """Build a Vectara index from a sequence of documents."""
    nodes = [
        TextNode(text=document.get_content(), metadata=document.metadata)  # type: ignore
        for document in documents
    ]

    if tracker:
        tracker.update(len(documents))

    return cls(
        nodes=nodes,
        show_progress=show_progress,
        **kwargs,
    )

Use the Tracker When Calling from_documents:

tracker = DocumentTracker()
index = VectaraIndex.from_documents(documents, tracker=tracker)
print(f"Number of documents processed: {tracker.get_count()}")

Other Options or Events for Tracking Progress

VectorStoreIndex.from_documents already accepts a show_progress flag, which displays tqdm progress bars while the index is built. If you override from_documents yourself, you can add a similar progress bar with tqdm as follows:

from tqdm import tqdm

@classmethod
def from_documents(
    cls: Type[IndexType],
    documents: Sequence[Document],
    show_progress: bool = False,
    callback_manager: Optional[CallbackManager] = None,
    transformations: Optional[List[TransformComponent]] = None,
    tracker: Optional[DocumentTracker] = None,
    **kwargs: Any,
) -> IndexType:
    """Build a Vectara index from a sequence of documents."""
    if show_progress:
        documents = tqdm(documents, desc="Embedding documents")

    nodes = [
        TextNode(text=document.get_content(), metadata=document.metadata)  # type: ignore
        for document in documents
    ]

    if tracker:
        tracker.update(len(documents))

    return cls(
        nodes=nodes,
        show_progress=show_progress,
        **kwargs,
    )

Additionally, LlamaIndex provides a robust callback system to help debug, track, and trace the inner workings of the library. You can use the CallbackManager to add multiple callbacks to track various events such as CHUNKING, NODE_PARSING, EMBEDDING, and more. Here are some of the available callback event types:

CHUNKING -> Logs for the before and after of text splitting.
NODE_PARSING -> Logs for the documents and the nodes that they are parsed into.
EMBEDDING -> Logs for the number of texts embedded.
LLM -> Logs for the template and response of LLM calls.
QUERY -> Keeps track of the start and end of each query.
RETRIEVE -> Logs for the nodes retrieved for a query.
SYNTHESIZE -> Logs for the result for synthesize calls.
TREE -> Logs for the summary and level of summaries generated.
SUB_QUESTION -> Logs for a generated sub question and answer.

You can implement your own callback to track and trace these events or use existing callbacks such as TokenCountingHandler, LlamaDebugHandler, WandbCallbackHandler, AimCallback, OpenInferenceCallbackHandler, and OpenAIFineTuningHandler ^[1]^[2]^[3].


run-llama / llama_index