Open blacksmithop opened 1 month ago
To add a custom callback to `VectorStoreIndex.from_documents` to track the embedding progress using the `EMBEDDING` event type from `CBEventType`, you can follow this example:
```python
from llama_index.legacy.callbacks.schema import CBEventType, EventPayload
from llama_index.legacy.core.embeddings.base import BaseEmbedding


class CustomEmbedding(BaseEmbedding):
    def __init__(self, callback_manager):
        self.callback_manager = callback_manager

    def _get_text_embedding(self, text):
        # Your embedding logic here
        pass

    def get_text_embedding(self, text):
        with self.callback_manager.event(
            CBEventType.EMBEDDING, payload={EventPayload.SERIALIZED: self.to_dict()}
        ) as event:
            text_embedding = self._get_text_embedding(text)
            event.on_end(
                payload={
                    EventPayload.CHUNKS: [text],
                    EventPayload.EMBEDDINGS: [text_embedding],
                },
            )
        return text_embedding
```
Initialize the custom embedding class and pass it to the `VectorStoreIndex.from_documents` method:

```python
from llama_index import VectorStoreIndex

# Assuming you have a callback manager instance
callback_manager = ...

# Create an instance of your custom embedding class
custom_embedding = CustomEmbedding(callback_manager)

# Use the custom embedding instance with VectorStoreIndex.from_documents
documents = [...]  # Your list of documents
index = VectorStoreIndex.from_documents(documents, embedding=custom_embedding)
```
This setup will allow you to track the embedding progress using the `EMBEDDING` event type [1][2][3].
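The event-context pattern above can be sketched without LlamaIndex at all. The following is a minimal, self-contained illustration of the idea; `SimpleCallbackManager` and `Event` are stand-in names for this sketch, not LlamaIndex classes:

```python
# Minimal sketch of the callback pattern above. SimpleCallbackManager and
# Event are illustrative stand-ins, not LlamaIndex APIs.
from contextlib import contextmanager


class Event:
    def __init__(self, event_type):
        self.event_type = event_type
        self.end_payload = None

    def on_end(self, payload=None):
        self.end_payload = payload


class SimpleCallbackManager:
    """Counts how many events of each type have finished."""

    def __init__(self):
        self.counts = {}

    @contextmanager
    def event(self, event_type, payload=None):
        ev = Event(event_type)
        try:
            yield ev
        finally:
            # Tally the event once its context exits
            self.counts[event_type] = self.counts.get(event_type, 0) + 1


def get_text_embedding(manager, text):
    # Mirrors CustomEmbedding.get_text_embedding: wrap the real work in an event.
    with manager.event("EMBEDDING") as event:
        embedding = [float(len(text))]  # placeholder for real embedding logic
        event.on_end(payload={"chunks": [text], "embeddings": [embedding]})
    return embedding


manager = SimpleCallbackManager()
get_text_embedding(manager, "hello")
get_text_embedding(manager, "world")
print(manager.counts["EMBEDDING"])  # 2
```

The real `CallbackManager.event` works the same way: the handler sees an event start when the `with` block is entered and an end payload when `on_end` fires, which is what makes per-embedding progress observable.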
To continue talking to Dosu, mention @dosu.
Let's say I wish to pass a reference to a class, and it should be incremented / updated based on how many documents have been embedded.
Also, please let me know about other options / events for tracking `from_documents` progress.
@dosu
To pass a reference to a class that tracks and updates the number of documents embedded, you can introduce a tracker class and update the `from_documents` method to accept and use this tracker. Additionally, you can use the `show_progress` flag to track the progress of `from_documents`.
Here's an example of how you can achieve this. First, define a tracker class:

```python
class DocumentTracker:
    def __init__(self):
        self.count = 0

    def update(self, num_docs: int):
        self.count += num_docs

    def get_count(self) -> int:
        return self.count
```
Update the `from_documents` method to accept a tracker instance:

```python
@classmethod
def from_documents(
    cls: Type[IndexType],
    documents: Sequence[Document],
    show_progress: bool = False,
    callback_manager: Optional[CallbackManager] = None,
    transformations: Optional[List[TransformComponent]] = None,
    tracker: Optional[DocumentTracker] = None,
    **kwargs: Any,
) -> IndexType:
    """Build a Vectara index from a sequence of documents."""
    nodes = [
        TextNode(text=document.get_content(), metadata=document.metadata)  # type: ignore
        for document in documents
    ]
    if tracker:
        tracker.update(len(documents))
    return cls(
        nodes=nodes,
        show_progress=show_progress,
        **kwargs,
    )
```
Pass the tracker to `from_documents`:

```python
tracker = DocumentTracker()
index = VectaraIndex.from_documents(documents, tracker=tracker)
print(f"Number of documents embedded: {tracker.get_count()}")
```
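If embedding runs in worker threads (for example with parallel inserts), the plain counter above can lose updates. A lock-guarded variant is a small, self-contained hardening of the same idea; `ThreadSafeDocumentTracker` is an illustrative name, not a LlamaIndex class:

```python
# Sketch: a thread-safe variant of DocumentTracker, useful if updates may
# arrive from multiple threads. Illustrative only, not a LlamaIndex API.
import threading


class ThreadSafeDocumentTracker:
    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def update(self, num_docs: int) -> None:
        # Guard the read-modify-write so concurrent updates don't interleave
        with self._lock:
            self._count += num_docs

    def get_count(self) -> int:
        with self._lock:
            return self._count


tracker = ThreadSafeDocumentTracker()
threads = [threading.Thread(target=tracker.update, args=(1,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(tracker.get_count())  # 8
```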
The `show_progress` flag can be used to show progress bars using libraries like `tqdm`. You can integrate this into the `from_documents` method as follows:
```python
from tqdm import tqdm


@classmethod
def from_documents(
    cls: Type[IndexType],
    documents: Sequence[Document],
    show_progress: bool = False,
    callback_manager: Optional[CallbackManager] = None,
    transformations: Optional[List[TransformComponent]] = None,
    tracker: Optional[DocumentTracker] = None,
    **kwargs: Any,
) -> IndexType:
    """Build a Vectara index from a sequence of documents."""
    if show_progress:
        documents = tqdm(documents, desc="Embedding documents")
    nodes = [
        TextNode(text=document.get_content(), metadata=document.metadata)  # type: ignore
        for document in documents
    ]
    if tracker:
        tracker.update(len(documents))
    return cls(
        nodes=nodes,
        show_progress=show_progress,
        **kwargs,
    )
```
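Note that the version above only updates the tracker once, after all documents are processed. If you want the count to reflect in-flight progress, you can update the tracker as each document is consumed. A self-contained sketch (the `track_progress` helper is hypothetical, and `DocumentTracker` is repeated here so the example runs on its own):

```python
# Sketch: bump the tracker per document rather than once at the end, so the
# count reflects in-flight progress. track_progress is an illustrative helper,
# not a LlamaIndex API; DocumentTracker is repeated for self-containment.
class DocumentTracker:
    def __init__(self):
        self.count = 0

    def update(self, num_docs: int):
        self.count += num_docs

    def get_count(self) -> int:
        return self.count


def track_progress(documents, tracker):
    """Yield each document, updating the tracker as each one is consumed."""
    for doc in documents:
        yield doc
        tracker.update(1)


tracker = DocumentTracker()
documents = ["doc-a", "doc-b", "doc-c"]
# Stands in for the node-building loop inside from_documents
nodes = [doc.upper() for doc in track_progress(documents, tracker)]
print(tracker.get_count())  # 3
```

Inside `from_documents`, wrapping `documents` with such a generator (in place of, or alongside, the `tqdm` wrapper) would let another thread poll `tracker.get_count()` mid-build.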
Additionally, LlamaIndex provides a robust callback system to help debug, track, and trace the inner workings of the library. You can use the `CallbackManager` to add multiple callbacks to track various events such as `CHUNKING`, `NODE_PARSING`, `EMBEDDING`, and more. Here are some of the available callback event types:
- `CHUNKING` -> Logs for the before and after of text splitting.
- `NODE_PARSING` -> Logs for the documents and the nodes that they are parsed into.
- `EMBEDDING` -> Logs for the number of texts embedded.
- `LLM` -> Logs for the template and response of LLM calls.
- `QUERY` -> Keeps track of the start and end of each query.
- `RETRIEVE` -> Logs for the nodes retrieved for a query.
- `SYNTHESIZE` -> Logs for the result of synthesize calls.
- `TREE` -> Logs for the summary and level of summaries generated.
- `SUB_QUESTION` -> Logs for a generated sub question and answer.

You can implement your own callback to track and trace these events, or use existing callbacks such as `TokenCountingHandler`, `LlamaDebugHandler`, `WandbCallbackHandler`, `AimCallback`, `OpenInferenceCallbackHandler`, and `OpenAIFineTuningHandler` [1][2][3].
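The shape of a do-it-yourself handler can be sketched without the library: a handler receives start/end notifications per event type and tallies them. `EventCounter` below is an illustrative class, not one of the built-in handlers listed above:

```python
# Sketch of a custom handler in the spirit of the callback system described
# above: it tallies event starts and ends by type. EventCounter is an
# illustrative stand-in, not a LlamaIndex built-in.
from collections import Counter


class EventCounter:
    def __init__(self):
        self.starts = Counter()
        self.ends = Counter()

    def on_event_start(self, event_type, payload=None):
        self.starts[event_type] += 1

    def on_event_end(self, event_type, payload=None):
        self.ends[event_type] += 1


counter = EventCounter()
# Simulate a small run: one chunking pass, two embeddings, one LLM call
for event_type in ["CHUNKING", "EMBEDDING", "EMBEDDING", "LLM"]:
    counter.on_event_start(event_type)
    counter.on_event_end(event_type)
print(counter.starts["EMBEDDING"])  # 2
```

A real handler registered on the `CallbackManager` would be driven by the library instead of the simulated loop, letting you watch `EMBEDDING` counts grow while `from_documents` runs.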
Question Validation

Question

How do I add a custom callback to `VectorStoreIndex.from_documents`? I wish to track the embedding progress. I understand that `CBEventType` offers an `EMBEDDING` event type, but I do not understand how I can go about tracking it.