Add callback support for embeddings

axiomofjoy commented 1 year ago

Feature request

Add embedding support to the callback system. Here is one approach I have in mind.

[ ] Add on_embedding_start method on CallbackManagerMixin in libs/langchain/langchain/callbacks/base.py.
[ ] Implement EmbeddingManagerMixin with on_embedding_end and on_embedding_error methods in libs/langchain/langchain/callbacks/base.py.
[ ] Add embedding callback hook to Embeddings abstract base class in libs/langchain/langchain/embeddings/base.py.
[ ] Tweak concrete embeddings implementations in libs/langchain/langchain/embeddings as necessary.

One minimally invasive approach would be:

Implement concrete embed_documents, embed_query, aembed_documents, and aembed_query methods on the abstract Embeddings base class that contain the embeddings callback hook. Add abstract _embed_documents and _embed_query methods, plus unimplemented _aembed_documents and _aembed_query methods, to the base class.
Rename existing concrete implementations of embed_documents, embed_query, aembed_documents, and aembed_query to _embed_documents, _embed_query, _aembed_documents, and _aembed_query.
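To make the proposal concrete, here is a rough, dependency-free sketch of the pattern described above. It does not import LangChain. Embeddings here is a simplified stand-in for the real base class, and EmbeddingCallbackHandler, CollectingHandler, FakeEmbeddings, the callbacks constructor argument, and the _run_with_callbacks helper are all hypothetical names used only for illustration. The handler method signatures are assumptions too, and the async variants are omitted for brevity.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Sequence


class EmbeddingCallbackHandler:
    """Hypothetical handler interface (not an existing LangChain class)."""

    def on_embedding_start(self, texts: List[str]) -> None:
        pass

    def on_embedding_end(self, texts: List[str], embeddings: List[List[float]]) -> None:
        pass

    def on_embedding_error(self, error: BaseException) -> None:
        pass


class Embeddings(ABC):
    """Simplified stand-in for the abstract base class.

    The public methods are concrete and contain the callback hook;
    subclasses only implement the underscore-prefixed methods.
    """

    def __init__(self, callbacks: Sequence[EmbeddingCallbackHandler] = ()) -> None:
        self.callbacks = list(callbacks)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return self._run_with_callbacks(texts, self._embed_documents)

    def embed_query(self, text: str) -> List[float]:
        # Wrap the single query in a list so both paths share one hook.
        return self._run_with_callbacks(
            [text], lambda ts: [self._embed_query(ts[0])]
        )[0]

    def _run_with_callbacks(
        self,
        texts: List[str],
        embed_fn: Callable[[List[str]], List[List[float]]],
    ) -> List[List[float]]:
        for handler in self.callbacks:
            handler.on_embedding_start(texts)
        try:
            embeddings = embed_fn(texts)
        except Exception as error:
            for handler in self.callbacks:
                handler.on_embedding_error(error)
            raise
        for handler in self.callbacks:
            handler.on_embedding_end(texts, embeddings)
        return embeddings

    @abstractmethod
    def _embed_documents(self, texts: List[str]) -> List[List[float]]:
        ...

    @abstractmethod
    def _embed_query(self, text: str) -> List[float]:
        ...


class FakeEmbeddings(Embeddings):
    """Toy implementation for the demo: embeds each text as [len(text)]."""

    def _embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [[float(len(t))] for t in texts]

    def _embed_query(self, text: str) -> List[float]:
        return [float(len(text))]


class CollectingHandler(EmbeddingCallbackHandler):
    """Stores every (text, embedding) pair it sees."""

    def __init__(self) -> None:
        self.text_to_embedding: Dict[str, List[float]] = {}

    def on_embedding_end(self, texts: List[str], embeddings: List[List[float]]) -> None:
        self.text_to_embedding.update(zip(texts, embeddings))


handler = CollectingHandler()
model = FakeEmbeddings(callbacks=[handler])
model.embed_query("hello")
model.embed_documents(["a", "bc"])
```

With this shape, existing implementations would only need the rename to the underscore-prefixed methods, and a handler receives the texts together with their embeddings without any subclassing. A real hook would presumably also need to tell queries apart from documents and pass run metadata, which this sketch leaves out.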

Motivation

Embeddings are useful for LLM application monitoring and debugging. I want to build a callback handler that enables LangChain users to visualize their data in Phoenix, an open-source tool that provides debugging workflows for retrieval-augmented generation. At the moment, it is not possible to get the query embeddings out of LangChain's callback system, for example, when using the RetrievalQA chain. Here is an example notebook where I subclass OpenAIEmbeddings to capture the embedding data:

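from typing import Dict, List, Optional

import numpy as np
import pandas as pd
from langchain.embeddings import OpenAIEmbeddings


class OpenAIEmbeddingsWrapper(OpenAIEmbeddings):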
    """
    A wrapper around OpenAIEmbeddings that stores the query and document
    embeddings.
    """

    query_text_to_embedding: Dict[str, List[float]] = {}
    document_text_to_embedding: Dict[str, List[float]] = {}

    def embed_query(self, text: str) -> List[float]:
        embedding = super().embed_query(text)
        self.query_text_to_embedding[text] = embedding
        return embedding

    def embed_documents(self, texts: List[str], chunk_size: Optional[int] = 0) -> List[List[float]]:
        embeddings = super().embed_documents(texts, chunk_size)
        for text, embedding in zip(texts, embeddings):
            self.document_text_to_embedding[text] = embedding
        return embeddings

    @property
    def query_embedding_dataframe(self) -> pd.DataFrame:
        return self._convert_text_to_embedding_map_to_dataframe(self.query_text_to_embedding)

    @property
    def document_embedding_dataframe(self) -> pd.DataFrame:
        return self._convert_text_to_embedding_map_to_dataframe(self.document_text_to_embedding)

    @staticmethod
    def _convert_text_to_embedding_map_to_dataframe(
        text_to_embedding: Dict[str, List[float]]
    ) -> pd.DataFrame:
        texts, embeddings = map(list, zip(*text_to_embedding.items()))
        embedding_arrays = [np.array(embedding) for embedding in embeddings]
        return pd.DataFrame.from_dict(
            {
                "text": texts,
                "text_vector": embedding_arrays,
            }
        )

I would like the LangChain callback system to support this use-case.

This feature has been requested for TypeScript and has an open PR. An additional motivation is to maintain parity with the TypeScript library.

Your contribution

I am willing to implement, test, and document this feature with guidance from the LangChain team. I am also happy to provide feedback on an implementation by the LangChain team by building an example callback handler using the embeddings hook.

ppramesi commented 1 year ago

If I'm not mistaken, there's already an open PR for embedding callbacks: #7920

dosubot[bot] commented 10 months ago

Hi, @axiomofjoy! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you requested the addition of callback support for embeddings in the LangChain library. You proposed a specific approach for implementation and mentioned that you are willing to contribute to the process. Another user, @ppramesi, mentioned that there is already an open pull request (#7920) for embedding callback. It seems like you reacted positively to this comment.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!
