langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Add callback support for embeddings #8564

Closed axiomofjoy closed 10 months ago

axiomofjoy commented 1 year ago

Feature request

Add embedding support to the callback system. Here is one approach I have in mind.

One minimally invasive approach would be:

Motivation

Embeddings are useful for LLM application monitoring and debugging. I want to build a callback handler that enables LangChain users to visualize their data in Phoenix, an open-source tool that provides debugging workflows for retrieval-augmented generation. At the moment, LangChain's callback system does not expose the query embeddings, for example when using the RetrievalQA chain. As a workaround, in an example notebook I subclass OpenAIEmbeddings to capture the embedding data:

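from typing import Dict, List, Optional

import numpy as np
import pandas as pd
from langchain.embeddings import OpenAIEmbeddings  # import path assumed from LangChain at the time of this issue


class OpenAIEmbeddingsWrapper(OpenAIEmbeddings):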
    """
    A wrapper around OpenAIEmbeddings that stores the query and document
    embeddings.
    """

    query_text_to_embedding: Dict[str, List[float]] = {}
    document_text_to_embedding: Dict[str, List[float]] = {}

    def embed_query(self, text: str) -> List[float]:
        embedding = super().embed_query(text)
        self.query_text_to_embedding[text] = embedding
        return embedding

    def embed_documents(self, texts: List[str], chunk_size: Optional[int] = 0) -> List[List[float]]:
        embeddings = super().embed_documents(texts, chunk_size)
        for text, embedding in zip(texts, embeddings):
            self.document_text_to_embedding[text] = embedding
        return embeddings

    @property
    def query_embedding_dataframe(self) -> pd.DataFrame:
        return self._convert_text_to_embedding_map_to_dataframe(self.query_text_to_embedding)

    @property
    def document_embedding_dataframe(self) -> pd.DataFrame:
        return self._convert_text_to_embedding_map_to_dataframe(self.document_text_to_embedding)

    @staticmethod
    def _convert_text_to_embedding_map_to_dataframe(
        text_to_embedding: Dict[str, List[float]]
    ) -> pd.DataFrame:
        # Read keys and values directly so an empty map yields an empty frame
        # instead of failing to unpack.
        texts = list(text_to_embedding.keys())
        embedding_arrays = [np.array(embedding) for embedding in text_to_embedding.values()]
        return pd.DataFrame.from_dict(
            {
                "text": texts,
                "text_vector": embedding_arrays,
            }
        )

I would like the LangChain callback system to support this use case directly, so that subclassing the embeddings class is not needed.
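
To make the request concrete, here is a minimal, library-independent sketch of what an embeddings hook could look like. The names `EmbeddingCallbackHandler`, `on_embedding_start`, `on_embedding_end`, `RecordingHandler`, and `CallbackEmbeddings` are hypothetical and are not part of LangChain's API. `fake_embed` is a stand-in for a real embedding model so the example runs without credentials.

```python
from typing import Callable, Dict, Iterable, List, Protocol


class EmbeddingCallbackHandler(Protocol):
    """Hypothetical handler interface; not part of LangChain's API."""

    def on_embedding_start(self, texts: List[str]) -> None: ...

    def on_embedding_end(
        self, texts: List[str], embeddings: List[List[float]]
    ) -> None: ...


class RecordingHandler:
    """Example handler that stores every text/embedding pair it sees."""

    def __init__(self) -> None:
        self.text_to_embedding: Dict[str, List[float]] = {}

    def on_embedding_start(self, texts: List[str]) -> None:
        pass  # Nothing to record before the embeddings exist.

    def on_embedding_end(
        self, texts: List[str], embeddings: List[List[float]]
    ) -> None:
        for text, embedding in zip(texts, embeddings):
            self.text_to_embedding[text] = embedding


class CallbackEmbeddings:
    """Wraps any embed function and notifies handlers around each call."""

    def __init__(
        self,
        embed_fn: Callable[[List[str]], List[List[float]]],
        callbacks: Iterable[EmbeddingCallbackHandler],
    ) -> None:
        self._embed_fn = embed_fn
        self._callbacks = list(callbacks)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        for handler in self._callbacks:
            handler.on_embedding_start(texts)
        embeddings = self._embed_fn(texts)
        for handler in self._callbacks:
            handler.on_embedding_end(texts, embeddings)
        return embeddings

    def embed_query(self, text: str) -> List[float]:
        # A query is embedded as a single-item batch, so handlers fire once.
        return self.embed_documents([text])[0]


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in model: each vector is [character count, vowel count]."""
    return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in texts]


handler = RecordingHandler()
embedder = CallbackEmbeddings(fake_embed, callbacks=[handler])
query_vector = embedder.embed_query("hello")
```

With a hook along these lines, a handler such as the Phoenix one described above could collect query and document embeddings without wrapping or subclassing a specific embeddings class.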

This feature has been requested for TypeScript and has an open PR. An additional motivation is to maintain parity with the TypeScript library.

Your contribution

I am willing to implement, test, and document this feature with guidance from the LangChain team. I am also happy to provide feedback on an implementation by the LangChain team by building an example callback handler using the embeddings hook.

ppramesi commented 1 year ago

If I'm not mistaken, there's already an open PR for embedding callbacks: #7920

dosubot[bot] commented 10 months ago

Hi, @axiomofjoy! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you requested the addition of callback support for embeddings in the LangChain library. You proposed a specific approach for implementation and mentioned that you are willing to contribute to the process. Another user, @ppramesi, mentioned that there is already an open pull request (#7920) for embedding callback. It seems like you reacted positively to this comment.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!