langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License
89.39k stars 14.1k forks

ChromaDb EmbeddingFunction definition updated #13061

Closed jayant-yadav closed 8 months ago

jayant-yadav commented 8 months ago

System Info

Using Google Colab Free version with T4 GPU. chromadb==0.4.16

Who can help?

@agola11 @hwchase17

Reproduction

According to the latest ChromaDB migration notes, the EmbeddingFunction definition has been updated, and the change affects all custom embedding functions.
As a result, langchain.embeddings.HuggingFaceBgeEmbeddings no longer matches the new definition and raises the following error:

ValueError: Expected EmbeddingFunction.__call__ to have the following signature: odict_keys(['self', 'input']), got odict_keys(['self', 'args', 'kwargs'])
Please see https://docs.trychroma.com/embeddings for details of the EmbeddingFunction interface.
Please note the recent change to the EmbeddingFunction interface: https://docs.trychroma.com/migration#migration-to-0416---november-7-2023 

The error above can be reproduced by inserting documents into ChromaDB, embedded with HuggingFaceBgeEmbeddings, like so:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceBgeEmbeddings
from transformers import AutoTokenizer

embedding_function = HuggingFaceBgeEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    model_kwargs={'device': 'cuda'},
    encode_kwargs={'normalize_embeddings': True},
    query_instruction="Represent this sentence for searching relevant passages: "
)  
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-base-en-v1.5')
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=100, chunk_overlap=0
)
text = 'Some text that needs to be embedded.'
print(len(embedding_function.embed_query(text))) # works so far
splits = text_splitter.create_documents([text])
db = Chroma.from_documents(splits, embedding_function, persist_directory="./chroma_db")

I am not sure, but the fix might lie in correcting the HuggingFaceBgeEmbeddings class.

Expected behavior

Running the following line should produce a valid db object:


db = Chroma.from_documents(splits, embedding_function, persist_directory="./chroma_db")

dosubot[bot] commented 8 months ago

🤖

Hello @jayant-yadav,

Nice to meet you! I'm here to assist you with your issues and questions related to the LangChain repository. I'm a bot created to provide immediate responses while you're waiting for a human maintainer. Please don't hesitate to ask me anything you need. Let's get started!

Based on the error message you provided, it seems that the EmbeddingFunction interface has indeed changed. The error message suggests that the EmbeddingFunction.__call__ method is now expected to have the signature odict_keys(['self', 'input']), but the HuggingFaceBgeEmbeddings class has the signature odict_keys(['self', 'args', 'kwargs']).
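The `odict_keys([...])` values in the error are what Python's `inspect.signature` reports for a callable's parameter names. The sketch below reproduces that comparison on two stand-in classes. It only illustrates the kind of check the error message implies and is not Chroma's actual validation code; `OldStyleEmbedder`, `NewStyleEmbedder`, `call_parameter_names`, and `matches_expected` are hypothetical names.

```python
import inspect
from typing import List

# Parameter names the error message says Chroma expects on __call__.
EXPECTED = ["self", "input"]


class OldStyleEmbedder:
    """Stand-in for a class whose __call__ takes *args/**kwargs."""

    def __call__(self, *args, **kwargs):
        return []


class NewStyleEmbedder:
    """Stand-in for a class whose __call__ takes a single `input`."""

    def __call__(self, input):
        return []


def call_parameter_names(cls: type) -> List[str]:
    """Return the parameter names of cls.__call__, in declaration order."""
    return list(inspect.signature(cls.__call__).parameters.keys())


def matches_expected(cls: type) -> bool:
    """True when __call__ has exactly the expected parameter names."""
    return call_parameter_names(cls) == EXPECTED


if __name__ == "__main__":
    for cls in (OldStyleEmbedder, NewStyleEmbedder):
        print(cls.__name__, call_parameter_names(cls), matches_expected(cls))
```

Note that the comparison is on the signature of `__call__`, not on `embed_documents` or `embed_query`.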

The EmbeddingFunction is not explicitly defined in the LangChain codebase, but it is used as an interface for embedding functions. The HuggingFaceBgeEmbeddings class, which implements this interface, uses the sentence_transformers.SentenceTransformer to encode or embed the documents and queries.

To resolve this issue, you might need to update the HuggingFaceBgeEmbeddings class to match the new EmbeddingFunction interface. Specifically, you might need to change the embed_documents and embed_query methods to accept a single 'input' argument instead of 'args' and 'kwargs'. Here's an example of how you might do this:

from typing import List

def embed_documents(self, input: List[str]) -> List[List[float]]:
    """Compute doc embeddings using a HuggingFace transformer model.

    Args:
        input: The list of texts to embed.

    Returns:
        List of embeddings, one for each text.
    """
    input = [t.replace("\n", " ") for t in input]
    embeddings = self.client.encode(input, **self.encode_kwargs)
    return embeddings.tolist()

def embed_query(self, input: str) -> List[float]:
    """Compute query embeddings using a HuggingFace transformer model.

    Args:
        input: The text to embed.

    Returns:
        Embeddings for the text.
    """
    input = input.replace("\n", " ")
    embedding = self.client.encode(
        self.query_instruction + input, **self.encode_kwargs
    )
    return embedding.tolist()

Please note that this is just a suggestion and might not fully resolve the issue. You might need to make additional changes to the HuggingFaceBgeEmbeddings class to fully comply with the new EmbeddingFunction interface. I recommend checking the ChromaDB migration logs and the EmbeddingFunction interface documentation for more details.
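Since the error names `__call__` rather than `embed_documents`, an alternative is to leave the embeddings class untouched and wrap it in a small adapter whose `__call__` takes a single `input` argument. The following is a minimal sketch under stated assumptions: `ChromaCallableAdapter` and `FakeEmbeddings` are hypothetical names, the adapter is duck-typed rather than derived from any chromadb base class, and whether a given chromadb version accepts such an object has not been verified here.

```python
from typing import List


class ChromaCallableAdapter:
    """Hypothetical wrapper exposing __call__(self, input).

    It delegates to the wrapped object's embed_documents method,
    the method LangChain embedding classes provide for batches.
    """

    def __init__(self, embeddings) -> None:
        self._embeddings = embeddings

    def __call__(self, input: List[str]) -> List[List[float]]:
        # Copy to a plain list so the wrapped object always gets a list.
        return self._embeddings.embed_documents(list(input))


class FakeEmbeddings:
    """Stand-in for a real embeddings object, used only for the demo."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # One-dimensional "embedding": the length of each text.
        return [[float(len(t))] for t in texts]


if __name__ == "__main__":
    adapter = ChromaCallableAdapter(FakeEmbeddings())
    print(adapter(["ab", "abc"]))
```

In real use, the wrapped object would be the HuggingFaceBgeEmbeddings instance in place of `FakeEmbeddings`. The adapter only covers document embedding; query instructions such as `query_instruction` are not applied on this path.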

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

mattalford commented 8 months ago

Quick workaround until HuggingFaceEmbeddings is updated: pip install chromadb==0.4.15

This breaking change affects chromadb 0.4.16

efriis commented 8 months ago

LangChain 0.0.332 has been released with the Chroma team's fix for compatibility with chromadb>=0.4.16.

You can now use the latest of both: pip install -U langchain chromadb
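To check whether an environment falls into the combination reported as broken in this thread (chromadb at 0.4.16 or later with langchain older than 0.0.332), a small version comparison can help. This is a rough sketch: `version_tuple` and `is_reported_broken` are hypothetical helpers, the parsing only handles simple dotted versions, and the rule encodes nothing beyond the two version numbers mentioned above.

```python
import re
from importlib import metadata
from typing import Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse the first three dotted components, e.g. '0.4.16' -> (0, 4, 16)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)  # ignore suffixes such as 'rc1'
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def is_reported_broken(langchain_version: str, chromadb_version: str) -> bool:
    """True for the pairing this thread reports as failing."""
    return (
        version_tuple(chromadb_version) >= (0, 4, 16)
        and version_tuple(langchain_version) < (0, 0, 332)
    )


if __name__ == "__main__":
    try:
        lc = metadata.version("langchain")
        ch = metadata.version("chromadb")
        print(lc, ch, "reported broken:", is_reported_broken(lc, ch))
    except metadata.PackageNotFoundError as exc:
        print("package not installed:", exc)
```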

efriis commented 8 months ago

Actually looks like there was something specific with HuggingFaceBgeEmbeddings as well - could you confirm 0.0.332 with the chroma fix addresses this, and reopen if it's something that needs to be addressed in hugging face?

jayant-yadav commented 8 months ago

@efriis The fix in 0.0.332 works! LangChain's latest version (0.0.332) is now compatible with chromadb==0.4.16. If possible, I would like to know where the changes were made to fix this issue.

efriis commented 8 months ago

#13085

BharatBindage commented 7 months ago

Actually looks like there was something specific with HuggingFaceBgeEmbeddings as well - could you confirm 0.0.332 with the chroma fix addresses this, and reopen if it's something that needs to be addressed in hugging face?

This is fixed now. Thank you, @efriis.

xsuryanshx commented 7 months ago

LangChain 0.0.332 has been released with the Chroma team's fix for compatibility with chromadb>=0.4.16.

You can now use the latest of both: pip install -U langchain chromadb

Thanks, this fixed my error!

jashshah commented 1 month ago

I am using chromadb-0.5.0 and langchain-0.2.1 and I still run into this error when I try to host ChromaDB using a docker container.


hf = HuggingFaceBgeEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs, 
    cache_folder="./cache"
)

chroma_client = chromadb.HttpClient(host='localhost', port=8000)

collection = chroma_client.create_collection(name="DATA_V3", embedding_function=hf)