run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Inconsistent embedding results between LlamaIndex and the HuggingFace API #16620

Closed Kai-Chen00 closed 1 month ago

Kai-Chen00 commented 1 month ago

Question

Below are two pieces of code:

  1. the first one retrieves the top-k most similar chunks through the LlamaIndex API:
    
    import chromadb
    from llama_index.vector_stores.chroma import ChromaVectorStore
    from llama_index.core.node_parser import SentenceSplitter
    from llama_index.core import StorageContext, Settings, VectorStoreIndex
    from chromadb.utils import embedding_functions
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding

    embed_model = HuggingFaceEmbedding(
        model_name="dunzhang/stella_en_1.5B_v5",
        revision="038b58da67ca1432ed75b0e4a44bba403020d38a",
    )
    Settings.node_parser = parser  # `parser` (a SentenceSplitter instance) is defined elsewhere

    chroma_client = chromadb.PersistentClient()
    try:
        chroma_collection = chroma_client.create_collection(
            "entity_collection", metadata={"hnsw:space": "cosine"}
        )
    except Exception as e:
        if "already exists" in str(e):
            chroma_collection = chroma_client.get_collection("entity_collection")
        else:
            raise e

    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        entity_doc,  # `entity_doc` (a list of Documents) is defined elsewhere
        storage_context=storage_context,
        embed_model=embed_model,
        show_progress=True,
    )

    retriever = index.as_retriever(similarity_top_k=30)
    nodes = retriever.retrieve("holistic care")

  2. the second piece of code below tests the same embedding model directly through the HuggingFace SentenceTransformers API:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    def cosine_similarity(vec1, vec2):
        dot_product = np.dot(vec1, vec2)
        norm_vec1 = np.linalg.norm(vec1)
        norm_vec2 = np.linalg.norm(vec2)
        return dot_product / (norm_vec1 * norm_vec2)

    queries = [
        "holistic care",
    ]
    docs = [
        "holistic care",
        "Physical care",
        "Psychological care",
        "Social and family care",
        "Spiritual and cultural care",
        "Effective ways of communicating",
        "Assisting patients",
        "patients to make decisions",
        "emotional reassurance and support",
        "patients advocat",
        "Effective ways of communicating",
    ]

    model_test = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True).cuda()
    query_embeddings = model_test.encode(queries)
    doc_embeddings = model_test.encode(docs)
    similarities = model_test.similarity(query_embeddings, doc_embeddings)
    print(similarities)


The output of the second piece of code is shown below:

    tensor([[1.0000, 0.6447, 0.6307, 0.6184, 0.6823, 0.4400, 0.5293, 0.6325, 0.6819, 0.6945, 0.4400]])
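As a cross-check, the cosine_similarity helper defined above (unused in the snippet) returns the same leading value when applied to the raw embeddings; a small sketch reusing the arrays computed above:

    # The query and the first doc are the same string, so this prints ~1.0.
    print(cosine_similarity(query_embeddings[0], doc_embeddings[0]))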



From the second piece of code we can at least see that for an identical entity (phrase), such as "holistic care" used as the query in this example, the cosine similarity with itself is 1, which matches expectations.
However, in the first piece of code I used the identical embedding model, with "holistic care" again as the query to retrieve relevant text from ChromaDB (I am certain there is a "holistic care" chunk in the database), yet "holistic care" does not come back at the top. It can eventually be retrieved (depending on how many chunks I ask for), but the similarity scores are always very low. What happened? Is there a problem with my first piece of code?
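One way to see what text actually gets embedded on the LlamaIndex side is to print each retrieved node's content in embedding mode; a minimal diagnostic sketch, reusing the `nodes` returned by the retriever above:

    from llama_index.core.schema import MetadataMode

    # Print the exact string LlamaIndex submits to the embedding model for
    # each node; if metadata is included, it differs from the raw chunk text.
    for node_with_score in nodes:
        print("score:", node_with_score.score)
        print(repr(node_with_score.node.get_content(metadata_mode=MetadataMode.EMBED)))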
dosubot[bot] commented 1 month ago

The discrepancy you're experiencing between the LlamaIndex and HuggingFace API results could be due to several factors:

  1. Embedding Model Configuration: Ensure that the HuggingFaceEmbedding in LlamaIndex is configured identically to the SentenceTransformer in your second code snippet. Differences in parameters like pooling strategy or normalization can affect the embeddings.

  2. Similarity Calculation: In the LlamaIndex setup, the similarity scores are derived from distances using a specific formula (similarity_score = math.exp(-distance)) [1]; see the sketch after this list. This might differ from the direct cosine similarity calculation used in your second code snippet.

  3. ChromaDB Configuration: Check the configuration of your ChromaDB collection, especially the metadata and indexing settings. Ensure that the space is set to "cosine" as intended, and that the collection is correctly populated with the embeddings.

  4. Node Parsing and Chunking: The node_parser in LlamaIndex might be affecting how the text is chunked and stored. Ensure that the chunks are correctly parsed and stored in the database, as this can impact retrieval results [2].

  5. Query Parameters: Verify the parameters used in the retrieval process, such as similarity_top_k, to ensure they align with your expectations for the number of results and their similarity scores [3].
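To make point 2 concrete, here is a small editorial sketch of how the cited exp(-distance) mapping compresses scores relative to raw cosine similarity, assuming Chroma's cosine space where distance = 1 - cosine similarity:

    import math

    # A perfect match (distance 0) still maps to 1.0, but any gap in
    # cosine similarity is compressed by the exponential.
    for cos_sim in (1.0, 0.69, 0.44):
        distance = 1.0 - cos_sim
        print(f"cos={cos_sim:.2f}  distance={distance:.2f}  score={math.exp(-distance):.3f}")
    # cos=1.00 -> score=1.000; cos=0.69 -> score=0.733; cos=0.44 -> score=0.571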

By reviewing these aspects, you should be able to identify the cause of the inconsistency and adjust your setup accordingly.

To continue talking to Dosu, mention @dosu.

Kai-Chen00 commented 1 month ago

@dosu, so is there any problem with my code?

dosubot[bot] commented 1 month ago

The issues in your code might stem from several areas:

  1. Embedding Model Configuration: Ensure that the HuggingFaceEmbedding in LlamaIndex is configured the same way as the SentenceTransformer in your second code snippet; differences in settings like pooling strategy or normalization can lead to different embeddings (see the sketch after this list).

  2. Similarity Calculation: LlamaIndex might use a different method to convert distances to similarity scores compared to the direct cosine similarity calculation in your second code snippet. This could affect the similarity scores you observe.

  3. ChromaDB Configuration: Double-check the configuration of your ChromaDB collection, especially the metadata and indexing settings. Make sure the space is set to "cosine" and that the collection is correctly populated with the embeddings.

  4. Node Parsing and Chunking: The node_parser in LlamaIndex might affect how text is chunked and stored. Ensure that the text is correctly parsed and stored in the database, as this can impact retrieval results [1][2].

  5. Query Parameters: Verify the parameters used in the retrieval process, such as similarity_top_k, to ensure they align with your expectations for the number of results and their similarity scores [3].
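As a concrete check of point 1, the same string can be embedded through both paths and compared directly; a diagnostic sketch, assuming HuggingFaceEmbedding forwards trust_remote_code for this checkpoint:

    import numpy as np
    from sentence_transformers import SentenceTransformer
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding

    text = "holistic care"

    # Embed the same string through both paths.
    st_model = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True)
    st_vec = st_model.encode(text)

    li_model = HuggingFaceEmbedding(
        model_name="dunzhang/stella_en_1.5B_v5", trust_remote_code=True
    )
    li_vec = np.array(li_model.get_text_embedding(text))

    # If the two configurations match, this prints a value close to 1.0.
    cos = float(np.dot(st_vec, li_vec) / (np.linalg.norm(st_vec) * np.linalg.norm(li_vec)))
    print(cos)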

Reviewing these areas should help you identify and resolve the inconsistencies in your results.

To continue talking to Dosu, mention @dosu.

Kai-Chen00 commented 1 month ago

Is there anyone who can help me look into this problem? I have been stuck on it for almost a week. Any suggestions are welcome!

Kai-Chen00 commented 1 month ago

Solved! The metadata attached to each node was making the retrieval scores abnormal; I am not sure why.
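For context: by default LlamaIndex prepends node metadata to the text before embedding, so the stored vectors are not embeddings of the bare chunk text. A minimal sketch of inspecting and excluding metadata from the embedding step (the `source` and `category` keys are hypothetical):

    from llama_index.core import Document
    from llama_index.core.schema import MetadataMode

    doc = Document(
        text="holistic care",
        metadata={"source": "entities.csv", "category": "care_concepts"},
    )

    # By default the metadata is rendered above the text for embedding,
    # which shifts the resulting vector away from the bare phrase.
    print(doc.get_content(metadata_mode=MetadataMode.EMBED))

    # Excluding the keys restores the bare text for embedding purposes.
    doc.excluded_embed_metadata_keys = ["source", "category"]
    print(doc.get_content(metadata_mode=MetadataMode.EMBED))  # -> "holistic care"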