Closed Kai-Chen00 closed 1 month ago
The discrepancy you're experiencing between the LlamaIndex and HuggingFace API results could be due to several factors:
Embedding Model Configuration: Ensure that the HuggingFaceEmbedding in LlamaIndex is configured identically to the SentenceTransformer in your second code snippet. Differences in parameters like pooling strategy or normalization can affect the embeddings.
Similarity Calculation: In the LlamaIndex setup, the similarity scores are derived from distances using a specific formula (similarity_score = math.exp(-distance)) [1]. This might differ from the direct cosine similarity calculation used in your second code snippet.
ChromaDB Configuration: Check the configuration of your ChromaDB collection, especially the metadata and indexing settings. Ensure that the space is set to "cosine" as intended, and that the collection is correctly populated with the embeddings.
Node Parsing and Chunking: The node_parser in LlamaIndex might be affecting how the text is chunked and stored. Ensure that the chunks are correctly parsed and stored in the database, as this can impact retrieval results [2].
Query Parameters: Verify the parameters used in the retrieval process, such as similarity_top_k, to ensure they align with your expectations for the number of results and their similarity scores [3].
By reviewing these aspects, you should be able to identify the cause of the inconsistency and adjust your setup accordingly.
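To make the similarity-calculation point concrete, here is a minimal sketch of how a cosine similarity would map to the score the retriever reports. It assumes the formula quoted above (math.exp(-distance)) and assumes that the Chroma "cosine" space returns a distance of 1 - cosine similarity. The function name expected_llamaindex_score is hypothetical and used only for illustration.

```python
import math


def expected_llamaindex_score(cosine_sim: float) -> float:
    """Map a cosine similarity to exp(-distance), with distance = 1 - cosine_sim."""
    # Assumption: the vector store reports cosine distance as 1 - cosine similarity.
    distance = 1.0 - cosine_sim
    return math.exp(-distance)


# A few cosine similarities taken from the SentenceTransformer output in the question.
for sim in (1.0000, 0.6447, 0.4400):
    print(f"cosine={sim:.4f} -> score={expected_llamaindex_score(sim):.4f}")
```

Under these assumptions the ranking is unchanged, because exp(-(1 - s)) increases with s, but the absolute numbers differ. The two sets of scores should therefore not be compared directly.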
To continue talking to Dosu, mention @dosu.
@dosu, so is there any problem with my code?
The issues in your code might stem from several areas:
Embedding Model Configuration: Ensure that the HuggingFaceEmbedding in LlamaIndex is configured the same way as the SentenceTransformer in your second code snippet. Differences in settings like pooling strategy or normalization can lead to different embeddings.
Similarity Calculation: LlamaIndex might use a different method to convert distances to similarity scores compared to the direct cosine similarity calculation in your second code snippet. This could affect the similarity scores you observe.
ChromaDB Configuration: Double-check the configuration of your ChromaDB collection, especially the metadata and indexing settings. Make sure the space is set to "cosine" and that the collection is correctly populated with the embeddings.
Node Parsing and Chunking: The node_parser in LlamaIndex might affect how text is chunked and stored. Ensure that the text is correctly parsed and stored in the database, as this can impact retrieval results [1][2].
Query Parameters: Verify the parameters used in the retrieval process, such as similarity_top_k, to ensure they align with your expectations for the number of results and their similarity scores [3].
Reviewing these areas should help you identify and resolve the inconsistencies in your results.
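On the normalization point, a small standard-library sketch may help show when it matters. Cosine similarity does not depend on vector length, but a raw dot product does, so normalization only changes results if one side of the comparison uses dot product or distance on unnormalized vectors. The vectors below are made-up toy values, not real embeddings.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = [3.0, 4.0]  # toy vectors for illustration only
b = [4.0, 3.0]

print(dot(a, b))                        # raw dot product, depends on vector length
print(cosine(a, b))                     # cosine similarity, length-independent
print(dot(normalize(a), normalize(b)))  # dot product of unit vectors matches cosine
```

If both pipelines end up computing cosine similarity, a difference in normalization alone should not explain a score mismatch. A difference in pooling or in the text being embedded still could.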
Is there anyone who can help me look into this problem? I have been stuck on it for almost a week. Any suggestions are welcome!
Solved! The metadata attached to each node makes the retrieval abnormal; I'm not sure why.
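One plausible explanation, offered as an assumption rather than something confirmed in this thread: by default LlamaIndex prepends a node's metadata to its text before embedding, so the vector stored in Chroma represents "metadata + text" rather than the bare text. The sketch below is a simplified stand-in for that behaviour. build_embed_text is a hypothetical helper, not a LlamaIndex function, and the metadata values are made up.

```python
def build_embed_text(text, metadata, excluded_keys=()):
    """Hypothetical stand-in: prepend "key: value" metadata lines to the text."""
    lines = [f"{key}: {value}" for key, value in metadata.items()
             if key not in excluded_keys]
    if not lines:
        return text
    return "\n".join(lines) + "\n\n" + text


meta = {"file_name": "entities.txt", "category": "nursing"}  # made-up metadata

with_meta = build_embed_text("holistic care", meta)
without_meta = build_embed_text("holistic care", meta, excluded_keys=meta.keys())

print(repr(with_meta))     # what would be embedded if metadata is included
print(repr(without_meta))  # what the SentenceTransformer snippet embeds
```

If this is the cause, the query "holistic care" is compared against a longer string than the bare document text, which would lower the score even for an exact match. In LlamaIndex, listing keys in a document's excluded_embed_metadata_keys is the usual way to keep them out of the embedded text; whether that applies to this setup is not confirmed here.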
Question
Below are two pieces of code:
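# Imports assume the llama-index >= 0.10 package layout.
import chromadb
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# `parser` and `entity_doc` are defined elsewhere in the original script.
embed_model = HuggingFaceEmbedding(
    model_name="dunzhang/stella_en_1.5B_v5",
    revision="038b58da67ca1432ed75b0e4a44bba403020d38a",
)
Settings.node_parser = parser

# Create the collection with cosine distance, or reuse it if it already exists.
chroma_client = chromadb.PersistentClient()
try:
    chroma_collection = chroma_client.create_collection(
        "entity_collection", metadata={"hnsw:space": "cosine"}
    )
except Exception as e:
    if "already exists" in str(e):
        chroma_collection = chroma_client.get_collection("entity_collection")
    else:
        raise e

# Build the index on top of the Chroma collection.
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    entity_doc,
    storage_context=storage_context,
    embed_model=embed_model,
    show_progress=True,
)

# Retrieve the 30 most similar nodes for the query.
retriever = index.as_retriever(similarity_top_k=30)
nodes = retriever.retrieve("holistic care")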

import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

queries = [
    "holistic care",
]
docs = [
    "holistic care",
    "Physical care",
    "Psychological care",
    "Social and family care",
    "Spiritual and cultural care",
    "Effective ways of communicating",
    "Assisting patients",
    "patients to make decisions",
    "emotional reassurance and support",
    "patients advocat",
    "Effective ways of communicating"
]

model_test = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True).cuda()
query_embeddings = model_test.encode(queries)
doc_embeddings = model_test.encode(docs)
similarities = model_test.similarity(query_embeddings, doc_embeddings)
print(similarities)
tensor([[1.0000, 0.6447, 0.6307, 0.6184, 0.6823, 0.4400, 0.5293, 0.6325, 0.6819, 0.6945, 0.4400]])