run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Scores in the retrieved nodes are in reversed order in the Weaviate integration #14728

Open terilias opened 1 month ago

terilias commented 1 month ago

Bug Description

Hello, I was using the retriever from a vector store index that had been initialized from a Weaviate collection. I noticed that the retrieved nodes have their scores in reversed order: the first (most relevant) node has a score equal to zero, and as we move towards the least relevant nodes, the score increases.

We found in the code that LlamaIndex computes 1 - score, where score is the value that Weaviate returns. But Weaviate now returns a similarity score instead of a distance. I think that only in pure vector (as opposed to hybrid) search can a distance be returned instead of a similarity (see here). You can use the code below (exported from a Jupyter notebook) to compare the scores that LlamaIndex gives with the scores that Weaviate returns.
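For illustration, here is a minimal sketch of why this conversion reverses the ranking (the helper to_node_score is hypothetical, not the actual llama-index-vector-stores-weaviate code):

# Hypothetical sketch: treating a similarity as a distance and computing
# 1 - value inverts the ranking, which is what we observe.
def to_node_score(raw: float, raw_is_distance: bool) -> float:
    # For a distance, 1 - raw turns "smaller is better" into "larger is better".
    # For a similarity (what Weaviate's hybrid search returns), the raw value
    # should be used as-is; applying 1 - raw reverses the order instead.
    return 1.0 - raw if raw_is_distance else raw

weaviate_hybrid_scores = [1.0, 0.08, 0.05]  # similarities, best hit first
print([to_node_score(s, raw_is_distance=True) for s in weaviate_hybrid_scores])
# -> [0.0, 0.92, 0.95]: the best hit now has the lowest score (the reported bug)
print([to_node_score(s, raw_is_distance=False) for s in weaviate_hybrid_scores])
# -> [1.0, 0.08, 0.05]: ranking preserved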

Version

llama-index==0.10.53 llama-index-vector-stores-weaviate==1.0.0 weaviate-client==4.6.5

Steps to Reproduce

from llama_index.core import VectorStoreIndex, Document
from llama_index.vector_stores.weaviate import WeaviateVectorStore

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core.node_parser import SimpleNodeParser

import weaviate
import os

import tiktoken
import requests

# In[ ]:
# Embeddings initialization: OpenAI
embed_model = OpenAIEmbedding(model="text-embedding-3-small", api_key=os.environ.get("OPEN_AI_API_KEY"))
tokenizer = tiktoken.encoding_for_model("text-embedding-3-small").encode

# In[ ]:
tokenizer_obj = tokenizer
# The chunk_size must be compatible with the sequence length of the embed_model_obj that is used.
chunk_size = 450
chunk_overlap = 50
# Initialize a node parser that we will use to parse the documents.
# First set up the TokenCountingHandler with our tokenizer and a CallbackManager
# with that token counter, then build the node parser itself.
token_counter_handler = TokenCountingHandler(tokenizer=tokenizer_obj)
callback_manager = CallbackManager([token_counter_handler])
node_parser = SimpleNodeParser.from_defaults(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    callback_manager=callback_manager,
)

# In[66]:
client = weaviate.connect_to_local()

# In[127]:
# The collection has already been created, so we just connect to it here.
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="Test"
)

# In[128]:
vector_store_index = VectorStoreIndex.from_vector_store(vector_store=vector_store,
                                                        embed_model=embed_model,
                                                        transformations=[node_parser],
                                                        show_progress=True)

# In[100]:
def get_wikipedia_article_text(title):
    url = "https://en.wikipedia.org/w/api.php"
    params = {"action": "query", "format": "json", "prop": "extracts", "explaintext": True, "titles": title}
    response = requests.get(url, params=params).json()
    page = next(iter(response["query"]["pages"].values()))
    return page.get("extract", "Article not found.")

python_doc_text = get_wikipedia_article_text("Python (programming language)")
lion_doc_text = get_wikipedia_article_text("Lion")
lion_paragraph = lion_doc_text[:1000]

# In[25]:
python_doc = Document(doc_id='1',
                      text=python_doc_text,
                      metadata={
                           "title_of_parental_document": "Python_(programming_language)",
                           "source": "https://en.wikipedia.org/wiki/Python_(programming_language)"
                       })

# In[101]:
lion_doc = Document(doc_id='2',
                    text=lion_paragraph,
                    metadata={
                       "title_of_parental_document": "Lion",
                       "source": "https://en.wikipedia.org/wiki/Lion"
                   })

# In[104]:
vector_store_index.insert(document=python_doc)
vector_store_index.insert(document=lion_doc)

# In[129]:
retriever = vector_store_index.as_retriever(similarity_top_k=10, 
                                            vector_store_query_mode="hybrid",
                                            alpha=0.5)
nodes = retriever.retrieve("What is lion?")

# In[131]:
# The retriever always returns a list of nodes in descending order of score (most relevant chunks first).
# So why does the most relevant chunk have a score of zero here?
for node in nodes:
    print(node.text)
    print()
    print(node.score)
    print("__________________________________________________________________________________________________________")
    print("__________________________________________________________________________________________________________")

print([node.score for node in nodes])
# The scores are:
# [0.0, 0.9217832833528519, 0.9288488179445267, 0.9365298748016357, 0.937725093215704,
#  0.9396311119198799, 0.9409564286470413, 0.9446112886071205, 0.9455222226679325, 0.9476451091468334]

# In[108]:

# Code to query Weaviate without LlamaIndex.
query = "what is lion?"
query_vector = embed_model.get_query_embedding(query=query)

# In[121]:
from weaviate.classes.query import MetadataQuery

# In[123]:
# Get a handle on the same Weaviate collection that backs the vector store above.
collection = client.collections.get("Test")
response = collection.query.hybrid(
    query=query,
    vector=query_vector,
    return_metadata=MetadataQuery(
        distance=True,
        certainty=True,
        score=True,
        explain_score=True,
    ),
    alpha=0.50,
    limit=10,
)
x=[]
for o in response.objects:
    print(o.properties)
    print()
    print(o.metadata)
    print("_______________")
    x.append(o.metadata.score)

print(x)

# Scores from Weaviate:
# [1.0, 0.08082851022481918, 0.07240726053714752, 0.07034952938556671, 0.0660715326666832,
#  0.06566160172224045, 0.06270790100097656, 0.05426621064543724, 0.05242578685283661, 0.05184878036379814]
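
# As a quick sanity check (a sketch, not part of the original notebook): applying
# the same 1 - score subtraction to the raw Weaviate similarities above reproduces
# the shape of the scores that LlamaIndex reported earlier.
weaviate_raw_scores = [1.0, 0.0808, 0.0724, 0.0703, 0.0661]
converted = [round(1.0 - s, 4) for s in weaviate_raw_scores]
print(converted)  # [0.0, 0.9192, 0.9276, 0.9297, 0.9339] -- the best hit ends up with the lowest score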

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 1 month ago

The issue arises because LlamaIndex performs a subtraction operation (1 - score) on the scores returned by Weaviate, which now returns similarity scores instead of distances. This results in the scores being reversed.

To fix this, you can modify the code to handle similarity scores correctly. Here’s how you can adjust the scores:

  1. Modify the score calculation: Update the part of the code where the score is calculated to use the similarity score directly without subtracting from 1.

  2. Reverse the order of the retrieved results: Ensure that the results are sorted in descending order of relevance.

Here’s an example of how you can adjust the code:

# Modify the score calculation
def adjust_scores(scores):
    return scores  # Use the similarity scores directly

# Reverse the order of the retrieved results
response = retriever.retrieve("What is lion?")
response = response[::-1]  # Reverse the order of the results
print(response[0])

Additionally, if you are using the VectorContextRetriever from the LlamaIndex framework, you can modify the retrieve_from_graph method to handle the scores correctly:

class VectorContextRetriever(BasePGRetriever):
    # ... (other parts of the class)

    def retrieve_from_graph(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        vector_store_query = self._get_vector_store_query(query_bundle)

        triplets = []
        kg_ids = []
        new_scores = []
        if self._graph_store.supports_vector_queries:
            result = self._graph_store.vector_query(vector_store_query)
            if len(result) != 2:
                raise ValueError("No nodes returned by vector_query")
            kg_nodes, scores = result

            kg_ids = [node.id for node in kg_nodes]
            triplets = self._graph_store.get_rel_map(
                kg_nodes, depth=self._path_depth, ignore_rels=[KG_SOURCE_REL]
            )
        elif self._vector_store is not None:
            query_result = self._vector_store.query(vector_store_query)
            if query_result.nodes is not None and query_result.similarities is not None:
                kg_ids = self._get_kg_ids(query_result.nodes)
                scores = query_result.similarities
                kg_nodes = self._graph_store.get(ids=kg_ids)
                triplets = self._graph_store.get_rel_map(
                    kg_nodes, depth=self._path_depth, ignore_rels=[KG_SOURCE_REL]
                )

            elif query_result.ids is not None and query_result.similarities is not None:
                kg_ids = query_result.ids
                scores = query_result.similarities
                kg_nodes = self._graph_store.get(ids=kg_ids)
                triplets = self._graph_store.get_rel_map(
                    kg_nodes, depth=self._path_depth, ignore_rels=[KG_SOURCE_REL]
                )

        for triplet in triplets:
            score1 = (
                scores[kg_ids.index(triplet[0].id)] if triplet[0].id in kg_ids else 0.0
            )
            score2 = (
                scores[kg_ids.index(triplet[2].id)] if triplet[2].id in kg_ids else 0.0
            )
            new_scores.append(max(score1, score2))

        assert len(triplets) == len(new_scores)

        # sort by score
        top_k = sorted(zip(triplets, new_scores), key=lambda x: x[1], reverse=True)

        return self._get_nodes_with_score([x[0] for x in top_k], [x[1] for x in top_k])[::-1]  # Reverse the order of the results

This modification will ensure that the results are returned in the correct order of relevance directly from the retriever [1][2][3].

To continue talking to Dosu, mention @dosu.

logan-markewich commented 1 month ago

@terilias Seems like you have a decent understanding of the issue. Feel free to open a PR 👍🏻

terilias commented 1 month ago

Hi @logan-markewich, thanks, I would like to give it a try! I have found some more issues in the Weaviate integration and was thinking of posting them for reference, each in a separate thread, and trying to work on them. Is that OK?

logan-markewich commented 6 days ago

@terilias go for it!