run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question][Bug]: Similarity Score and Order in PGVectorStore #7214

Closed: rendyfebry closed this issue 1 year ago

rendyfebry commented 1 year ago


Question

1. As per my understanding, the score on `NodeWithScore` indicates the similarity between the node content and the query. The score itself is in the range from 0 to 1; the higher the number, the more similar they are. (This is only my assumption based on this snippet, please CMIIW.)

    nodes = [
        NodeWithScore(node=Node(text="text"), score=0.7),
        NodeWithScore(node=Node(text="text"), score=0.8),
    ]

    # filter nodes below 0.75 similarity score
    processor = SimilarityPostprocessor(similarity_cutoff=0.75)
    filtered_nodes = processor.postprocess_nodes(nodes)


2. If the above assumption is correct, then when querying the index, we expect to get nodes with scores ordered from **HIGHER TO LOWER**.

3. On `PGVectorStore`, we [query](https://github.com/jerryjliu/llama_index/blob/main/llama_index/vector_stores/postgres.py#L170...L172) with a simple `.order_by()` clause, which results in ascending order, so the scores would be ordered from **LOWER TO HIGHER**.

stmt = select(  # type: ignore
    self.table_class, self.table_class.embedding.l2_distance(embedding)
).order_by(self.table_class.embedding.l2_distance(embedding))


4. But if we look further, the query uses the [l2_distance function from pgvector](https://github.com/pgvector/pgvector#vector-functions), which returns the Euclidean distance. I'm not familiar with this algorithm, but from the name "distance" I assume a shorter distance means higher similarity. So from the Euclidean distance point of view, we actually expect the results to be sorted by `l2_distance` from **LOWER TO HIGHER**.

![Euclidean Distance](https://github.com/jerryjliu/llama_index/assets/1105460/2763025d-2f1a-4a48-b6ec-d827f344dca8)
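(For reference, the formula shown in the image above: the Euclidean distance between two vectors a and b is d(a, b) = sqrt(Ī£ (a_i āˆ’ b_i)²). Identical vectors have distance 0, and larger values mean less similar, so there is no fixed upper bound of 1.)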

---

TL;DR: In the end, the order of the results is still as expected: nodes are returned from most similar to least similar (shorter `l2_distance` first). But since `PGVectorStore` uses the Euclidean distance itself as the similarity score, the similarity scores are ordered from smaller to larger, which messes up the NodePostProcessor logic.
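To make the failure mode concrete, here is a minimal pure-Python sketch of the cutoff check that `SimilarityPostprocessor` performs (the node texts and distance values below are made up for illustration): when raw L2 distances are passed through as scores, the cutoff keeps the least similar node and drops the most similar one.

    # Hypothetical (text, score) pairs: with PGVectorStore, "score" is a raw
    # L2 distance, so SMALLER means MORE similar.
    results = [
        ("exact match", 0.10),       # most similar node
        ("loosely related", 0.90),   # least similar node
    ]

    similarity_cutoff = 0.75

    # SimilarityPostprocessor keeps nodes whose score is >= the cutoff,
    # assuming a higher score means more similar.
    kept = [(text, score) for text, score in results if score >= similarity_cutoff]

    print(kept)  # [('loosely related', 0.9)] -- the most similar node was dropped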

Is my understanding correct? If yes, can we fix the similarity score calculation on the `PGVectorStore`?
dosubot[bot] commented 1 year ago

Answer generated by a šŸ¤–

Answer

Hello,

Your understanding is correct. The `PGVectorStore` uses Euclidean distance (L2 distance) as a measure of dissimilarity, not similarity: lower values indicate higher similarity. Therefore, ordering the scores from smaller to larger is the correct approach. This is evident in the `_build_query` method, where the results are ordered by the Euclidean distance in ascending order:

stmt = select(  # type: ignore
    self.table_class, self.table_class.embedding.l2_distance(embedding)
).order_by(self.table_class.embedding.l2_distance(embedding))

This means that the results with the smallest Euclidean distance (and hence the highest similarity) will be returned first.

On the other hand, the SimilarityPostprocessor in LlamaIndex is designed to interpret the similarity scores such that a higher score indicates higher similarity. It does not consider lower scores as higher similarity. This is evident from the postprocess_nodes method in the SimilarityPostprocessor class.

In this method, if a similarity cutoff exists, it checks if the similarity score of a node is less than the cutoff. If it is, the node is not used (should_use_node = False). This implies that nodes with scores higher than the cutoff are considered more similar.

Here is the relevant code:

class SimilarityPostprocessor(BasePydanticNodePostprocessor):
    """Similarity-based Node processor."""

    similarity_cutoff: float = Field(default=None)

    def postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        """Postprocess nodes."""
        sim_cutoff_exists = self.similarity_cutoff is not None

        new_nodes = []
        for node in nodes:
            should_use_node = True
            if sim_cutoff_exists:
                similarity = node.score
                if similarity is None:
                    should_use_node = False
                elif cast(float, similarity) < cast(float, self.similarity_cutoff):
                    should_use_node = False

            if should_use_node:
                new_nodes.append(node)

        return new_nodes

So, the SimilarityPostprocessor and PGVectorStore are designed to work together correctly. The PGVectorStore returns nodes ordered by similarity (lower Euclidean distance means higher similarity), and the SimilarityPostprocessor correctly interprets these scores.

I hope this clarifies your question. If you have any more questions, feel free to ask.

Best, Dosu


rendyfebry commented 1 year ago

@dosu-beta

The problem is that `PGVectorStore` uses the Euclidean distance itself as the similarity score, as you can see in the `PGVectorStore._query_with_score()` function:

    def _query_with_score(
        self,
        embedding: Optional[List[float]],
        limit: int = 10,
        metadata_filters: Optional[MetadataFilters] = None,
    ) -> List[DBEmbeddingRow]:
        stmt = self._build_query(embedding, limit, metadata_filters)
        with self._session() as session:
            with session.begin():
                res = session.execute(stmt)
                return [
                    DBEmbeddingRow(
                        node_id=item.node_id,
                        text=item.text,
                        metadata=item.metadata_,
                        similarity=sim,
                    )
                    for item, sim in res.all()
                ]

Due to this logic, the first returned node will have the lowest score because it uses the Euclidean distance as the similarity score. This is not expected, because when we query the index, we expect to get the node with the highest score first.

dosubot[bot] commented 1 year ago

Answer generated by a šŸ¤–

Answer

Hello,

The PGVectorStore._query_with_score() function uses the Euclidean distance (L2 distance) as the similarity score because it is a common measure of similarity in vector space models. In this context, the Euclidean distance between two vectors (embeddings) represents how similar they are. The smaller the distance, the more similar the vectors are.

The function returns the nodes in ascending order of their similarity scores (Euclidean distances). This is because in the context of Euclidean distances, a smaller distance means a higher similarity. Therefore, the first returned node has the lowest score (smallest distance), which means it is the most similar to the query.

If you want to return the node with the highest score first, you can modify the _build_query function to order the results in descending order of their similarity scores. However, this would mean that the node with the largest Euclidean distance (least similar) would be returned first. If you want to maintain the concept of returning the most similar node first but with a higher score, you could use a different similarity measure that increases with similarity, such as cosine similarity.

Here's how you can modify the _build_query function to order by descending similarity:

def _build_query(
    self,
    embedding: Optional[List[float]],
    limit: int = 10,
    metadata_filters: Optional[MetadataFilters] = None,
) -> Any:
    import sqlalchemy
    from sqlalchemy import select

    stmt = select(  # type: ignore
        self.table_class, self.table_class.embedding.l2_distance(embedding)
).order_by(self.table_class.embedding.l2_distance(embedding).desc())  # largest distance first, i.e. least similar first
    if metadata_filters:
        for filter_ in metadata_filters.filters:
            bind_parameter = f"value_{filter_.key}"
            stmt = stmt.where(  # type: ignore
                sqlalchemy.text(f"metadata_->>'{filter_.key}' = :{bind_parameter}")
            )
            stmt = stmt.params(  # type: ignore
                **{bind_parameter: str(filter_.value)}
            )
    return stmt.limit(limit)  # type: ignore

Please note that this change will affect the logic of your application and may not be what you want if you still want the most similar nodes to be returned first.

Let me know if you have any other questions or need further clarification.

Best, Dosu


logan-markewich commented 1 year ago

@rendyfebry hmmm, just tried out a few queries.

I don't think it's as simple as it seems. Most queries seem to return a "similarity" of around 0.5-0.6. For example, copy/pasting an exact sentence and querying with it returns ~0.5.

Are there any details you could find about how l2 is actually calculated? We just need to normalize it to a scale of 0-1.
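(For reference: `l2_distance` is the plain Euclidean distance, the square root of the sum of squared differences. It isn't bounded by 1 in general, but for unit-normalized embeddings, such as OpenAI's, it satisfies d² = 2 Ā· (1 āˆ’ cosine_similarity), so 1 āˆ’ d²/2 maps it back to cosine similarity. A minimal sketch, assuming unit-normalized embeddings:)

    import math

    def l2_distance(a, b):
        # Euclidean (L2) distance: sqrt of the sum of squared differences.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def l2_to_cosine_similarity(d):
        # For UNIT-normalized vectors, d^2 = 2 * (1 - cos_sim),
        # so cos_sim = 1 - d^2 / 2. (Assumes normalized embeddings.)
        return 1.0 - d ** 2 / 2.0

    a = [0.6, 0.8]  # unit-length example vectors
    b = [0.8, 0.6]
    d = l2_distance(a, b)
    print(d, l2_to_cosine_similarity(d))  # ~0.2828, 0.96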

logan-markewich commented 1 year ago

@rendyfebry this seems to work better? The similarity is now inverted, but it makes more sense:

stmt = select(  # type: ignore
    self.table_class, self.table_class.embedding.cosine_distance(embedding)
).order_by(self.table_class.embedding.cosine_distance(embedding))

rendyfebry commented 1 year ago

Thanks @logan-markewich

I'm not sure whether to use Euclidean distance or cosine distance; from a quick Google search, both seem to work.

But the problem is actually the scoring scale. Since both are distance measures, a smaller number indicates higher similarity, so we still need to convert the score (as we expect a higher score to indicate better similarity):

similarity = (1 - distance)

I see your PR already covers that, which is good. And it turns out the same logic is actually used in other VectorStores as well.
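For illustration, here is a minimal sketch of what that conversion looks like inside `_query_with_score` (a sketch of the idea, not necessarily the exact code in the PR). With pgvector's `cosine_distance`, distance = 1 āˆ’ cosine similarity, so subtracting from 1 recovers a score where higher means more similar:

    return [
        DBEmbeddingRow(
            node_id=item.node_id,
            text=item.text,
            metadata=item.metadata_,
            similarity=(1 - distance),  # convert distance back to similarity
        )
        for item, distance in res.all()
    ]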

Anyway, I made a small PR to improve the clarity: https://github.com/jerryjliu/llama_index/pull/7226