Closed rendyfebry closed 1 year ago
Answer generated by a š¤
Hello,
Your understanding is correct. The PGVectorStore
uses Euclidean Distance (L2 distance) as a measure of dissimilarity, not similarity. Lower values of Euclidean distance indicate higher similarity. Therefore, the order of similarity scores from smaller to higher is the correct approach. This is evident in the _build_query
method where the results are ordered by the Euclidean distance in ascending order:
stmt = select( # type: ignore
self.table_class, self.table_class.embedding.l2_distance(embedding)
).order_by(self.table_class.embedding.l2_distance(embedding))
This means that the results with the smallest Euclidean distance (and hence the highest similarity) will be returned first.
On the other hand, the SimilarityPostprocessor
in LlamaIndex is designed to interpret the similarity scores such that a higher score indicates higher similarity. It does not consider lower scores as higher similarity. This is evident from the postprocess_nodes
method in the SimilarityPostprocessor
class.
In this method, if a similarity cutoff exists, it checks if the similarity score of a node is less than the cutoff. If it is, the node is not used (should_use_node = False
). This implies that nodes with scores higher than the cutoff are considered more similar.
Here is the relevant code:
class SimilarityPostprocessor(BasePydanticNodePostprocessor):
"""Similarity-based Node processor."""
similarity_cutoff: float = Field(default=None)
def postprocess_nodes(
self,
nodes: List[NodeWithScore],
query_bundle: Optional[QueryBundle] = None,
) -> List[NodeWithScore]:
"""Postprocess nodes."""
sim_cutoff_exists = self.similarity_cutoff is not None
new_nodes = []
for node in nodes:
should_use_node = True
if sim_cutoff_exists:
similarity = node.score
if similarity is None:
should_use_node = False
elif cast(float, similarity) < cast(float, self.similarity_cutoff):
should_use_node = False
if should_use_node:
new_nodes.append(node)
return new_nodes
So, the SimilarityPostprocessor
and PGVectorStore
are designed to work together correctly. The PGVectorStore
returns nodes ordered by similarity (lower Euclidean distance means higher similarity), and the SimilarityPostprocessor
correctly interprets these scores.
I hope this clarifies your question. If you have any more questions, feel free to ask.
Best, Dosu
This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a š if this is helpful and š if it is irrelevant.
@dosu-beta
The problem is the PGVectorStore uses the Euclidean distance number as the similarity score, as you can see in the PGVectorStore._query_with_score()
function
def _query_with_score(
self,
embedding: Optional[List[float]],
limit: int = 10,
metadata_filters: Optional[MetadataFilters] = None,
) -> List[DBEmbeddingRow]:
stmt = self._build_query(embedding, limit, metadata_filters)
with self._session() as session:
with session.begin():
res = session.execute(stmt)
return [
DBEmbeddingRow(
node_id=item.node_id,
text=item.text,
metadata=item.metadata_,
similarity=sim,
)
for item, sim in res.all()
]
Due to this logic, the first returned node will have the lowest score because it uses the Euclidean distance as the similarity score. This is not expected, because when we query the index, we expect to get the node with the highest score first.
Answer generated by a š¤
Hello,
The PGVectorStore._query_with_score()
function uses the Euclidean distance (L2 distance) as the similarity score because it is a common measure of similarity in vector space models. In this context, the Euclidean distance between two vectors (embeddings) represents how similar they are. The smaller the distance, the more similar the vectors are.
The function returns the nodes in ascending order of their similarity scores (Euclidean distances). This is because in the context of Euclidean distances, a smaller distance means a higher similarity. Therefore, the first returned node has the lowest score (smallest distance), which means it is the most similar to the query.
If you want to return the node with the highest score first, you can modify the _build_query
function to order the results in descending order of their similarity scores. However, this would mean that the node with the largest Euclidean distance (least similar) would be returned first. If you want to maintain the concept of returning the most similar node first but with a higher score, you could use a different similarity measure that increases with similarity, such as cosine similarity.
Here's how you can modify the _build_query
function to order by descending similarity:
def _build_query(
self,
embedding: Optional[List[float]],
limit: int = 10,
metadata_filters: Optional[MetadataFilters] = None,
) -> Any:
import sqlalchemy
from sqlalchemy import select
stmt = select( # type: ignore
self.table_class, self.table_class.embedding.l2_distance(embedding)
).order_by(self.table_class.embedding.l2_distance(embedding).desc()) # Order by descending similarity
if metadata_filters:
for filter_ in metadata_filters.filters:
bind_parameter = f"value_{filter_.key}"
stmt = stmt.where( # type: ignore
sqlalchemy.text(f"metadata_->>'{filter_.key}' = :{bind_parameter}")
)
stmt = stmt.params( # type: ignore
**{bind_parameter: str(filter_.value)}
)
return stmt.limit(limit) # type: ignore
Please note that this change will affect the logic of your application and may not be what you want if you still want the most similar nodes to be returned first.
Let me know if you have any other questions or need further clarification.
Best, Dosu
This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a š if this is helpful and š if it is irrelevant.
@rendyfebry hmmm, just tried out a few queries.
I don't think it's as simple as it seems. Most queries seem to return a "similarity" of around 0.5-0.6. Like copy/pasting an exact sentence and querying with that returns 0.5
Are there any details you could find about how l2 is actually calculated? We just need to normalize it to a scale of 0-1
@rendyfebry this seems to work better? But the similarity is now inverted but makes more sense
stmt = select( # type: ignore
self.table_class, self.table_class.embedding.cosine_distance(embedding)
).order_by(self.table_class.embedding.cosine_distance(embedding))
Thanks' @logan-markewich
I'm not sure whether to use the Euclidean Distance
or Cosine Distance
, from a quick Google search, both seem to work.
But the problem is actually on the scoring scale, since both are distance indicators, then the scale is still from 0 to 1, where a smaller number indicates better similarity. So we still need to fix the score. (as we expect a higher score indicate better similarity)
similarity = ( 1 - distance )
I see your PR already covers that, which is good. And turns out the same logic actually used on other VectorStore also:
Anyway, I made a small PR to improve the clarity: https://github.com/jerryjliu/llama_index/pull/7226
Question Validation
Question
score
onNodeWithScore
indicates the similarity between the node content and the query. The score itself is in the range from 0 to 1, the higher the number, the more similar they are. (This is only my assumption based on this snippet, please CMIIW)filter nodes below 0.75 similarity score
processor = SimilarityPostprocessor(similarity_cutoff=0.75) filtered_nodes = processor.postprocess_nodes(nodes)
stmt = select( # type: ignore self.table_class, self.table_class.embedding.l2_distance(embedding) ).order_by(self.table_class.embedding.l2_distance(embedding))