lior-k / fast-elasticsearch-vector-scoring

Score documents using embedding-vectors dot-product or cosine-similarity with ES Lucene engine
Apache License 2.0

Incorrect scores for cosine similarity #40

Closed sully90 closed 4 years ago

sully90 commented 4 years ago

Hi there, I'm trying to use the plugin with Elasticsearch 6.8.1, but I'm getting strange document scores when running cosine similarity queries.

I have the following query:

{
    "query": {
        "function_score": {
            "boost_mode": "replace",
            "script_score": {
                "script": {
                    "source": "binary_vector_score",
                    "lang": "knn",
                    "params": {
                        "cosine": true,
                        "field": "embedding_vector",
                        "encoded_vector": "PVC0BjxrQug8VcbbvRqFmrz5KQy9aQ2ku7g0eL1AxVo+KQK6vGHZJz56bne9YwQNvoFb7r21yio8TIP4PZH8b71ksh+8NetEvh/vhL3+gL89pUxfPfdQ8D3j8j47/BHwvio1dL62+ak5LioAvgslND2Hy9O+BvLdvAepBD3P1fW+MjrwvaWz+D1mjWw9vSibvUqxBL09jYM+Nx6FPfTthr4safw9LmFMvT+ZCD5mid0+M9wxPUQ3VD2K8kK+Lrv2PdsAdr6FzIg9pDl3Pd+3/j2C0Q67Vw6gPiMK1b4V43o7RwUAPd8Q9b6ZqnE9wArBPY9d0rx5qUS9xDpQvdPjKj4BZSc9rvZMvb3B373G6eg+JhPOvjg0o72BcnU9/Up6vYwqzr5VTkK+oOwOPUIGXD6mofU9/95Kvkt2/DvSv0i8o6poPKxrKD3gzTA7JUFAvZRJdjnaugC9DVqJPQkg1D4VoQW9JIsQvYnK0D5oMQe9XsByPVlXdj1Hoxk866z4vgtqyjyUp/y+DROkuwS1QD3OKlS+GaOBPPq92D1Xq8i+WVwqPgsmPL2CFY2+AuraPV35uL1VjGC9+bikPKydSD5Rk+K+oYchPoe5ez5bYWE9wBhcPgm1GT1L43A8mKqIPA+NULw6xgC+T0Hova0b1jxpyYy8/RdsvKX5OLycdjA="
                    }
                }
            }
        }
    },
    "size": 100
}

Which returns the following hit:

{
    "_index": "test_index",
    "_type": "_doc",
    "_id": "wJUwGnABOtCXXTPakSP3",
    "_score": 0.94501674,
    "_source": {
        ...
        "embedding_vector": "vRfgbD01DzG9RhYYvVYFdb0z/i69n1VGvEj+Qr3nw3g+PAt3vZZsHz3uAA48eLY0vmkUzb0Lrmy9U2W4PGLMdb4gdSi9lorsvWP/rL4hyg09s1xEPQQPfj2RYjI9FC7avgOK8r6EPzK95e7IvilbiDzxH3S9xA2EPYD4Sj1rkVG+VO7qvZL7Lr0JLYY8wf2yPMpmxDzN44A9+Is6O572hr29T7Q8mEXYPM7/xD6DwzU+SUVZPbrdizznYfG9q4yiPcuGaL5BUDo9PwzWPeRaNz2BUIw85y1sPX/2dL5SsfI82nFgPRG17L5Pi+Y92SyIPXEezr0gcn69L2QpPNajKD5FR3k9FlwLveHtVL2vyNY+E5KOvl6xXztcz4A9qd25vfodoL4y2a6+jZO8PRmFhD7Uxx4+B+lGvZVW4z2MvMC+GtDwPSvnhD2ADPS6/DgAvcmDiTx8j4C982oePbJ9gD4EQj+85ABIveBOKD4nGse9Ik2oPQCXojzcDsw9x9oeviKutbwopsi982++u5f1QD1JZIy97Ul5vBff4D2hK7O+F1woPgWhF7zE2+y9hXfCvbxKDzyM2Wi+AF2gO7XkYD4eYD6+jTnoPmmgqz5GX3477NuIPkzHJDw1ZzA9dUM6vT2p+rzXI4i+AppVvcY047w2xCg9AflKPaailDydrYg="
    }
}

The plugin returns a score of 0.94501674. However, if you decode both vectors (using the Python code provided in the README) and compute the cosine similarity with the function below, the expected value is 0.8900337068367593:

import numpy as np
from numpy.linalg import norm

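def cosine_sim(vec1, vec2):
    """
    Compute the cosine similarity between two vectors.
    :param vec1: first vector (1-D array-like of floats)
    :param vec2: second vector, same length as vec1
    :return: cosine similarity, a value in [-1, 1]
    """
    # Dot product divided by the product of the Euclidean norms.
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))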
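
For anyone reproducing the comparison, here is a minimal sketch of the decode step. It assumes the vectors are stored as base64-encoded big-endian float32 values, which is what the leading bytes of the query vector suggest. The README's own helper is the authoritative version. The names `decode_vector` and `encode_vector` are hypothetical, and `cosine_sim` is repeated only so the block runs on its own.

```python
import base64

import numpy as np

# Assumption: the plugin stores vectors as big-endian 32-bit floats.
BIG_ENDIAN_FLOAT32 = np.dtype(">f4")


def decode_vector(encoded):
    """Decode a base64 string into a float64 NumPy array."""
    raw = base64.b64decode(encoded)
    return np.frombuffer(raw, dtype=BIG_ENDIAN_FLOAT32).astype(np.float64)


def encode_vector(values):
    """Encode a sequence of floats as a base64 string (inverse of decode_vector)."""
    raw = np.asarray(values, dtype=BIG_ENDIAN_FLOAT32).tobytes()
    return base64.b64encode(raw).decode("ascii")


def cosine_sim(vec1, vec2):
    """Cosine similarity: dot product over the product of Euclidean norms."""
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))


# Toy vectors, round-tripped through the encoding; the cosine is close to 0.5.
a = decode_vector(encode_vector([1.0, 0.0, 1.0]))
b = decode_vector(encode_vector([1.0, 1.0, 0.0]))
print(cosine_sim(a, b))
```

Passing the `encoded_vector` from the query and the `embedding_vector` from the hit through `decode_vector` and then `cosine_sim` is how the expected value would be checked.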

I've tried the same with Elasticsearch 7.5.0 (and the matching version of the plugin) and get the same result. I'm using the following Dockerfile to build and install the plugin and run Elasticsearch:

FROM maven:3.5-jdk-8-alpine AS build
COPY fast-elasticsearch-vector-scoring /opt/fast-elasticsearch-vector-scoring
RUN cd /opt/fast-elasticsearch-vector-scoring && mvn package

FROM elasticsearch:6.8.1
COPY --from=build /opt/fast-elasticsearch-vector-scoring/target/releases/elasticsearch-binary-vector-scoring-6.8.1.zip /plugins/elasticsearch-binary-vector-scoring-6.8.1.zip

# Set development mode ENV variables
ENV xpack.security.enabled=false
ENV discovery.type=single-node

# Install the plugin
RUN /usr/share/elasticsearch/bin/elasticsearch-plugin install file:///plugins/elasticsearch-binary-vector-scoring-6.8.1.zip

Any ideas why the score is inconsistent with the calculated value? Any help is greatly appreciated!

lior-k commented 4 years ago

This is by design. See why in the README file:

sully90 commented 4 years ago

Ah, I was sure the README initially said this applied only to Elasticsearch 7, so I tried 6.8.1. Thanks for clearing this up :)
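
The README passage is not quoted in this thread, but the two numbers reported above are consistent with the plugin shifting cosine similarity from [-1, 1] into [0, 1] via (cos + 1) / 2, which would keep scores non-negative. The sketch below only checks that arithmetic. It is inferred from the reported values, not taken from the plugin's source, and `to_non_negative_score` is a hypothetical helper name.

```python
def to_non_negative_score(cosine):
    """Assumed mapping from cosine similarity in [-1, 1] to a score in [0, 1]."""
    return (cosine + 1.0) / 2.0


expected_cosine = 0.8900337068367593  # value computed with NumPy in the report
plugin_score = 0.94501674             # _score returned by Elasticsearch

mapped = to_non_negative_score(expected_cosine)
print(mapped)
# The two values agree to about 1e-7; the remainder is presumably float32 rounding.
print(abs(mapped - plugin_score) < 1e-6)
```

If the mapping holds, the original cosine can be recovered from a returned score as `2 * score - 1`.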