lior-k / fast-elasticsearch-vector-scoring

Score documents using embedding-vectors dot-product or cosine-similarity with ES Lucene engine
Apache License 2.0
395 stars 112 forks

Cosine Similarities are not proper #18

Closed osmantamer closed 5 years ago

osmantamer commented 5 years ago

Hello,

I was trying to measure the cosine similarity between vectors with dimensions (1, 128). Here is the query.

{
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "min_score": 0,
      "script_score": {
        "script": {
          "source": "binary_vector_score",
          "lang": "painless",
          "params": {
            "cosine": true,
            "field": "embedding_vector",
            "vector": [

            ]
          }
        }
      }
    }
  },
  "size": 170
}

But the results are meaningless. Even though two vectors are not similar, their score is over 0.9. The similarity scores decrease only gradually.

durakkerem commented 5 years ago

I am also having the same problem. The scores do not reflect the actual cosine similarities.

durakkerem commented 5 years ago

Did you resolve the problem?

osmantamer commented 5 years ago

> Did you resolve the problem?

Unfortunately not.

lior-k commented 5 years ago

Hi guys, I'm sorry for the late response. Cosine scoring works fine for us. I added a test for the scores. Feel free to change it, test it, and submit it to me if it fails.

@osmantamer Note that the query you pasted is wrong: it should state "lang": "knn", as in the README:

{
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "script_score": {
        "script": {
          "inline": "binary_vector_score",
          "lang": "knn",
          "params": {
            "cosine": false,
            "field": "embedding_vector",
            "vector": [
               -0.09217305481433868, 0.010635560378432274, -0.02878434956073761, 0.06988169997930527, 0.1273992955684662, -0.023723633959889412, 0.05490724742412567, -0.12124507874250412, -0.023694118484854698, 0.014595639891922474, 0.1471538096666336, 0.044936809688806534, -0.02795785665512085, -0.05665992572903633, -0.2441125512123108, 0.2755320072174072, 0.11451690644025803, 0.20242854952812195, -0.1387604922056198, 0.05219579488039017, 0.1145530641078949, 0.09967200458049774, 0.2161576747894287, 0.06157230958342552, 0.10350126028060913, 0.20387393236160278, 0.1367097795009613, 0.02070528082549572, 0.19238869845867157, 0.059613026678562164, 0.014012521132826805, 0.16701748967170715, 0.04985826835036278, -0.10990987718105316, -0.12032567709684372, -0.1450948715209961, 0.13585780560970306, 0.037511035799980164, 0.04251480475068092, 0.10693439096212387, -0.08861573040485382, -0.07457160204648972, 0.0549330934882164, 0.19136285781860352, 0.03346432000398636, -0.03652812913060188, -0.1902569830417633, 0.03250952064990997, -0.3061246871948242, 0.05219300463795662, -0.07879918068647385, 0.1403723508119583, -0.08893408626317978, -0.24330253899097443, -0.07105310261249542, -0.18161986768245697, 0.15501035749912262, -0.216160386800766, -0.06377710402011871, -0.07671763002872467, 0.05360138416290283, -0.052845533937215805, -0.02905619889497757, 0.08279753476381302
             ]
          }
        }
      }
    }
  },
  "size": 100
}
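
For anyone building this request from a script, here is a minimal Python sketch that assembles the same body as the README query above. The helper name `build_vector_score_query` and its defaults are hypothetical and are not part of the plugin; only the JSON keys and the `"knn"` / `"binary_vector_score"` values are taken from the query shown in this thread.

```python
import json


def build_vector_score_query(vector, field="embedding_vector", cosine=True, size=100):
    """Hypothetical helper: build the request body shown in the README query.

    Only the dictionary layout comes from this thread; the function name,
    argument names, and defaults are assumptions made for illustration.
    """
    return {
        "query": {
            "function_score": {
                "boost_mode": "replace",
                "script_score": {
                    "script": {
                        "inline": "binary_vector_score",
                        # "knn" is the script language the README query uses
                        "lang": "knn",
                        "params": {
                            # True asks for cosine similarity, False for dot product
                            "cosine": cosine,
                            "field": field,
                            "vector": [float(x) for x in vector],
                        },
                    }
                },
            }
        },
        "size": size,
    }


# Tiny placeholder vector; a real query vector must match the indexed dimension.
body = build_vector_score_query([0.1, -0.2, 0.3], cosine=True)
print(json.dumps(body, indent=2))
```

Keeping `lang` fixed inside the helper makes it harder to send the query with `"painless"` by accident, which was the problem in the query pasted earlier.
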
pfeiffer commented 5 years ago

I'm also having trouble with the scoring and scoring does not seem to reflect the distance.

Did any of you manage to solve it?

I'd love to contribute with a failing test case - how can I run the tests? :-)

osmantamer commented 5 years ago

> I'm also having trouble with the scoring and scoring does not seem to reflect the distance.
>
> Did any of you manage to solve it?

I realized that it was my mistake. In Python, the compare_faces function of the face_recognition library is based on the Euclidean distance between two faces (a smaller value means more similar faces), whereas this plug-in returns the cosine similarity (a higher value means more similar faces). I was comparing the two as if they were on the same scale.
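
To make the difference concrete, here is a small, self-contained Python sketch comparing the two measures on toy 2-D vectors. The vectors and function names are made up for illustration; they are not taken from the plug-in or from face_recognition. For unit-length vectors the two measures are related by d² = 2 − 2·cos, so they rank neighbours the same way but in opposite directions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between a and b; lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy unit-length vectors (illustrative only).
query = [1.0, 0.0]
near = [0.8, 0.6]
far = [0.0, 1.0]

for name, vec in (("near", near), ("far", far)):
    cos = cosine_similarity(query, vec)
    dist = euclidean_distance(query, vec)
    # For unit vectors, dist**2 matches 2 - 2*cos up to rounding error.
    print(f"{name}: cosine={cos:.3f} distance={dist:.3f} 2-2cos={2 - 2 * cos:.3f}")
```

The closer vector has the higher cosine similarity and the lower Euclidean distance, so a score above 0.9 from one measure cannot be read against a threshold tuned for the other.
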

lior-k commented 5 years ago

Glad to hear it. Closing this.