opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.
https://opendistro.github.io/
Apache License 2.0

having multiple vectors per document which can be searched in the same knn operation #221

Closed pommedeterresautee closed 3 years ago

pommedeterresautee commented 4 years ago

Transformer models are typically limited to 512 input tokens, but they can provide higher-quality embeddings for semantic search than classical word embeddings. For long documents (over 512 tokens), it is usual to split them into blocks of fewer than 512 tokens and work at the level of a single block.
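
As a rough illustration of the splitting step, here is a minimal Python sketch. The helper name `split_into_blocks` is hypothetical, whitespace splitting only stands in for the model's real tokenizer, and the budget of 510 tokens is an assumption that leaves room for the two special tokens BERT-style models usually add.

```python
from typing import List


def split_into_blocks(tokens: List[str], max_tokens: int = 510) -> List[List[str]]:
    """Split a token list into consecutive blocks of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Whitespace splitting is a stand-in for the model's own tokenizer (assumption).
document = "word " * 1200
blocks = split_into_blocks(document.split(), max_tokens=510)
print([len(block) for block in blocks])  # [510, 510, 180]
```

Each block would then be embedded separately, giving one vector per block and therefore several vectors per document.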

My use case is to perform a semantic search across those long documents and find the most semantically related one.

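In the current implementation of k-NN in Open Distro, we can provide several vectors per document, but:

1. the number of vectors per document is limited by the mapping,
2. we can't perform a search across all vector fields in a single operation.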

I have thought of two workarounds:

Is there another way to manage long documents?

vamshin commented 4 years ago

Hi @pommedeterresautee,

>>> 1. the number of vectors per document is limited by the mapping,

This can be achieved using dynamic templates. You can dynamically define fields of type knn_vector.

Example: to declare every field whose name begins with vsearch as type knn_vector with 2 dimensions, you could create the index with a dynamic template like this:

curl -X PUT "localhost:9200/myindex" -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "number_of_shards" :   1,
    "number_of_replicas" : 0,
    "index": {
        "knn": true
    }
  },
  "mappings": {
    "dynamic_templates": [
        {
          "test_template" : {
            "path_match" : "vsearch*",
            "mapping" : {
              "dimension" : 2,
              "type" : "knn_vector"
            }
          }
        }
    ]
  }
}
'
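
With such a template in place, each block embedding of a long document could be stored under its own vsearch-prefixed field. The following Python sketch only builds the document body; the helper name `build_document` and the 1-based field numbering are assumptions for illustration, not something the plugin requires.

```python
import json
from typing import Dict, List


def build_document(block_vectors: List[List[float]], prefix: str = "vsearch") -> Dict[str, List[float]]:
    """Map the i-th block embedding to the field '<prefix><i>' (1-based)."""
    return {f"{prefix}{i}": vector for i, vector in enumerate(block_vectors, start=1)}


# Three 2-dimensional block embeddings, matching the "dimension": 2 template above.
doc = build_document([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
print(json.dumps(doc))
```

The resulting JSON could be sent as the body of an ordinary index request to myindex; because every field name matches vsearch*, the dynamic template should map each one as knn_vector without an explicit mapping update.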

>>> 2. we can't perform a search across all vector fields in a single operation.

A workaround is to search across multiple knn fields in a single request and combine the results.

Assume the fields my_dense_vector1 with 2 dimensions and my_dense_vector2 with 3 dimensions. You can assign a weight to the score of each field:

curl -X POST "localhost:9200/my_dense_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "knn": {
                "my_dense_vector1": {
                  "vector": [0, 0],
                  "k": 1
                }
              }
            },
            "weight": 1
          }
        },
        {
          "function_score": {
            "query": {
              "knn": {
                "my_dense_vector2": {
                  "vector": [0, 0, 0],
                  "k": 1
                }
              }
            },
            "weight": 1
          }
        }
      ]
    }
  }
}
'
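
When a document has many vector fields, writing one should clause per field by hand gets tedious. Here is a minimal Python sketch that generates the same query shape for a list of fields. The helper name `build_multi_field_knn_query` and the vsearch1..vsearch3 field names are hypothetical, and the sketch assumes every field has the same dimension as the query vector (as in the long-document case, where all blocks come from one model).

```python
import json
from typing import Any, Dict, List


def build_multi_field_knn_query(
    field_names: List[str], vector: List[float], k: int = 1, weight: float = 1.0
) -> Dict[str, Any]:
    """Build a bool/should query with one weighted knn clause per vector field."""
    should = [
        {
            "function_score": {
                "query": {"knn": {name: {"vector": vector, "k": k}}},
                "weight": weight,
            }
        }
        for name in field_names
    ]
    return {"query": {"bool": {"should": should}}}


# Hypothetical field names produced by a vsearch* dynamic template.
fields = [f"vsearch{i}" for i in range(1, 4)]
body = build_multi_field_knn_query(fields, [0.0, 0.0])
print(json.dumps(body, indent=2))
```

Since a bool query adds up the scores of the should clauses that match, a document that appears in the top k for several fields would rank higher than one that matches a single field. Whether that summing behaviour suits a given use case is a design choice to check against real data.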

ezorita commented 3 years ago

@pommedeterresautee I find myself in the exact same situation. Which approach did you finally take? Have you had a chance to evaluate the performance loss of searching over multiple vector fields? Thanks.