having multiple vectors per document which can be searched in the same knn operation

pommedeterresautee commented 4 years ago

Many Transformer models (BERT-style encoders, for example) are limited to 512 input tokens, but they can provide higher-quality embeddings for semantic search than classical word embeddings. For long documents (over 512 tokens), it is usual to split them into blocks of fewer than 512 tokens and work at the level of a single block.
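
To make the splitting step concrete, here is a minimal sketch. It assumes the document is already tokenized into a list; `split_into_blocks` and `max_len` are hypothetical names, and a real pipeline may need a smaller block size to leave room for the model's special tokens.

```python
from typing import List, Sequence


def split_into_blocks(tokens: Sequence[str], max_len: int = 512) -> List[List[str]]:
    """Split a token sequence into consecutive blocks of at most max_len tokens.

    Hypothetical helper: the block size is an assumption and should match
    the limit of whichever embedding model is actually used.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    # Step through the sequence in strides of max_len; the last block may be shorter.
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


if __name__ == "__main__":
    tokens = [f"tok{i}" for i in range(1100)]
    blocks = split_into_blocks(tokens)
    print([len(block) for block in blocks])
```

Each block is then embedded separately, which is why a single document ends up with several vectors.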

My use case is to perform a semantic search across those long documents and find the most semantically related one.

In the current k-NN implementation in Open Distro, we can provide several vectors per document, but:

1. the number of vectors per document is limited by the mapping,
2. we can't perform a search across all vector fields in a single operation.

I have thought of two workarounds:

1. Split the documents and index each block, with its vector, as a separate document. This makes everything much more complex to maintain (it forces us to maintain a second index holding the original full documents), and simplicity was the very reason to try Open Distro rather than building an nmslib/FAISS index outside of Elasticsearch.
2. Declare 10 or more KNN vector fields per document in the mapping and populate only the fields that are needed (most of the time, most of the fields will stay empty). At search time, launch 10 searches, one per field, retrieve the top k docs per search, concatenate the results, sort by cosine similarity, and keep the top k. Here the main issue is performance: it may be quite slow. Again, in this case it seems better to manage a vector index outside of Elasticsearch.
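
The client-side merge in the second workaround could look roughly like the sketch below. It assumes each per-field search has already returned a list of `(doc_id, score)` pairs where a higher score means a closer match; `merge_top_k` is a hypothetical helper, and keeping only the best score per document is an added assumption so that a document matched through several fields appears once.

```python
from typing import Dict, Iterable, List, Tuple

# (document id, similarity score); a higher score is assumed to be better.
Hit = Tuple[str, float]


def merge_top_k(per_field_hits: Iterable[Iterable[Hit]], k: int) -> List[Hit]:
    """Merge the hit lists of several per-field searches into one top-k list."""
    best: Dict[str, float] = {}
    for hits in per_field_hits:
        for doc_id, score in hits:
            # Keep the best score seen for each document across all fields.
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Sort by score, highest first, and truncate to k results.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    field_1 = [("doc_a", 0.9), ("doc_b", 0.5)]
    field_2 = [("doc_b", 0.8), ("doc_c", 0.7)]
    print(merge_top_k([field_1, field_2], k=2))
```

The merge itself is cheap; the cost that matters is the number of separate search requests issued per query.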

Is there another way to manage long documents?

vamshin commented 4 years ago

Hi @pommedeterresautee,

>>> `1. the number of vectors per document is limited by the mapping,`

This can be addressed with dynamic templates: you can dynamically define fields of type knn_vector.

Example: to declare every field whose name begins with vsearch as type knn_vector with 2 dimensions, you could create the index with a dynamic template like this:

curl -X PUT "localhost:9200/myindex" -H 'Content-Type: application/json' -d'
{
  "settings" : {
    "number_of_shards" :   1,
    "number_of_replicas" : 0,
    "index": {
        "knn": true
    }
  },
  "mappings": {
    "dynamic_templates": [
        {
          "test_template" : {
            "path_match" : "vsearch*",
            "mapping" : {
              "dimension" : 2,
              "type" : "knn_vector"
            }
          }
        }
    ]
  }
}
'
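
With such a template in place, a document can carry as many block vectors as it needs. The sketch below only builds the document body; `build_document` and the `text` field are hypothetical, and the `vsearch` prefix is taken from the template above. Sending the body to the index (for example with curl or a client library) is left out.

```python
import json
from typing import Any, Dict, Sequence


def build_document(text: str,
                   block_vectors: Sequence[Sequence[float]],
                   prefix: str = "vsearch") -> Dict[str, Any]:
    """Build a document body with one vector field per block.

    Field names are prefix + 1-based index (vsearch1, vsearch2, ...), so they
    match a dynamic template whose path_match is "vsearch*".
    The "text" field name is an assumption made for this example.
    """
    doc: Dict[str, Any] = {"text": text}
    for index, vector in enumerate(block_vectors, start=1):
        doc[f"{prefix}{index}"] = list(vector)
    return doc


if __name__ == "__main__":
    body = build_document("a long document", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
    print(json.dumps(body, indent=2))
```

Every vector must have the dimension declared in the template (2 in the example above).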

>>> `2. we can't perform a search across all vector fields in a single operation.`

As a workaround, you can search across multiple knn fields in a single request and combine the results.

Assume the fields my_dense_vector1 (2 dimensions) and my_dense_vector2 (3 dimensions). You can set a weight for the scores of each field:

curl -X POST "localhost:9200/my_dense_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "knn": {
                "my_dense_vector1": {
                  "vector": [0, 0],
                  "k": 1
                }
              }
            },
            "weight": 1
          }
        },
        {
          "function_score": {
            "query": {
              "knn": {
                "my_dense_vector2": {
                  "vector": [0, 0, 0],
                  "k": 1
                }
              }
            },
            "weight": 1
          }
        }
      ]
    }
  }
}
'
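
When the number of vector fields is large, the same request body can be generated rather than written by hand. This is a minimal sketch that mirrors the structure of the query above; `build_multi_field_knn_query` is a hypothetical helper, and it assumes all fields share one dimension so that a single query vector can be reused for each of them.

```python
import json
from typing import Any, Dict, Sequence


def build_multi_field_knn_query(field_names: Sequence[str],
                                query_vector: Sequence[float],
                                k: int,
                                weight: float = 1.0) -> Dict[str, Any]:
    """Build a bool/should query with one weighted knn clause per vector field."""
    should = [
        {
            "function_score": {
                "query": {
                    "knn": {
                        field: {"vector": list(query_vector), "k": k}
                    }
                },
                "weight": weight,
            }
        }
        for field in field_names
    ]
    return {"query": {"bool": {"should": should}}}


if __name__ == "__main__":
    # Field names follow the vsearch* naming used with the dynamic template.
    fields = [f"vsearch{i}" for i in range(1, 4)]
    print(json.dumps(build_multi_field_knn_query(fields, [0, 0], k=1), indent=2))
```

Because a bool query adds up the scores of its matching should clauses, a document that matches through several fields receives a combined score rather than the score of its single best block, which may or may not be the ranking you want.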

ezorita commented 3 years ago

@pommedeterresautee I find myself in exactly the same situation. Which approach did you take in the end? Have you had a chance to evaluate the performance loss of searching over multiple vector fields? Thanks.
