Closed by pommedeterresautee 3 years ago
Hi @pommedeterresautee,
1. "the number of vectors per document is limited by the mapping"
This can be worked around with dynamic templates, which let you define fields of type knn_vector dynamically.
Example: to declare every field whose name begins with vsearch as a knn_vector with 2 dimensions, create the index with a dynamic template like this:
curl -X PUT "localhost:9200/myindex" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "test_template": {
          "path_match": "vsearch*",
          "mapping": {
            "dimension": 2,
            "type": "knn_vector"
          }
        }
      }
    ]
  }
}
'
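If the prefix or dimension varies between indices, the same body can be generated rather than typed by hand. The following is a minimal Python sketch, not part of the original answer: the helper name `build_knn_index_body` is hypothetical, and it only builds the JSON body used in the curl command without sending anything to the cluster.

```python
import json


def build_knn_index_body(prefix, dimension, shards=1, replicas=0):
    """Build an index-creation body whose dynamic template maps every field
    path starting with `prefix` to a knn_vector of the given dimension.

    Hypothetical helper: it mirrors the curl body in this thread and does
    not contact the cluster.
    """
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
            "index": {"knn": True},
        },
        "mappings": {
            "dynamic_templates": [
                {
                    "test_template": {
                        "path_match": prefix + "*",
                        "mapping": {"dimension": dimension, "type": "knn_vector"},
                    }
                }
            ]
        },
    }


body = build_knn_index_body("vsearch", 2)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to the PUT request as the `-d` payload.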
2. "we can't perform a search across all vector fields in a single operation"
A workaround is to query several knn fields in one request and combine the results.
Assume two fields, my_dense_vector1 with 2 dimensions and my_dense_vector2 with 3 dimensions. You can weight the score of each field separately through the weight parameter:
curl -X POST "localhost:9200/my_dense_index/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        {
          "function_score": {
            "query": {
              "knn": {
                "my_dense_vector1": {
                  "vector": [0, 0],
                  "k": 1
                }
              }
            },
            "weight": 1
          }
        },
        {
          "function_score": {
            "query": {
              "knn": {
                "my_dense_vector2": {
                  "vector": [0, 0, 0],
                  "k": 1
                }
              }
            },
            "weight": 1
          }
        }
      ]
    }
  }
}
'
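With many vector fields, writing one function_score clause per field by hand gets tedious. Here is a minimal Python sketch that generates the same query shape. The helper name `build_multi_knn_query` and its argument layout are assumptions made for illustration; the function only builds the request body and does not run the search.

```python
import json


def build_multi_knn_query(field_queries, k=1):
    """Build a bool/should query with one weighted knn clause per field.

    `field_queries` maps a field name to a (vector, weight) pair.
    Hypothetical helper: it reproduces the query shape from this thread.
    """
    should = []
    for field, (vector, weight) in field_queries.items():
        should.append({
            "function_score": {
                "query": {"knn": {field: {"vector": list(vector), "k": k}}},
                "weight": weight,
            }
        })
    return {"query": {"bool": {"should": should}}}


query = build_multi_knn_query({
    "my_dense_vector1": ([0, 0], 1),
    "my_dense_vector2": ([0, 0, 0], 1),
})
print(json.dumps(query, indent=2))
```

Since a bool query adds up the scores of its matching should clauses, a document returned by both clauses should end up with the weighted sum of its per-field scores.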
@pommedeterresautee I find myself in exactly the same situation. Which approach did you take in the end? Have you had a chance to measure the performance cost of searching over multiple vector fields? Thanks.
Transformer models are typically limited to 512 input tokens, but they can provide higher-quality embeddings for semantic search than classical word embeddings. For long documents (over 512 tokens), the usual approach is to split them into blocks of fewer than 512 tokens and work at the level of a single block.
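The splitting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document is already tokenized into a list, the function name `split_into_blocks` is hypothetical, and the default of 511 tokens per block simply reflects "fewer than 512". A real tokenizer may also reserve room for special tokens, which would lower the usable block size.

```python
def split_into_blocks(tokens, max_tokens=511):
    """Split a token list into consecutive blocks of at most `max_tokens`.

    Hypothetical helper: assumes `tokens` is the already-tokenized document.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Placeholder tokens standing in for the output of a real tokenizer.
tokens = ["tok%d" % i for i in range(1200)]
blocks = split_into_blocks(tokens)
print([len(block) for block in blocks])  # [511, 511, 178]
```

Each block would then be embedded separately, giving several vectors per document.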
My use case is to perform a semantic search across those long documents and find the most semantically related one.
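When the search runs at block level, the block hits have to be mapped back to their parent documents to find the most related one. The sketch below shows one possible way to do that in Python. It rests on assumptions that are not part of this issue: each hit is a (document id, score) pair, a higher score means a closer match, and a document is ranked by its best-scoring block. The function name `best_document` is hypothetical.

```python
def best_document(block_hits):
    """Return (doc_id, score) for the document whose best block scores highest.

    `block_hits` is an iterable of (doc_id, score) pairs, one per matching
    block. Ranking by the maximum block score is an assumption; other
    aggregations such as the mean are equally possible.
    """
    best = {}
    for doc_id, score in block_hits:
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    if not best:
        return None
    return max(best.items(), key=lambda item: item[1])


hits = [("doc_a", 0.42), ("doc_b", 0.77), ("doc_a", 0.91)]
print(best_document(hits))  # ('doc_a', 0.91)
```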
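In the current implementation of KNN in Open Distro, we can provide several vectors per document, but:
1. the number of vectors per document is limited by the mapping,
2. we can't perform a search across all vector fields in a single operation.
I have thought of 2 workarounds:
Is there another way to manage long documents?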