opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.
https://opendistro.github.io/
Apache License 2.0

Postfilter vs prefilter for knn search #263

Closed · sujithjoseph closed 3 years ago

sujithjoseph commented 3 years ago

I'm looking to perform a pre-filter for a k-NN search. Does the query below do a pre-filter, or is it a post-filter?


      "bool" : {
        "filter" : {
              "terms" : {
                "category" : [
                  "a",
                  "b",
                  "c",
                  "d"
                ]
              }
            },
        "must" : {
          "knn" : {
            "doc" : {
              "k" : 10000,
              "vector" : [
                -7.321715E-4....]
                }
              }
              }
              }
            }```

Or does it have to be an explicit script-score function like the one below to get a pre-filter for a cosinesimil k-NN search?

```
{
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "terms": {
              "category": ["a", "b", "c", "d"]
            }
          }
        }
      },
      "script": {
        "source": "knn_score",
        "lang": "knn",
        "params": {
          "field": "doc",
          "query_value": [-7.321715E-4, ...],
          "space_type": "cosinesimil"
        }
      }
    }
  }
}
```

Also, how do I sum or multiply BM25F scores from match queries with the normalized k-NN score (cosinesimil) in this setup?

vamshin commented 3 years ago

Hi @sujithjoseph,

The above acts like a post-filter. For a pre-filter, you can use the custom scoring approach; more details in this doc: https://github.com/opendistro/for-elasticsearch-docs/blob/master/docs/knn/index.md#custom-scoring. You need ODFE 1.11.0.0 to use this functionality.

Note: if you only want to use k-NN's custom scoring, you can omit "index.knn": true.
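
For reference, a minimal mapping sketch for such a custom-scoring-only index (the index name, field name, and dimension below are placeholders, not from this thread; "index.knn" is simply left out, so no HNSW graphs are built):

```
PUT /my-index
{
  "mappings": {
    "properties": {
      "doc": {
        "type": "knn_vector",
        "dimension": 512
      }
    }
  }
}
```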

sujithjoseph commented 3 years ago

Thanks @vamshin. Why do you recommend the pre-filter option only for filtering results down to about 20K? Beyond that, are there performance issues? I am assuming the pre-filter always performs better than the post-filter option.

I have another question as well. Since you are storing vectors, I am assuming that the memory mapping of Lucene indices will also cache the vectors in memory, over and above the HNSW graph. Is that correct? And if I omit "index.knn": true, I am assuming I wouldn't be able to run queries like "knn" : { "doc" : { "k" : 10000, "vector" : [ -7.321715E-4....] } }.

sujithjoseph commented 3 years ago

This link (https://github.com/opendistro/for-elasticsearch-docs/tree/master/docs/elasticsearch#primary-and-replica-shards) wasn't working. I am assuming 25 GB per shard as the best practice. Is that valid for k-NN search as well?

sujithjoseph commented 3 years ago

@vamshin, how can I combine the BM25F and k-NN scores? Will the approach below work?


"query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "query": {
              "knn": {
                  "my_vector": {
                  "vector": [7, 8],
                  "k": 2
                  }       
              }
          },
            "weight": 0.5
          }
        },
        {
          "function_score": {
            "query": {
             <match Query>
            },
            "weight": 0.75
          }
        }
      ]```
    } , If it does work, how can i combine this approach with pre-filters as well .
vamshin commented 3 years ago

> Why do you recommend the pre-filter option only for filtering results down to about 20K? Beyond that, are there performance issues? I am assuming the pre-filter always performs better than the post-filter option.

In the pre-filter/custom-scoring option, we do a brute-force k-NN, so the more documents there are, the higher the latency. If the filtered set of documents is expected to be large, you could use more shards so that the brute-force scoring runs in parallel on each shard; this should compensate for the latency.
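
As a sketch of that, shard count is fixed at index creation time; the index name and the count of 8 here are illustrative, not a recommendation:

```
PUT /my-knn-index
{
  "settings": {
    "index": {
      "number_of_shards": 8,
      "number_of_replicas": 1
    }
  }
}
```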

> Since you are storing vectors, I am assuming that the memory mapping of Lucene indices will also cache the vectors in memory, over and above the HNSW graph. Is that correct? And if I omit "index.knn": true, I am assuming I wouldn't be able to run queries like "knn" : { "doc" : { "k" : 10000, "vector" : [ -7.321715E-4....] } }.

That's right. Lucene stores these vectors as binary doc values. If you omit "index.knn": true, then HNSW graphs are not constructed during indexing, so you lose the ability to use the "knn" query clause, which is approximate k-NN.

The general philosophy here: if the use case is to run k-NN on a static set of a large number of documents, then approximate k-NN (the HNSW-graph-based approach) is preferred. If the use case is to run k-NN dynamically on a filtered set of documents, then the custom scoring approach is preferred.
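
For contrast with the custom-scoring-only mapping above, a minimal sketch of an approximate k-NN setup (names and dimension are again placeholders), which builds HNSW graphs at index time and so enables the "knn" query clause:

```
PUT /my-aknn-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "doc": {
        "type": "knn_vector",
        "dimension": 512
      }
    }
  }
}
```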

vamshin commented 3 years ago

> This link (https://github.com/opendistro/for-elasticsearch-docs/tree/master/docs/elasticsearch#primary-and-replica-shards) wasn't working. I am assuming 25 GB per shard as the best practice. Is that valid for k-NN search as well?

Correct, that's valid. k-NN indices should be treated like any other Elasticsearch indices, so all the usual best practices apply to k-NN indices as well.
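
As a worked example of that rule of thumb, a ~500 GB index would suggest on the order of 20 primary shards at ~25 GB each. The current store size can be checked with the _cat API (index name is a placeholder):

```
GET /_cat/indices/my-knn-index?v&h=index,pri,store.size
```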

sujithjoseph commented 3 years ago

Thanks @vamshin. I was also looking for a way to evict certain graphs (of certain indices) from memory after caching them with the warm-up API, without restarting the nodes. Is there any other way to do this?

vamshin commented 3 years ago

Hi @sujithjoseph,

Unfortunately, we do not have an API to remove the graphs of an index. There is a task to fix this, and we will prioritize it in future releases: https://github.com/opendistro-for-elasticsearch/k-NN/issues/68

A couple of workarounds:

  1. When you delete an index, the graphs for that index are evicted from the cache.
  2. If you would like to evict all graphs without a restart, you can toggle a cluster setting, for example "knn.memory.circuit_breaker.limit". By default this value is 50%; changing it (say, to 51%) triggers cache eviction.

```
PUT /_cluster/settings
{
  "persistent": {
    "knn.memory.circuit_breaker.limit": "55%"
  }
}
```
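
To confirm the eviction took effect, the k-NN plugin's stats API can be polled before and after the setting change; a sketch, assuming fields such as graph_memory_usage in the response:

```
GET /_opendistro/_knn/stats
```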

sujithjoseph commented 3 years ago

> General philosophy here: if the use case is to run k-NN on a static set of a large number of documents, then approximate k-NN is preferred. If the use case is to run k-NN dynamically on a filtered set of documents, then the custom scoring approach is preferred.

@vamshin, with this custom scoring, I am assuming we can use it to re-rank results with a k-NN match on embeddings for just the top 50 results from Elasticsearch (https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-rescore.html), e.g.:


```
POST /_search
{
  "query": {
    "match": {
      "message": {
        "operator": "or",
        "query": "the quick brown"
      }
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "script_score": {
          "query": {
            "match_all": {}
          },
          "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
              "field": "my_vector2",
              "query_value": [2, 3, 5, 6],
              "space_type": "cosinesimil"
            }
          }
        }
      },
      "query_weight": 0.7,
      "rescore_query_weight": 1.2
    }
  }
}
```