opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.
https://opendistro.github.io/
Apache License 2.0

High performance impact of data retrieval #338

Closed smodlich closed 3 years ago

smodlich commented 3 years ago

Hi, I'm using Open Distro for Elasticsearch (ODFE) 1.12.0 with ES 7.10.0 and have a question about performance benchmarking. I have a cosine similarity index with 4 million documents and 100-dimensional vectors (approximate search). The parameters are M=48, ef_search=1024, ef_construction=1024. I followed all the performance tuning advice: merge to one segment, retrieve no fields in the query, and warm up the index. With this I'm getting around 100 ms per query for k=10,000. I think this is quite good, and it is more than 40x faster than exact search.
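For readers who want to reproduce this setup, here is a rough sketch of the index definition and the two tuning requests described above. It is not taken from the original post: the index name `my-knn-index` is hypothetical, and the setting names (`index.knn.space_type`, `index.knn.algo_param.*`) and the warmup path are assumed from the ODFE k-NN plugin documentation for this version, so check them against your release. The code only builds the request bodies and paths; it does not contact a cluster.

```python
import json

INDEX = "my-knn-index"  # hypothetical index name, not from the original post


def knn_index_body(dimension=100, m=48, ef_construction=1024, ef_search=1024):
    """Build index settings and mapping for an approximate cosine k-NN field.

    Setting names are assumed from the ODFE k-NN plugin docs.
    """
    return {
        "settings": {
            "index": {
                "knn": True,
                "knn.space_type": "cosinesimil",
                "knn.algo_param.m": m,
                "knn.algo_param.ef_construction": ef_construction,
                "knn.algo_param.ef_search": ef_search,
            }
        },
        "mappings": {
            "properties": {
                "sem_vector": {"type": "knn_vector", "dimension": dimension}
            }
        },
    }


def tuning_requests(index=INDEX):
    """Return (method, path) pairs: merge to one segment, then warm up."""
    return [
        ("POST", f"/{index}/_forcemerge?max_num_segments=1"),
        ("GET", f"/_opendistro/_knn/warmup/{index}"),  # assumed warmup endpoint
    ]


if __name__ == "__main__":
    print(json.dumps(knn_index_body(), indent=2))
    for method, path in tuning_requests():
        print(method, path)
```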

Fast query:

{
  "stored_fields": "_none_",
  "docvalue_fields": ["_id"],
  "size": 10000,
  "query": {
    "knn": {
      "sem_vector": {
        "vector": query_vec,
        "k": 10000
      }
    }
  }
}

However, I also want to use ES as the data store in my use case, so I want to retrieve more fields than just the IDs. As soon as I add a single field to be retrieved to the search request, query time rises to 2.5 s, i.e. 25x slower. Do you have any idea how to avoid this? The fields I retrieve are text, integer, and date fields. Slow query:

{
  "_source": "required_field",
  "size": 10000,
  "query": {
    "knn": {
      "sem_vector": {
        "vector": query_vec,
        "k": 10000
      }
    }
  }
}

vamshin commented 3 years ago

Hi @smodlich,

If there are only selected fields you want to retrieve, you could set store=true in the respective field mappings and request just those fields instead of loading the whole _source. This comes at the cost of additional storage but gives faster retrieval.

Note that keyword, date, and numeric fields should already be available in doc values, so you can retrieve them with docvalue_fields just as you do for "_id".
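The suggestion above could look roughly like the following sketch. It assumes the text field is the `required_field` from the slow query; `created_at` and `category_id` are hypothetical names standing in for the date and integer fields. The text field is marked `"store": true` in the mapping and requested through `stored_fields`, while the date and integer fields are read from doc values. Changing `store` on an existing field generally requires reindexing, and the actual speedup has not been measured here.

```python
import json


def mapping_with_stored_field():
    """Mapping in which the text field is stored separately from _source."""
    return {
        "properties": {
            "required_field": {"type": "text", "store": True},
            # Hypothetical fields; date and integer have doc values by default.
            "created_at": {"type": "date"},
            "category_id": {"type": "integer"},
        }
    }


def knn_query(query_vec, k=10000):
    """k-NN search that avoids loading _source for every hit."""
    return {
        "_source": False,
        "stored_fields": ["required_field"],
        "docvalue_fields": ["created_at", "category_id"],
        "size": k,
        "query": {
            "knn": {"sem_vector": {"vector": list(query_vec), "k": k}}
        },
    }


if __name__ == "__main__":
    print(json.dumps(mapping_with_stored_field(), indent=2))
    print(json.dumps(knn_query([0.0] * 100), indent=2))
```

In the response, stored and doc-value fields both appear under each hit's `fields` key rather than under `_source`.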