opendistro-for-elasticsearch / k-NN

🆕 A machine learning plugin which supports an approximate k-NN search algorithm for Open Distro.
https://opendistro.github.io/
Apache License 2.0

faiss interface refactoring to support multiple methods #344

Closed jmazanec15 closed 3 years ago

jmazanec15 commented 3 years ago

Issue #, if available: 225

Description of changes: This PR refactors the interface of the current faiss-support branch to support several additional features, including:

  1. IVF index type - a cell-probe-based method that lets a user reduce the search space by partitioning vectors with a k-Means clustering algorithm. It takes "ncentroids" and "nprobes" as parameters
  2. Product quantization - a method to encode vectors to reduce size. It takes "code_size" as a parameter
  3. Composite indices - the ability to combine different faiss features into a single index
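To make the roles of "ncentroids" and "nprobes" concrete, here is a minimal pure-Python sketch of cell-probe search. It is not the faiss implementation; `ToyIVF`, `kmeans`, and the other helper names are made up for illustration, and the clustering is plain Lloyd's k-means with a fixed iteration count.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, vector):
    """Index of the centroid closest to vector."""
    return min(range(len(centroids)), key=lambda c: sq_dist(vector, centroids[c]))


def kmeans(vectors, ncentroids, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns a list of centroid vectors."""
    centroids = [list(v) for v in random.Random(seed).sample(vectors, ncentroids)]
    for _ in range(iterations):
        cells = [[] for _ in range(ncentroids)]
        for v in vectors:
            cells[nearest(centroids, v)].append(v)
        for c, members in enumerate(cells):
            if members:  # keep the previous centroid if a cell went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


class ToyIVF:
    """Hypothetical stand-in for an IVF index (illustration only)."""

    def __init__(self, ncentroids, nprobes):
        self.ncentroids = ncentroids
        self.nprobes = nprobes
        self.centroids = []
        self.cells = []

    def train(self, vectors):
        # Training learns the cell centroids; no vectors are stored yet.
        self.centroids = kmeans(vectors, self.ncentroids)
        self.cells = [[] for _ in self.centroids]

    def add(self, vectors):
        # Each vector is filed under its nearest centroid.
        for v in vectors:
            self.cells[nearest(self.centroids, v)].append(v)

    def search(self, query, k):
        # Scan only the nprobes cells whose centroids are closest to the query.
        probe_order = sorted(
            range(len(self.centroids)),
            key=lambda c: sq_dist(query, self.centroids[c]),
        )
        candidates = [v for c in probe_order[: self.nprobes] for v in self.cells[c]]
        return sorted(candidates, key=lambda v: sq_dist(query, v))[:k]
```

With nprobes equal to ncentroids every cell is scanned and the result matches brute force; lowering nprobes trades recall for a smaller search space.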

The interface looks like:

{
   "my_vector":{
      "type":"knn_vector",
      "dimension":4,
      "method":{
         "name":"ivf",
         "engine":"faiss",
         "coarse_quantizer":{
            "name":"ivf",
            "parameters":{
               "ncentroids":15
            }
         },
         "encoder":{
            "name":"pq",
            "parameters":{
               "code_size":8
            }
         },
         "parameters":{
            "ncentroids":128
         }
      }
   }
}
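As a rough illustration of how such a method definition could be turned into a faiss index factory description, here is a small sketch. The translation rules are assumptions, not the PR's actual logic: it assumes the "coarse_quantizer" block supplies the centroid count, that "code_size" maps onto the number after `PQ`, and that a missing encoder means `Flat`. The function name `factory_description` is hypothetical, and the top-level "parameters" block is ignored here.

```python
import json

# The mapping from the example above, embedded so the sketch is self-contained.
MAPPING_JSON = """
{
   "my_vector":{
      "type":"knn_vector",
      "dimension":4,
      "method":{
         "name":"ivf",
         "engine":"faiss",
         "coarse_quantizer":{"name":"ivf", "parameters":{"ncentroids":15}},
         "encoder":{"name":"pq", "parameters":{"code_size":8}},
         "parameters":{"ncentroids":128}
      }
   }
}
"""


def factory_description(method):
    """Build a faiss-style factory string such as "IVF15,PQ8" (assumed rules)."""
    parts = []
    coarse = method.get("coarse_quantizer")
    if coarse is not None:
        if coarse["name"] != "ivf":
            raise ValueError(f"unsupported coarse quantizer: {coarse['name']}")
        parts.append(f"IVF{coarse['parameters']['ncentroids']}")
    encoder = method.get("encoder")
    if encoder is None:
        parts.append("Flat")  # assumption: store raw vectors when no encoder is given
    elif encoder["name"] == "pq":
        parts.append(f"PQ{encoder['parameters']['code_size']}")
    else:
        raise ValueError(f"unsupported encoder: {encoder['name']}")
    return ",".join(parts)


method = json.loads(MAPPING_JSON)["my_vector"]["method"]
description = factory_description(method)
```

Under these assumptions the example mapping yields a composite description that combines an IVF coarse quantizer with a PQ encoder in a single string.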

The main logic where the interface has been refactored can be found in:

  1. KNNVectorFieldMapper - where the parsing between the user-provided method and the plugin occurs
  2. KNNMethodContext - stored structure of the user-provided method configuration
  3. KNNMethod - structure of a given method supported by a particular engine
  4. KNNLibrary - interface for a particular library. Includes implementations for nmslib and faiss
  5. KNNEngine - enum mapping name to KNNLibrary
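The relationship between these pieces can be sketched roughly as follows. This is a Python illustration of the shape only, not the plugin's Java code: the class names mirror the list above, but every field, the `parse` and `validate` methods, and the parameter sets per method are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class KNNMethod:
    """A method one engine supports, with the parameter names it accepts."""
    name: str
    allowed_parameters: frozenset


class KNNLibrary:
    """One library (engine) and the methods it knows about."""

    def __init__(self, methods):
        self.methods = {m.name: m for m in methods}

    def validate(self, context):
        method = self.methods.get(context.name)
        if method is None:
            raise ValueError(f"method '{context.name}' is not supported")
        unknown = set(context.parameters) - method.allowed_parameters
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")


class KNNEngine(Enum):
    """Maps an engine name to its KNNLibrary."""
    NMSLIB = "nmslib"
    FAISS = "faiss"

    @property
    def library(self):
        return _LIBRARIES[self]


# Illustrative contents only; the real supported methods may differ.
_HNSW = KNNMethod("hnsw", frozenset({"m", "ef_construction"}))
_IVF = KNNMethod("ivf", frozenset({"ncentroids", "nprobes"}))
_LIBRARIES = {
    KNNEngine.NMSLIB: KNNLibrary([_HNSW]),
    KNNEngine.FAISS: KNNLibrary([_HNSW, _IVF]),
}


@dataclass
class KNNMethodContext:
    """Stored form of the user-provided method configuration."""
    engine: KNNEngine
    name: str
    parameters: dict

    @classmethod
    def parse(cls, method):
        # The field mapper would call something like this on the "method" object.
        engine = KNNEngine(method["engine"])
        context = cls(engine, method["name"], dict(method.get("parameters", {})))
        engine.library.validate(context)
        return context
```

Keeping validation in the library means a new engine or method only needs a new registry entry; the parsing path stays unchanged.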

A lot of code was changed in order to support these additional features:

  1. Because we use faiss's index factory, only some of the parameters can be configured through the index factory description string. To support additional parameters (for example, ef_construction for HNSW), this PR adds functionality to pass an extra parameter map to the JNI layer to be parsed.
  2. Because IVF and PQ require training, the JNI save index function in this PR implements a training approach where a subset of the data to be indexed is used for training. This is inherently inefficient because each segment has to be trained before data can be added to it. In the future, we will introduce a train API that trains before indexing, to work around this.
  3. Several other minor changes to make the refactor cleaner and easier
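The first two points can be sketched as follows. This is a Python illustration under stated assumptions, not the JNI code: `FACTORY_KEYS`, `split_parameters`, `save_segment`, `ToyTrainableIndex`, and the 10% training fraction are all hypothetical. The toy index only mirrors the `is_trained` / `train` / `add` split that faiss indexes expose.

```python
import random

# Assumption: these parameters can be expressed in the factory description;
# everything else travels in the extra parameter map.
FACTORY_KEYS = frozenset({"m", "ncentroids", "code_size"})


def split_parameters(parameters, factory_keys=FACTORY_KEYS):
    """Separate factory-string parameters from the extra parameter map."""
    in_description = {k: v for k, v in parameters.items() if k in factory_keys}
    extra = {k: v for k, v in parameters.items() if k not in factory_keys}
    return in_description, extra


class ToyTrainableIndex:
    """Stand-in for an index that must be trained before vectors are added."""

    def __init__(self):
        self.is_trained = False
        self.trained_on = 0
        self.vectors = []

    def train(self, sample):
        self.trained_on = len(sample)
        self.is_trained = True

    def add(self, vectors):
        if not self.is_trained:
            raise RuntimeError("index must be trained before adding vectors")
        self.vectors.extend(vectors)


def save_segment(index, segment_vectors, train_fraction=0.1, seed=0):
    """Train on a random subset of one segment, then add the whole segment."""
    if not segment_vectors:
        return index
    if not index.is_trained:
        sample_size = max(1, int(len(segment_vectors) * train_fraction))
        sample = random.Random(seed).sample(segment_vectors, sample_size)
        index.train(sample)
    index.add(segment_vectors)
    return index
```

Because `save_segment` runs once per segment with a fresh index, the training cost is paid for every segment, which is the inefficiency a separate training step would remove.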

Testing

For testing, this PR focuses on adding tests that exercise the interface, as opposed to adding end-to-end tests of each JNI library's functionality, because that functionality will change in the future. Right now, it is just a placeholder to get the interface functionality working. That being said, the following test refactoring was done:

  1. Added additional unit tests to test faiss interface
  2. Refactored old tests so that gradle build passes

Future Development

  1. Introduce training API
  2. Add additional end-to-end tests
  3. Investigate storing data exclusively with faiss (as opposed to storing vectors in doc values in Lucene)

Notes

We are in the process of migrating from ODFE to OpenSearch. This includes porting the faiss-support branch to OpenSearch. Because porting requires significant refactoring, we will merge this PR and then port the faiss-support branch to OpenSearch.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

jmazanec15 commented 3 years ago

Closing PR now. Will continue work on OpenSearch repo.