opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
156 stars 123 forks

Refactor Around Mapper and Mapping #1939

Closed jmazanec15 closed 3 months ago

jmazanec15 commented 3 months ago

Description

This PR refactors the current FieldMapper structure. The main motivation is that simplifying it will make it easier to make changes with confidence. A good amount of complexity has built up because there are multiple ways to create an index (LegacyFieldMapper, Model, Lucene and Method), each with different data types. I wanted to reorganize the code so that it is easier to extend.

In general, this refactor does the following:

  1. It removes branching around the encoder and space type during parsing. It does so by adding per-dimension validators and processors and creating implementations for the different types. It then adds a couple of methods that the specific FieldMappers can implement.
  2. It encapsulates the dimension, knnMethodContext and modelId behind a new class called ANNConfig. I'm not 100% sure about this name, so I would appreciate any suggestions. The main purpose of this class is to control how calling logic accesses this information, so that the branches are simpler and safer to handle. For example, callers no longer need to implicitly deal with -1 dimensions.
  3. It removes the LegacyFieldMapper and uses MethodFieldMapper instead, building the knnMethodContext from the settings. The LegacyFieldMapper is not really necessary, given that the information in the settings is enough to build the knnMethodContext. This also eliminates some branching logic in the KNN80DocValuesConsumer.
  4. It creates a new FieldMapper called FlatFieldMapper (also open to suggestions). This FieldMapper is used if knn=false. It can be repurposed in the future for more direct control over exact search. It should be a separate class because it doesn't need many of the things that the other, ANN-specific mappers require, such as space types.
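
To make the first point concrete, here is a minimal sketch of how per-dimension validators and processors could replace encoder- and type-specific branches in the parsing loop. Every name in it (`PerDimensionValidator`, `PerDimensionProcessor`, `ByteRangeValidator`, `FiniteFloatValidator`, `ClipProcessor`, `VectorParser`) is a hypothetical stand-in chosen for illustration, and the process-then-validate order is an assumption. It is not the code in this PR.

```java
// Illustrative sketch only: all type names here are hypothetical, not the plugin's API.

/** Checks a single vector component; throws if the value is not acceptable. */
interface PerDimensionValidator {
    void validate(float value);
}

/** Transforms a single vector component before it is validated and stored. */
interface PerDimensionProcessor {
    float process(float value);
}

/** Accepts only whole numbers that fit in a signed byte. */
final class ByteRangeValidator implements PerDimensionValidator {
    @Override
    public void validate(float value) {
        if (value % 1 != 0 || value < Byte.MIN_VALUE || value > Byte.MAX_VALUE) {
            throw new IllegalArgumentException("value " + value + " is not a whole number in [-128, 127]");
        }
    }
}

/** Rejects NaN and infinite components. */
final class FiniteFloatValidator implements PerDimensionValidator {
    @Override
    public void validate(float value) {
        if (Float.isNaN(value) || Float.isInfinite(value)) {
            throw new IllegalArgumentException("value must be a finite number");
        }
    }
}

/** Clamps each component into [min, max]. */
final class ClipProcessor implements PerDimensionProcessor {
    private final float min;
    private final float max;

    ClipProcessor(float min, float max) {
        this.min = min;
        this.max = max;
    }

    @Override
    public float process(float value) {
        return Math.max(min, Math.min(max, value));
    }
}

/** The parsing loop has no type-specific branches; behavior is injected. */
final class VectorParser {
    private final PerDimensionValidator validator;
    private final PerDimensionProcessor processor;

    VectorParser(PerDimensionValidator validator, PerDimensionProcessor processor) {
        this.validator = validator;
        this.processor = processor;
    }

    float[] parse(float[] raw) {
        float[] out = new float[raw.length];
        for (int i = 0; i < raw.length; i++) {
            // Assumed order: process first, then validate the processed value.
            float value = processor.process(raw[i]);
            validator.validate(value);
            out[i] = value;
        }
        return out;
    }
}
```

Under this shape, a mapper would pick its validator and processor once when it is constructed, for example `new VectorParser(new FiniteFloatValidator(), new ClipProcessor(-1f, 1f))`, and the loop itself would not change when a new data type or encoder is added.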

One note: I avoided changing much around the VectorDataType. I think it should eventually move into the ANNConfig class, but I will leave that for another time.
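
For reviewers weighing in on the ANNConfig idea, the following is a rough sketch of the kind of encapsulation described in item 2. The class name `AnnConfigSketch`, its factory methods, and the use of a plain `String` in place of the real knnMethodContext type are all assumptions made for illustration. The actual class in this PR may look quite different.

```java
import java.util.Objects;
import java.util.Optional;

// Illustrative sketch only: hypothetical names, with a String standing in for knnMethodContext.
final class AnnConfigSketch {
    private final Integer dimension;   // null when the dimension comes from a model
    private final String methodName;   // stand-in for the method context; null for model-based fields
    private final String modelId;      // null for method-based fields

    private AnnConfigSketch(Integer dimension, String methodName, String modelId) {
        this.dimension = dimension;
        this.methodName = methodName;
        this.modelId = modelId;
    }

    /** Configuration for a field defined directly by a method and a dimension. */
    static AnnConfigSketch fromMethod(int dimension, String methodName) {
        if (dimension <= 0) {
            throw new IllegalArgumentException("dimension must be positive, got " + dimension);
        }
        return new AnnConfigSketch(dimension, Objects.requireNonNull(methodName), null);
    }

    /** Configuration for a field whose details are resolved later from a model. */
    static AnnConfigSketch fromModel(String modelId) {
        return new AnnConfigSketch(null, null, Objects.requireNonNull(modelId));
    }

    // Callers receive an Optional instead of a -1 sentinel or a null check.
    Optional<Integer> getDimension() {
        return Optional.ofNullable(dimension);
    }

    Optional<String> getMethodName() {
        return Optional.ofNullable(methodName);
    }

    Optional<String> getModelId() {
        return Optional.ofNullable(modelId);
    }
}
```

The point of the private constructor and the two factories is that an invalid combination, such as a negative dimension or a field with neither a method nor a model, cannot be constructed, so calling code only has to handle the `Optional` cases.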

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.