opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

Reproducibility of Faiss model #379

Closed andBabaev closed 2 years ago

andBabaev commented 2 years ago

Hi all,

I am working on a Faiss IVF flat model. These are my model settings:

{
    "training_index": self.train_index_name,
    "training_field": self.train_vector_name,
    "dimension": 24,
    "description": "My models description",
    "method": {
        "name": "ivf",
        "space_type": "l2",
        "engine": "faiss",
        "parameters": {
            "nlist": 400,
            "encoder": {"name": "flat"},
        },
    },
}

The training index includes 200,000 vectors. I trained the model several times on this data and each time the trained model gave different results for the same test vector.

At the same time, when I train models with the same parameters through the faiss Python library, I get models that give exactly the same results every time.

Is it possible to train reproducible models with opensearch?

My Python code for the model:

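import faiss
import numpy as np

# Coarse quantiser that assigns each vector to one of the nlist cells
self.quantiser = faiss.IndexFlatL2(features.shape[1])
self.index = faiss.IndexIVFFlat(
    self.quantiser, features.shape[1], self.nlist, faiss.METRIC_L2
)
# Train the centroids, then add the same vectors to the index
self.index.train(features.astype(np.float32))
self.index.add(features.astype(np.float32))

jmazanec15 commented 2 years ago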

Hi @andBabaev

There are two components of indexing here: creating the model and adding the vectors. I suspect that the difference comes from how the vectors are added.

For the model, if you are using all of the training vectors in the training index, I would suspect that the model would be the same. To confirm this, could you get the model with the GET model API and confirm they are identical? https://opensearch.org/docs/latest/search-plugins/knn/api/#get-model
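One way to run this check is to compare the serialized models directly. The sketch below assumes that each GET model response has already been fetched and parsed into a dict, and that the serialized model sits in a `model_blob` field as in the linked API docs. The model IDs and blob strings are made-up placeholders, not real output.

```python
import hashlib


def model_fingerprint(get_model_response: dict) -> str:
    """Return a SHA-256 digest of the serialized model in a GET model response.

    Assumes the response carries the model under "model_blob" (a base64 string).
    """
    blob = get_model_response["model_blob"]
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def same_model(resp_a: dict, resp_b: dict) -> bool:
    """True when two training runs produced byte-identical serialized models."""
    return model_fingerprint(resp_a) == model_fingerprint(resp_b)


# Hypothetical, heavily truncated responses from three training runs.
run_1 = {"model_id": "run-1", "state": "created", "model_blob": "SXdGbAAA"}
run_2 = {"model_id": "run-2", "state": "created", "model_blob": "SXdGbAAA"}
run_3 = {"model_id": "run-3", "state": "created", "model_blob": "SXdGbAAB"}

print(same_model(run_1, run_2))
print(same_model(run_1, run_3))
```

If the fingerprints match across runs, the models are identical and any difference in search results has to come from the indexing side.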

For indexing, a vector will be added to one of the shards. Each shard will have a set of immutable segments. During search, each segment will get searched and the results will be aggregated. The faiss index will map to a file in one of these segments: all vectors in this segment will be added to the faiss index for that segment. Depending on how many shards you have and how many segments are created for each shard, results may vary. To get more consistent results, I would try increasing the nprobes parameter. Also, you can force merge the index down to a single segment.
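The two suggestions can be sketched as follows. The snippet only builds the request pieces and leaves the HTTP client out. It assumes `nprobes` is passed in the `parameters` of the `ivf` method when the model is trained, and that the segment count is reduced through the standard `_forcemerge` endpoint. The index name `my-knn-index` and the value 40 are placeholders chosen for illustration.

```python
import json


def ivf_method(nlist: int, nprobes: int) -> dict:
    """Build the "method" section of a train request with an explicit nprobes."""
    return {
        "name": "ivf",
        "space_type": "l2",
        "engine": "faiss",
        "parameters": {
            "nlist": nlist,
            "nprobes": nprobes,  # number of cells visited per query
            "encoder": {"name": "flat"},
        },
    }


def force_merge_request(index_name: str, max_num_segments: int = 1) -> tuple:
    """Return the (HTTP method, path) pair that merges an index's segments."""
    path = f"/{index_name}/_forcemerge?max_num_segments={max_num_segments}"
    return ("POST", path)


# Placeholder values: 400 cells as in the question, 40 probes as an example.
print(json.dumps(ivf_method(nlist=400, nprobes=40), indent=2))
print(force_merge_request("my-knn-index"))
```

A higher `nprobes` trades query latency for recall, so it is worth raising it in steps rather than jumping straight to `nlist`.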

A few follow-up questions:

  1. How many nodes?
  2. How many primary and replica shards?
  3. How many segments does each shard have?
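These questions can usually be answered from the `_cat` APIs. The sketch below lists the relevant requests and counts segments per shard copy from the parsed output of `GET /_cat/segments/<index>?format=json`. The index name and the sample rows are invented for illustration, and only the `shard`, `prirep` and `segment` columns are used.

```python
from collections import Counter


def topology_requests(index_name: str) -> dict:
    """Requests that report nodes, shard layout, and segments for an index."""
    return {
        "nodes": "GET /_cat/nodes?v",
        "shards": f"GET /_cat/shards/{index_name}?v",
        "segments": f"GET /_cat/segments/{index_name}?format=json",
    }


def segments_per_shard(cat_segments_rows: list) -> dict:
    """Count segments for each (shard, primary/replica) pair."""
    counts = Counter((row["shard"], row["prirep"]) for row in cat_segments_rows)
    return dict(counts)


# Invented sample rows standing in for parsed _cat/segments output.
sample_rows = [
    {"index": "my-knn-index", "shard": "0", "prirep": "p", "segment": "_0"},
    {"index": "my-knn-index", "shard": "0", "prirep": "p", "segment": "_1"},
    {"index": "my-knn-index", "shard": "1", "prirep": "p", "segment": "_0"},
]

for name, request in topology_requests("my-knn-index").items():
    print(name, "->", request)
print(segments_per_shard(sample_rows))
```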