opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0

[BUG] Different nodes return different top K vectors with same model #1153

Open juntezhang opened 9 months ago

juntezhang commented 9 months ago

What is the bug?

I have deployed the Neural Search plugin on a cluster with 2 nodes. For the same index, the same model, and the same query with the same k, each node returns different top-k vectors, leading to inconsistent search results.

How can one reproduce the bug?

I have configured the fields in the mapping as follows:

      "_fulltext_source": {
          "type": "keyword",
          "ignore_above": 8191,
          "index": false,
          "store": false
      },
      "_fulltext_vectorized": {
          "type": "nested",
          "properties": {
            "knn": {
              "type": "knn_vector",
              "dimension": 768,
              "method": {
                "name": "hnsw",
                "engine": "lucene"
              }
            }
          }
      }

This is the query that I am running.

{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            {
              "nested": {
                "query": {
                  "neural": {
                    "_fulltext_vectorized.knn": {
                      "query_text": "cowboy",
                      "model_id": "T5fHiIoBJC8u9rWiXx8s",
                      "k": 10
                    }
                  }
                },
                "path": "_fulltext_vectorized",
                "score_mode": "max"
              }
            }
          ]
        }
      }
    }
  }
}

Make sure you have 2 nodes and an index with 1 primary shard and 1 replica.

Then run the above query in the request body with the preference parameter against each node and observe that the results differ:

GET <INDEX>/_search?preference=_only_nodes:<NODE1>
GET <INDEX>/_search?preference=_only_nodes:<NODE2>
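
If you don't have the node names or IDs handy for `_only_nodes`, one way to list them is the cat nodes API (the selected columns here are just an example):

GET _cat/nodes?v&h=id,name,node.role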

What is the expected behavior?

The results returned by each node should be the same: the top embeddings should be retrieved and used consistently on all copies of the shards for the same index.

What is your host/environment?

Running OpenSearch in Docker containers on the latest version of Ubuntu.

Do you have any screenshots?

Here you can see different results for the same query when it is run against different nodes.

The top hit is not so good here.

[Screenshot 2023-09-22 at 09:38:31]

Here, by contrast, the top hit is good and has a higher score.

[Screenshot 2023-09-22 at 09:38:57]

Do you have any additional context?

A temporary workaround is to increase k.
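
For example, the workaround applied to the inner neural query from above, with k raised to 100 (an arbitrary larger value; tune it for your dataset):

{
  "neural": {
    "_fulltext_vectorized.knn": {
      "query_text": "cowboy",
      "model_id": "T5fHiIoBJC8u9rWiXx8s",
      "k": 100
    }
  }
}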

navneet1v commented 9 months ago

Moving this to the k-NN repo, as this is related to the k-NN plugin and not Neural Search.

navneet1v commented 8 months ago

@juntezhang can you add some sample data that can help reproduce the issue?

navneet1v commented 5 months ago

@juntezhang

This behavior happens because the underlying graph data structure differs between the primary and the replicas. To provide some context, this is how replication works while a document is being indexed:

  1. First, the document is indexed on the primary shard.
  2. Once the primary acks, the document is sent to the replica shard. Because this is document-level replication and the underlying data structure (HNSW) is not built deterministically, the graphs on the primary and replica can end up different, as the example after this list illustrates.
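
One way to observe this (with `<INDEX>` standing in for your index name) is to compare the segments backing the primary (`p`) and replica (`r`) shard copies; under document replication they are built independently and typically differ in segment layout:

GET _cat/segments/<INDEX>?v&h=index,shard,prirep,segment,docs.count,size

Because the HNSW graphs are stored inside the segment files, different segments generally mean different graphs.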

Solution: OpenSearch has a segment replication feature, which, rather than sending documents to the replicas, copies whole segments from the primary shard to the replica. Ref: https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/segment-replication/index/

Blog: https://opensearch.org/blog/segment-replication/
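
A minimal sketch of enabling it at index creation (`<INDEX>` is a placeholder; the replication type is a static setting chosen when the index is created):

PUT <INDEX>
{
  "settings": {
    "index": {
      "replication.type": "SEGMENT",
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}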

juntezhang commented 5 months ago

@navneet1v thank you for your reply. Are you able to reproduce this issue as well? It happens with bigger datasets and more indices, each with 1 replica, on a cluster with 2 data nodes. I created a workaround using the preference parameter, but it's far from ideal.

I have enabled segment replication now, but I'm not sure whether it has fixed the issue. Initially I thought this was caused by different cached models, since I see that the k-NN cache size is different on each data node. I will check whether this bug is reproducible with an index that uses segment replication.
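
One way to compare the k-NN caches across the data nodes is the k-NN stats API; the second form below filters to specific nodes and stats (`<NODE1>`/`<NODE2>` are the placeholders from above, and `graph_memory_usage` mainly reflects the native-engine cache, so treat its usefulness for the Lucene engine as an assumption):

GET _plugins/_knn/stats
GET _plugins/_knn/<NODE1>,<NODE2>/stats/graph_memory_usage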

navneet1v commented 5 months ago

@juntezhang no, I didn't reproduce the bug, but based on my understanding of HNSW and OpenSearch I am aware that this scenario can happen. I also know that with segment replication this should not happen, provided the replicas and primaries are in sync, because the same graph is copied to the replicas.

> I will check whether this bug is reproducible with an index that uses segment replication.

Yes, please do try that.