opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
155 stars 114 forks source link

[Segment Replication] Add tests for Segment replication feature #921

Open dreamer-89 opened 1 year ago

dreamer-89 commented 1 year ago

Recently, knn plugin integration was found broken for 2.7.0 version (issue on core https://github.com/opensearch-project/OpenSearch/issues/7781). This is due to bugs on core where 1) codec names were used before allowing replica to copy files from primary. 2) using default codec name on replica which is current version of Lucene codec e.g. Lucene95 today

Due to 2) above, the replication events for kNN indices were blocked from primary because it's lucene codec name (KNNXXXCodec) differs from replica (LuceneXXX). The end result was replica shard remained unassigned forever.

As part of this issue, we need to add tests so that these issues be caught before release.

  1. Integ Tests
  2. Bwc Tests

Background on core issue

Segment replicaion event fails for plugins using custom codecs (e.g. kNN). The failure prevents replica shard from allocation. The end result is replica shard remains unassigned and will remain forever.

During peer recovery, for segment replication enabled indices a force segment sync is performed to keep the shard upto date from primary. Recently, we added a fix where to prevent segment replication events b/w primary and replica when they are using a different codec implementations. This is problematic as we fetched replica shard default codec rather than the one on the engine config.

navneet1v commented 1 year ago

@dreamer-89 can you provide some details like what is the expected solution?

dreamer-89 commented 1 year ago

@dreamer-89 can you provide some details like what is the expected solution?

Thanks @navneet1v for checking on this. The issue description wasn't correct (actually copied from core issue). I updated the issue to make to add tests which is actionable on this repo.

dreamer-89 commented 1 year ago

Opened a PR to add basic integ test: https://github.com/opensearch-project/k-NN/pull/927

I see by default, tests use one node cluster which is not useful for segment replication. Thus, need some opinion from repo owners around structuring this into a separate test class which tied to a different gradle task or identify if newly added test can be run in multi-node setup. Tagging @navneet1v @naveentatikonda for visibility.