Introduce new setting to configure when to build graph during segment creation

VijayanB commented 1 month ago

Description

Added new updatable index setting "build_vector_data_structure_threshold", which will be considered when to build relevant vector data structure or not for native engines before creating and merging segments. The default value is chosen to be 0 to make it backward compatible. This is applicable only for native engines, and, it is no-op for lucene. We also don't need a feature flag since implementation is behind NativeEngines990KnnVectorsFormat. Hence, only index that are created 2.17 and above will have the ability to pause building graph while indexing or merging based on this updatable index setting.

In this PR, we introduced a setting to decide when to build the graph with min as -1 ( no graph will be built ) and Integer.MAX_Value - 2 ( as max value up to which user can provide ). Currently the default value is 0 ( build graph always ). We will be updating this value to a reasonable value ( greater than 0) to optimize when should we build graph if user didn't provide this value as next PR.

Similarly, during search, if no graph is available, we will call exact search for conssitency.

Related Issues

Part of #1942

Check List

[x] New functionality includes testing.
[ ] New functionality has been documented.
[ ] API changes companion pull request created.
[x] Commits are signed per the DCO using --signoff.
[ ] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

jmazanec15 commented 1 month ago

@VijayanB What is behavior around search before, mixed and after this setting is set from false to true? Are some segments search via exact and others not? Also, why is this an index setting and not field mapping parameter?

heemin32 commented 1 month ago

I could not make a connection between the issue https://github.com/opensearch-project/k-NN/issues/1942 and this PR. How is this PR going to be used for the issue? About delaying of graph creation, shouldn't it be more automated instead user manually turn it on and off?

navneet1v commented 1 month ago

I could not make a connection between the issue #1942 and this PR. How is this PR going to be used for the issue? About delaying of graph creation, shouldn't it be more automated instead user manually turn it on and off?

@heemin32 with this change the solution builds the capability to completely stop the Vector DataStructure creation, which is Delaying Vector Data structures Creation in the GH issue.

Now the GH issue says that graph creation should be done even after force_merge, but as of now it cannot be done, because it requires coming with new API for doing merges. Plus the way force_merge works with 1 segment use-case is it doesn't build temporary segments and directly merges to 1 segment.

So with this capability the idea would be if we are doing indexing once, we ingest data by disabling graph creation and then enable the setting. After that we do force_merge to 1 segment for best performance for index build.

The next part of the issue will be to have a vector threshold based mechanism to see when to build the graph and when not to.

To avoid confusion I added more details on the GH issue.

VijayanB commented 1 month ago

I could not make a connection between the issue #1942 and this PR. How is this PR going to be used for the issue? About delaying of graph creation, shouldn't it be more automated instead user manually turn it on and off?

As a part of incremental graph build approach, this PR will allow user to have ability to pause building graph before completing ingestion process. Before this feature, knn plugin will build graph every time during flush, which contributes to the high total indexing time whenever new segments are being created. If user is going to merge segments after ingestion, those graphs which are created during segment flush will be discarded and new graph will be built from scratch during force merge. This gives additional control to users who wishes to reduce indexing time if their use case doesn't require building graph on every segment creation step.

As follow up, we will provide automated solution to decide whether to build graph or not based on number of vectors in given segment. We will introduce a threshold parameter and will let flush/merge decide whether to build graph or not accordingly.

VijayanB commented 1 month ago

@VijayanB The primary advantage seems to be reduce the CPU spike before merge, Are we seeing any improvements during force merge? Can you add the CPU usage of before and after with this PR? we probably need one for index and one for force merge

Overall can we have a summary of alternative solutions considered? I am a bit hesitant to add a user exposed setting which does not give the improvements out of the box.

The primary advantage is to provide more control to users on when to build vector data structures like graph. In current approach, CPU utilization for indexing is high due to graph creation process which will be later discarded if force merge is executed after ingestion. This setting will allow users to disable graph building step while ingestion ( this will reduce CPU utilization, which in turn improve ingestion throughput), and later update setting before merge in order to build graph as part of merge step, just before search operation.

VijayanB commented 3 weeks ago

Updated this PR's base to feature branch

shatejas commented 2 weeks ago

So NativeEngines990KnnVectorsWriter doesn't have a unit test and the class is getting complex to not have a unit test. I am in the process of adding one but for this particular use case can we add one? I can take care of other use cases

https://github.com/opensearch-project/k-NN/pull/2097 should help you with building a unit test for this

opensearch-project / k-NN