opensearch-project / k-NN

🆕 Find the k-nearest neighbors (k-NN) for your vector data
https://opensearch.org/docs/latest/search-plugins/knn/index/
Apache License 2.0
156 stars 123 forks source link

Optimize reduceToTopK in ResultUtil by removing pre-filling and reducing peek calls #2146

Closed junqiu-lei closed 1 month ago

junqiu-lei commented 2 months ago

Description

This PR optimizes the reduceToTopK method by eliminating unnecessary pre-filling of the priority queue, reducing redundant peek() calls, and adding null safety checks for better performance.

Related Issues

Closes https://github.com/opensearch-project/k-NN/issues/2145

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

junqiu-lei commented 2 months ago

Offline synced with @navneet1v and @jmazanec15, we now focused on optimizing reduceToTopK function in ResultUtil, updated the PR.

junqiu-lei commented 2 months ago

@junqiu-lei Did we compare removing the map with this approach to see which is faster? or is it not possible? I feel like its a simpler code that way if its yielding better results

We might have to first check if KNNResults array is in order. but that way we can do something similar to what lucene does and just pick the top elements and stop at k in reduceToTopK

I think this is another place we can optimize, no we don't have comparing the difference so far. For this PR, we might can use micro benchmark for testing the improvement if possible.

junqiu-lei commented 1 month ago

So, I think the most optimal way to do this would be to require that each perLeafResults comes in order from top scores to worst scores. Then, a simple algorithm that basically just scans the top of each set of results and then only takes the top k. That being, that would require how results are returned in the different methods. Do you think we should do this @navneet1v ?

I can raise other PR for this part optimization.

navneet1v commented 1 month ago

So, I think the most optimal way to do this would be to require that each perLeafResults comes in order from top scores to worst scores. Then, a simple algorithm that basically just scans the top of each set of results and then only takes the top k. That being, that would require how results are returned in the different methods. Do you think we should do this @navneet1v ?

Yes I think we should change the return types