[Feature]: Multi-Stage Retrieval and Clustering for Unordered Data

milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications

https://milvus.io

Apache License 2.0


[Feature]: Multi-Stage Retrieval and Clustering for Unordered Data #26661

Open sevenold opened 1 year ago

sevenold commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

Is your feature request related to a problem? Please describe.

When searching through unordered data, a similarity threshold of 0.99 is set. The initial search finds 100 matches. These 100 matches are then used as queries for a subsequent search, which yields an additional 20 matches. This new set of 20 matches is then used for another search, and so on until a search adds no new items. Finally, the results are deduplicated, giving a total of 120 retrieved items.

Alternatively, it might be possible to develop a clustering interface that aims to get the most closely related clusters.

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

jiaoew1991 commented 1 year ago

Hi @sevenold, Milvus 2.3.0 added a new range search feature, which lets you search for all data within a certain distance. Does this meet your requirements? https://milvus.io/docs/within_range.md

sevenold commented 1 year ago

@jiaoew1991 Each of my searches is already performed as a range search. In the actual business scenario, a single search cannot retrieve all relevant results, so multiple search iterations are required.

params={'metric_type': 'IP', 'params': {'nprobe': 32, 'radius': 0.99}})

jiaoew1991 commented 1 year ago

@sevenold Let me describe your scenario to check whether my understanding is correct. What you want is an array of clusters, where each element contains a cluster center together with the top k vectors closest to that center. 🤔

sevenold commented 1 year ago

@jiaoew1991 Just like in the diagram below, the final result is the deduplicated set B. The stopping condition could be a maximum number of levels, or a round in which nothing new is added to B.
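The multi-round procedure described here can be sketched in plain Python. This is only an illustration under stated assumptions, not Milvus code: `range_search` is a hypothetical in-memory stand-in for one range search (it keeps ids whose inner product with the query is strictly greater than `radius`), and `expand`, `store`, and `max_levels` are names invented for the example. In a real deployment each call to `range_search` would be one range search request against the collection.

```python
import math
from typing import Dict, Optional, Sequence, Set

Vector = Sequence[float]


def inner_product(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def range_search(store: Dict[int, Vector], query: Vector, radius: float) -> Set[int]:
    # Hypothetical stand-in for a single range search with metric IP:
    # returns ids whose similarity to the query exceeds the radius.
    return {i for i, v in store.items() if inner_product(query, v) > radius}


def expand(store: Dict[int, Vector], seed: Vector, radius: float = 0.99,
           max_levels: Optional[int] = None) -> Set[int]:
    """Repeat range searches, using each round's new hits as the next queries."""
    found: Set[int] = set()   # deduplicated result set ("B" in the thread)
    frontier = [seed]         # queries for the current level
    level = 0
    # Stop when a round adds nothing new, or when max_levels is reached.
    while frontier and (max_levels is None or level < max_levels):
        new_ids: Set[int] = set()
        for query in frontier:
            new_ids |= range_search(store, query, radius) - found
        found |= new_ids
        frontier = [store[i] for i in sorted(new_ids)]
        level += 1
    return found


def unit(deg: float) -> Vector:
    r = math.radians(deg)
    return (math.cos(r), math.sin(r))


# Toy data: unit vectors at 0, 7, 14, 21 and 90 degrees (ids 0..4).
store = {i: unit(deg) for i, deg in enumerate([0, 7, 14, 21, 90])}
seed = unit(0)

all_hits = expand(store, seed)                  # runs until no new ids appear
two_levels = expand(store, seed, max_levels=2)  # capped by number of levels
```

In this toy data each vector is within the threshold of its immediate neighbour only, so the uncapped run reaches ids 0 to 3 over several rounds while the vector at 90 degrees is never reached. The cost grows with the number of new hits per round, since every new hit becomes a query in the next round.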

jiaoew1991 commented 1 year ago

@sevenold Thank you for the explanation. I understand your needs now. Could you please further explain your business scenario? We can discuss whether there is a better solution based on the specific business context. 😄

xiaofan-luan commented 1 year ago

According to the index definition, shouldn't a neighbour's neighbour also be your neighbour? Maybe using a search iterator and iterating from the nearest neighbours of A could work?
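One way to check the "neighbour's neighbour" question for a fixed threshold is a small numeric example. The sketch below is an assumption-laden illustration, not a statement about how any Milvus index behaves: it uses three hand-picked unit vectors (the names `a`, `b`, `c`, `unit`, and `ip` are made up for the example) and the 0.99 inner-product threshold from the thread.

```python
import math


def unit(deg):
    # Unit vector in 2D at the given angle, so inner product equals cosine.
    r = math.radians(deg)
    return (math.cos(r), math.sin(r))


def ip(u, v):
    return sum(x * y for x, y in zip(u, v))


RADIUS = 0.99  # threshold used in the thread

a, b, c = unit(0), unit(7), unit(14)

a_b = ip(a, b) > RADIUS  # cos(7 degrees) is about 0.9925
b_c = ip(b, c) > RADIUS  # also cos(7 degrees)
a_c = ip(a, c) > RADIUS  # cos(14 degrees) is about 0.9703
```

Here `a_b` and `b_c` are true while `a_c` is false, so under a fixed threshold a neighbour of a neighbour can fall outside the range of the original query. That would be consistent with the need for several search rounds described earlier in the thread.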