milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.89k stars 2.95k forks

[Feature]: Join two collections #35500

Open xiaofan-luan opened 3 months ago

xiaofan-luan commented 3 months ago

Is there an existing issue for this?

Is your feature request related to a problem? Please describe.

In some use cases, users need to search one collection for the top-k results for each entity of another collection.

This can be called a kNN join or a semantic join.
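
To make the operation concrete, here is a minimal brute-force sketch of a kNN join in plain Python. It is only an illustration of the semantics, not a proposed Milvus API: the names `knn_join`, `products`, and `reviews` are hypothetical, and squared L2 distance is an assumed metric.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_join(
    left: Dict[str, Vector], right: Dict[str, Vector], k: int
) -> Dict[str, List[Tuple[str, float]]]:
    """For every entity in `left`, return its k nearest entities in `right`."""
    result = {}
    for left_id, left_vec in left.items():
        scored = (
            (right_id, squared_l2(left_vec, right_vec))
            for right_id, right_vec in right.items()
        )
        # Keep only the k closest (id, distance) pairs for this left entity.
        result[left_id] = heapq.nsmallest(k, scored, key=lambda pair: pair[1])
    return result


# Hypothetical data: two tiny "collections" of 2-d vectors.
products = {"p1": (0.0, 0.0), "p2": (10.0, 10.0)}
reviews = {"r1": (0.0, 1.0), "r2": (9.0, 10.0), "r3": (5.0, 5.0)}
joined = knn_join(products, reviews, k=2)
```

Every left entity acts as one query against the right collection, which is why the rest of the discussion focuses on very large query batches.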

Simply listing it here for now, pending more discussion.

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

chasingegg commented 3 months ago

We could have something like batched search in the vector search engine. This is helpful with IVF-related indexes: we can group the queries that probe the same posting lists and do the distance computation as one matrix operation to improve QPS.
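
A rough sketch of that grouping idea, in plain Python and under stated assumptions: `batched_ivf_search`, the dict-based `posting_lists` layout, and squared L2 distance are all hypothetical and not taken from Milvus or any index library. The nested loop in step 2 stands in for what a real engine would do as a single matrix multiplication per posting list.

```python
import heapq
from collections import defaultdict


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def batched_ivf_search(queries, centroids, posting_lists, k, nprobe):
    # Step 1: route each query to its nprobe closest centroids and
    # group the query indices by the posting list they will probe.
    queries_per_list = defaultdict(list)
    for q_idx, q in enumerate(queries):
        nearest = heapq.nsmallest(
            nprobe,
            range(len(centroids)),
            key=lambda c: squared_l2(q, centroids[c]),
        )
        for list_id in nearest:
            queries_per_list[list_id].append(q_idx)

    # Step 2: scan each posting list once for all queries grouped on it.
    candidates = defaultdict(list)
    for list_id, q_indices in queries_per_list.items():
        for vec_id, vec in posting_lists[list_id]:
            for q_idx in q_indices:
                dist = squared_l2(queries[q_idx], vec)
                candidates[q_idx].append((dist, vec_id))

    # Step 3: keep the k best (distance, id) candidates per query.
    return [heapq.nsmallest(k, candidates[i]) for i in range(len(queries))]


# Hypothetical toy index with two posting lists.
centroids = [(0, 0), (10, 10)]
posting_lists = {
    0: [("a", (0, 1)), ("b", (2, 0))],
    1: [("c", (10, 9)), ("d", (9, 9))],
}
queries = [(0, 0), (10, 10), (1, 1)]
results = batched_ivf_search(queries, centroids, posting_lists, k=1, nprobe=1)
```

Here queries 0 and 2 both probe posting list 0, so that list is read once for the pair instead of once per query.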

xiaofan-luan commented 3 months ago

That is exactly what I'm thinking. To implement this, we need:

  1. LRU on segments (usually we don't need to load everything into main memory)
  2. Batch search on all segments (typically NQ == 100k)
  3. Use GPU or other batch optimizations in the index. In this mode, we don't really need to do batch insertion
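
As an illustration of item 1, a segment-level LRU could look roughly like the sketch below. It assumes a hypothetical `load_segment` callback that brings one segment into memory. `SegmentLRU` and its methods are made-up names, not Milvus internals.

```python
from collections import OrderedDict


class SegmentLRU:
    """Keep at most `capacity` segments in memory, evicting the least recently used."""

    def __init__(self, capacity, load_segment):
        self.capacity = capacity
        self.load_segment = load_segment  # hypothetical loader: segment_id -> data
        self._cache = OrderedDict()
        self.loads = 0  # number of times a segment had to be loaded

    def get(self, segment_id):
        if segment_id in self._cache:
            # Cache hit: mark the segment as most recently used.
            self._cache.move_to_end(segment_id)
            return self._cache[segment_id]
        # Cache miss: load the segment, then evict the oldest one if over capacity.
        segment = self.load_segment(segment_id)
        self.loads += 1
        self._cache[segment_id] = segment
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)
        return segment

    def resident(self):
        """Segment ids currently in memory, least recently used first."""
        return list(self._cache)


# Hypothetical access pattern over three segments with room for two.
cache = SegmentLRU(capacity=2, load_segment=lambda sid: f"data-{sid}")
for sid in ["s1", "s2", "s1", "s3", "s2"]:
    cache.get(sid)
```

With a large query batch, running the whole batch against one segment before moving to the next would keep the number of loads close to the number of segments.
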

xiaofan-luan commented 3 months ago

@liliu-z @chasingegg thoughts on it?

liliu-z commented 3 months ago

  1. An async/cron job API is needed.
  2. It is a general operation that can apply to any index and cache strategy (segment LRU, all in memory, etc.), though we have some preferred combinations.
  3. It can follow a Map-Reduce pattern: we first do batch searches and store the results on a cron job leader node (maybe the delegator), and then run a reduce step over them.
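
The Map-Reduce shape in item 3 might be sketched as follows. This is a single-process illustration with hypothetical names (`map_segment`, `reduce_partials`) and an assumed squared L2 metric. In a real deployment the map step would run where the segments live, and the reduce step on whichever node collects the partial results.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_segment(segment, queries, k):
    """Map step: per-query top-k (distance, id) hits within one segment."""
    partial = {}
    for q_id, q in queries.items():
        scored = [(squared_l2(q, vec), vec_id) for vec_id, vec in segment]
        partial[q_id] = heapq.nsmallest(k, scored)
    return partial


def reduce_partials(partials, k):
    """Reduce step: merge per-segment hits into a global top-k per query."""
    merged = {}
    for partial in partials:
        for q_id, hits in partial.items():
            merged.setdefault(q_id, []).extend(hits)
    return {q_id: heapq.nsmallest(k, hits) for q_id, hits in merged.items()}


# Hypothetical data: two segments and one query.
segments = [
    [("a", (0, 1)), ("b", (5, 5))],
    [("c", (2, 0)), ("d", (0, 3))],
]
queries = {"q1": (0, 0)}
partials = [map_segment(seg, queries, k=2) for seg in segments]
final = reduce_partials(partials, k=2)
```

Because each segment returns only its own top-k, the reducer handles at most k hits per query per segment, regardless of segment size.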