milvus-io / pymilvus

Python SDK for Milvus.

[QUESTION]: Search multiple collections #1087

Open PeterPilley opened 2 years ago

PeterPilley commented 2 years ago

Is there an existing issue for this?

What is your question?

Is it possible to search multiple collections in one search? I know that in the library we have to nominate a collection and then perform a query on it, but I was hoping there was a way to search two collections from one query.

Anything else?

No response

shanghaikid commented 2 years ago

Just curious, what is your scenario?

If you search one vector across two collections in one request, what kind of result are you expecting Milvus to return?

PeterPilley commented 2 years ago

We have a base collection that is not updated by our users, and then a collection that they can update.

For a query we search them in series, but it would be awesome if we could search multiple collections that share a similar schema.

Does that make sense?

shanghaikid commented 2 years ago

@xiaofan-luan any thoughts?

yanliang567 commented 2 years ago

So you want to:

We have a base collection that is not updated by our users, and then a collection that they can update.

For a query we search them in series, but it would be awesome if we could search multiple collections that share a similar schema.

Does that make sense?

If you copy one query request and send it to different collections, does that work for you? Do you expect multiple search results or one combined search result?

PeterPilley commented 2 years ago

Hi, sorry, I thought I had replied to this. Doing multiple requests is okay, but it would be great to have one combined search result when multiple collections are specified, provided they have the same or a similar enough schema.
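In the meantime, a minimal client-side sketch of the series-then-merge approach with pymilvus (the collection names, the `embedding` field, and the 128-dim query vector are illustrative assumptions, not anything the SDK prescribes):

```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")

def search_collections(names, query_vec, limit=10):
    """Run the same query against each collection, then merge by distance."""
    merged = []
    for name in names:
        col = Collection(name)
        col.load()  # no-op if the collection is already loaded
        hits = col.search(
            data=[query_vec],
            anns_field="embedding",  # assumed shared vector field name
            param={"metric_type": "L2", "params": {"nprobe": 10}},
            limit=limit,
        )[0]
        # Tag each hit with its collection so the merged list stays traceable.
        merged.extend((hit.distance, hit.id, name) for hit in hits)
    # With L2, a smaller distance is better; keep the global top `limit`.
    merged.sort(key=lambda t: t[0])
    return merged[:limit]

top = search_collections(["base", "user_data"], query_vec=[0.1] * 128)
```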

xiaofan-luan commented 2 years ago

This might not be a common demand, since different collections may have totally different schemas, and it's hard to search across collections. If they all share the same schema, why not use multiple partitions instead?

PeterPilley commented 2 years ago

Can you explain the multiple-partitions approach?

At the moment we have 2 collections (we will have more in the future) with similar schemas: one is read-only for the user, the other is read-write.
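For reference, a rough sketch of the partition approach xiaofan-luan is suggesting: one collection with one shared schema, keeping the read-only and user-editable data in separate partitions (all names here are illustrative):

```python
from pymilvus import Collection

col = Collection("documents")        # one collection, one shared schema

# Keep the read-only base data and the user-editable data apart.
col.create_partition("base")
col.create_partition("user_edits")

# Inserts target a specific partition:
#   col.insert(rows, partition_name="user_edits")

# A single search can then span both partitions and return one
# already-merged result set:
col.load()
results = col.search(
    data=[[0.1] * 128],              # example 128-dim query vector
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10,
    partition_names=["base", "user_edits"],
)
```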

jayitsaha commented 3 months ago

Did we get any resolution to this use case?

I have one other use case that I am facing challenges with. Suppose we have "n" collections, and I want to do a single similarity search across these collections on a common vector field. Now I want to give weights to each of these collections, which will affect the search results. Is there anything for this type of scenario?
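Milvus has no built-in support for per-collection weights, but as a rough sketch they could be applied client-side after the individual searches (continuing the merge sketch above; the collection names and weight values are illustrative):

```python
from pymilvus import Collection

def weighted_multi_search(weights, query_vec, limit=10):
    """weights: {collection_name: weight}; a larger weight boosts that collection."""
    scored = []
    for name, w in weights.items():
        col = Collection(name)
        col.load()
        hits = col.search(
            data=[query_vec],
            anns_field="embedding",
            param={"metric_type": "IP", "params": {"nprobe": 10}},
            limit=limit,
        )[0]
        # With inner product, a larger score is better, so scale it directly.
        scored.extend((hit.distance * w, hit.id, name) for hit in hits)
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:limit]

top = weighted_multi_search({"base": 1.0, "user_data": 0.7}, [0.1] * 128)
```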

yanliang567 commented 3 months ago

We have not received many requests for this, so it is still open for discussion.

jayitsaha commented 3 months ago

Also, as a separate idea: suppose in the same collection I have a metadata column "Source" that records the different sources the vector DB was built from. Now imagine the Source column contains 10 unique sources, and I want to give a weight to each source so that the weight is folded into the similarity function.

I would want an alpha-beta balance between the similarity score and the source weight to produce the final ranking of retrieved results.

How does that sound, @yanliang567?
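To make the idea concrete, a tiny sketch of that alpha-beta blend (the source names, weights, and alpha value are illustrative, not anything Milvus provides):

```python
# Per-source trust weights, normalized to [0, 1]; purely illustrative.
SOURCE_WEIGHTS = {"official_docs": 1.0, "forum": 0.6, "web_scrape": 0.3}

def blended_score(similarity, source, alpha=0.7):
    """alpha weighs raw similarity; (1 - alpha) weighs per-source trust."""
    beta = 1.0 - alpha
    return alpha * similarity + beta * SOURCE_WEIGHTS.get(source, 0.5)
```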

xiaofan-luan commented 3 weeks ago

Now I want to give weights to each of these collections, which will affect the search results. Is there anything for this type of scenario?

What about one collection with two vector fields? Check out hybrid search in Milvus.

xiaofan-luan commented 3 weeks ago


Also, as a separate idea: suppose in the same collection I have a metadata column "Source" that records the different sources the vector DB was built from. Now imagine the Source column contains 10 unique sources, and I want to give a weight to each source so that the weight is folded into the similarity function.

I would want an alpha-beta balance between the similarity score and the source weight to produce the final ranking of retrieved results.

How does that sound, @yanliang567?

How would rerank work in this case?

Basically, you search for the top 100 most similar results, then rerank using the source field together with the original score.
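Concretely, that flow might look like this in pymilvus (reusing the hypothetical `blended_score` sketch above; the collection, field, and parameter names are illustrative):

```python
from pymilvus import Collection

col = Collection("documents")
col.load()
hits = col.search(
    data=[[0.1] * 128],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=100,                        # over-fetch ...
    output_fields=["source"],         # ... and pull back the scalar field
)[0]

# ... then rerank client-side on similarity plus the per-source weight.
reranked = sorted(
    hits,
    key=lambda h: blended_score(h.distance, h.entity.get("source")),
    reverse=True,
)[:10]
```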

jayitsaha commented 3 weeks ago

So I solved it using a custom weighted-retriever function. The idea was this: suppose we have one collection and one vector field, but the collection holds many different source types, such as CSVs, PDFs, HTML docs, and more. Now imagine we have a total of 60 source docs, of which 45 are not that reliable, so I want custom scoring on top of the cosine similarity. Reranking won't work for this use case, and the knowledge of which data sources are reliable and which are not has to be passed in explicitly. Since none of the vector DBs I used allowed changing the similarity-function logic, I created my own.

xiaofan-luan commented 3 weeks ago

It doesn't make sense to support this at search time, because every vector database builds its index with a specific distance function.

Usually you cannot create an index with one distance metric and then search with another.

xiaofan-luan commented 3 weeks ago

So the weighted ranking can only happen at the rerank stage.

jayitsaha commented 3 weeks ago

Absolutely correct. Initially I was looking to inject custom logic into the distance metric; that would have done the job.

Yes, I wasn't planning to do it at the rerank stage, so the easy fix was to take a large k, say 200.

Those 200 retrieved docs are a mix of reliable and unreliable sources, and on top of them I applied my custom similarity/weighted-RAG function.

And this time, say, out of 200 only 20 made it through, based on the adjusted scores.

Basically, I changed the distance metric a little, applying one weight for unreliable sources and a different weight for reliable ones. A sketch of this flow follows below.

We can definitely get on a call and discuss this further if required.
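A rough sketch of that over-fetch-and-filter flow (the reliable-source set, the weights, and the cutoff are all illustrative assumptions):

```python
RELIABLE = {"curated_docs", "reviewed_faq"}       # illustrative source names

hits = col.search(
    data=[[0.1] * 128],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 16}},
    limit=200,                                    # take a large k first
    output_fields=["source"],
)[0]

def adjusted_score(hit, w_reliable=1.0, w_unreliable=0.4):
    """Scale the raw similarity by a per-source reliability weight."""
    w = w_reliable if hit.entity.get("source") in RELIABLE else w_unreliable
    return hit.distance * w

# Keep only the hits that still clear a cutoff after reweighting.
survivors = sorted(hits, key=adjusted_score, reverse=True)
survivors = [h for h in survivors if adjusted_score(h) >= 0.5][:20]
```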

xiaofan-luan commented 3 weeks ago

Hi Jayit,

This seems to be exactly what we want to do. I already opened an issue for that; feel free to comment if that's what you need.

For usage: we first do an NN search with IP or L2 distance, get 200 results, and also retrieve some scalar fields, like time or category; then the user is allowed to write a UDF to rerank based on the similarity distance and those fields.

For example, you can decay by time, or add a special weight for hot products.

https://calendly.com/xiaofan-luan/james-emeet
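On the client side today, that kind of rerank UDF might look roughly like this (the `created_at` and `is_hot` scalar fields and the decay constants are assumptions for illustration; this is not a Milvus server feature yet):

```python
import time

def rerank_udf(hit, half_life_days=30.0, hot_boost=1.2):
    """Decay the similarity score by document age; boost 'hot' items."""
    # created_at assumed to be epoch seconds stored as a scalar field.
    age_days = (time.time() - hit.entity.get("created_at")) / 86400.0
    decay = 0.5 ** (age_days / half_life_days)
    boost = hot_boost if hit.entity.get("is_hot") else 1.0
    return hit.distance * decay * boost

# hits fetched with output_fields=["created_at", "is_hot"]:
# ranked = sorted(hits, key=rerank_udf, reverse=True)
```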

jayitsaha commented 3 weeks ago

aaah pretty much the same yes! That's amazing! 😇

However, there are a few caveats to this. Imagine a LangChain implementation of the same thing (it was really a tough one): when you get those 200 results from the DB, you write the UDF to weight them with custom logic. But then another use case came up, maybe using MMR or something, and for that I had to load results into a vector DB in real time, which is a very costly process. I'm still trying to figure out a more efficient solution. And so, circling back to the same idea: what if we let the user directly define their own distance metric? They may choose to include any number of parameters, but as long as the value stays within the expected limits, Milvus is unaffected.

PS: I achieved the LangChain implementation by subclassing the BaseRetriever class and making changes on top of it; that made my weighted-RAG UDF work like a charm.

xiaofan-luan commented 3 weeks ago

MMR is an interesting rerank algorithm to implement!
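For reference, a standalone sketch of the textbook MMR (maximal marginal relevance) rerank over candidate vectors, where `lam` trades query relevance against diversity among the picks:

```python
import numpy as np

def mmr_rerank(query_vec, cand_vecs, k=10, lam=0.7):
    """Greedy MMR: each step picks the candidate maximizing
    lam * sim(query, d) - (1 - lam) * max sim(d, already chosen)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    remaining = list(range(len(cand_vecs)))
    chosen = []
    while remaining and len(chosen) < k:
        best = max(
            remaining,
            key=lambda i: lam * cos(query_vec, cand_vecs[i])
            - (1 - lam) * max((cos(cand_vecs[i], cand_vecs[j]) for j in chosen),
                              default=0.0),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen   # indices into cand_vecs, in selection order
```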

jayitsaha commented 3 weeks ago

Affirmative! 😇