milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Feature]: Will milvus support BGE-M3['Colbert'] storage and search? #31581

Open 302746420 opened 7 months ago

302746420 commented 7 months ago

Is there an existing issue for this?

Is your feature request related to a problem? Please describe.

In the 2.4.0 release, BGE-M3 is supported, but only in dense and sparse mode. Will you support BGE-M3['Colbert'] storage and search? Or is there any existing way in Milvus to insert a matrix like the ColBERT type? Thanks!

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

yiwen92 commented 7 months ago

We did some research on ['Colbert']; it seems to need much more memory and special storage support, so we haven't decided whether to support it or not. Do you have some solid background on why you are using BGE-M3['Colbert']?

302746420 commented 7 months ago

> We did some research on ['Colbert']; it seems to need much more memory and special storage support, so we haven't decided whether to support it or not. Do you have some solid background on why you are using BGE-M3['Colbert']?

We run a RAG service for our users. In our experiments, we found that combining the ['colbert'] mode with ['sparse'] and ['dense'] greatly improves recall. So we want to know whether there is any way to store the BGE-M3['Colbert'] vectors.

xiaofan-luan commented 7 months ago

Quick question: is it ColBERT for retrieval or ColBERT for ranking? I think this is an interesting design decision; from our perspective we didn't see much improvement in search quality with ColBERT, and ColBERT takes a large amount of storage. So we would be interested in what kind of experiments you have done.

Feel free to contact me at xiaofan.luan@zilliz.com; I'd be happy to have a quick talk with you.

302746420 commented 7 months ago

The answer is retrieval. We use bge-m3[dense] for coarse recall and bge-m3[dense, sparse, colbert] for fine recall. The hit rate increased by 2% compared to using only dense and sparse.
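
For context, a minimal sketch of how all three BGE-M3 representations can be obtained with the FlagEmbedding library (this assumes the documented `BGEM3FlagModel` API; how the three scores are combined downstream is up to the application):

    # Sketch: encode with BGE-M3 and request dense, sparse, and ColBERT
    # representations in one call (FlagEmbedding's documented API).
    from FlagEmbedding import BGEM3FlagModel

    model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

    docs = ["Milvus is a cloud-native vector database."]
    out = model.encode(
        docs,
        return_dense=True,         # one 1024-dim vector per document
        return_sparse=True,        # lexical weights per token id
        return_colbert_vecs=True,  # one vector per token (ColBERT-style)
    )
    dense_vecs = out["dense_vecs"]          # shape: (n_docs, 1024)
    sparse_weights = out["lexical_weights"]
    colbert_vecs = out["colbert_vecs"]      # list of (n_tokens, 1024) arrays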

xiaofan-luan commented 7 months ago

> The answer is retrieval. We use bge-m3[dense] for coarse recall and bge-m3[dense, sparse, colbert] for fine recall. The hit rate increased by 2% compared to using only dense and sparse.

What ANN library/system are you currently using?

xiaofan-luan commented 7 months ago

Vespa? Storing all the token embeddings might consume a lot of memory.

302746420 commented 7 months ago

> The answer is retrieval. We use bge-m3[dense] for coarse recall and bge-m3[dense, sparse, colbert] for fine recall. The hit rate increased by 2% compared to using only dense and sparse.
>
> What ANN library/system are you currently using?

FLAT, we do not use ANN. But we have tested Milvus's HNSW; it's great.

302746420 commented 7 months ago

> Vespa? Storing all the token embeddings might consume a lot of memory.

Memory really is a problem we have to consider. We will test it and balance cost against efficiency.

Thanks for your reply.

Mycroft-s commented 7 months ago

Quick question: does ColBERT still take up a lot of memory during the retrieval phase? Could we do the embedding first and the retrieval later?

xiaofan-luan commented 7 months ago

> Quick question: does ColBERT still take up a lot of memory during the retrieval phase? Could we do the embedding first and the retrieval later?

ColBERT is a token-level embedding, which means you need one embedding for each token. That's why it takes a lot of memory.
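
To make the memory point concrete, a back-of-the-envelope comparison, assuming float32 vectors, BGE-M3's 1024-dim output, and 256 tokens per document (illustrative numbers, not a benchmark):

    # Rough storage estimate: dense keeps one vector per document,
    # ColBERT keeps one vector per token.
    dim = 1024           # BGE-M3 embedding dimension
    bytes_per_float = 4  # float32
    tokens_per_doc = 256
    n_docs = 1_000_000

    dense_bytes = n_docs * dim * bytes_per_float
    colbert_bytes = n_docs * tokens_per_doc * dim * bytes_per_float

    print(f"dense:   {dense_bytes / 2**30:.1f} GiB")    # ~3.8 GiB
    print(f"colbert: {colbert_bytes / 2**30:.1f} GiB")  # ~976.6 GiB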

Mycroft-s commented 7 months ago

> Quick question: does ColBERT still take up a lot of memory during the retrieval phase? Could we do the embedding first and the retrieval later?
>
> ColBERT is a token-level embedding, which means you need one embedding for each token. That's why it takes a lot of memory.

We do not have a vast dataset, so we want to use the BGE-M3 hybrid retrieval method with [sparse, dense, colbert] vectors. If ColBERT does not take too much time and memory during retrieval, we think we could use it for our RAG.

agoriwmt commented 6 months ago

> Quick question: does ColBERT still take up a lot of memory during the retrieval phase? Could we do the embedding first and the retrieval later?
>
> ColBERT is a token-level embedding, which means you need one embedding for each token. That's why it takes a lot of memory.

Doesn't PLAID address this problem?

xiaofan-luan commented 6 months ago

> PLAID

What is PLAID? Any more details?

agoriwmt commented 6 months ago

Sorry, this is the paper: https://arxiv.org/abs/2205.09707 and I think there is an implementation here too: https://github.com/bclavie/RAGatouille

Basically, it doesn't store every single embedding naively, but creates k centroids and then stores just the quantized residuals with respect to those centroids (my understanding; I am not a researcher, etc.).
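
A toy sketch of that centroid-plus-quantized-residual idea (my reading of the paper, not the actual PLAID implementation, which packs things far more aggressively):

    # Toy residual quantization: store a centroid id (1 byte) plus an
    # int8-quantized residual (1 byte/dim) instead of a float32 vector.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(10_000, 128)).astype(np.float32)

    kmeans = KMeans(n_clusters=256, n_init=10, random_state=0).fit(vecs)
    ids = kmeans.predict(vecs)
    residuals = vecs - kmeans.cluster_centers_[ids]

    # Quantize each residual to int8 with a single global scale.
    scale = np.abs(residuals).max() / 127
    q = np.round(residuals / scale).astype(np.int8)

    # Reconstruction: centroid + rescaled residual approximates the vector.
    approx = kmeans.cluster_centers_[ids] + q.astype(np.float32) * scale
    print("max reconstruction error:", np.abs(vecs - approx).max())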

Also, someone recently managed to index 500 million documents with an extended approach, in a streaming fashion: not all documents are known beforehand, but are indexed as they arrive... as you may guess, early centroid selection can influence performance, but they managed to solve this problem (again, my understanding...):

https://arxiv.org/abs/2405.00975

xiaofan-luan commented 6 months ago

In our tests, ColBERT was only slightly better than dense embeddings; correct me if there is a dataset where ColBERT outperforms dense embeddings by a large margin. That's why we doubt this is really helpful. Storing a quantized version does help reduce storage cost, but isn't it still dramatically larger than the dense embedding size?

agoriwmt commented 6 months ago

I have never compared the size of a dense solution vs. ColBERT. However, I found a presentation (https://web.stanford.edu/class/cs224v/lectures/ColBERT-Stanford-224V-talk-Nov2023.pdf) where they go from 150 GB to 25 GB or less, depending on the compression chosen. This is for the "MS MARCO Passage Ranking" dataset. The same dataset is said to be around 26 GB with HNSW: https://arxiv.org/html/2312.01556v1 (if I am reading this correctly).

xiaofan-luan commented 6 months ago

Agreed. I have also read that with heavy quantization ColBERT can reach performance similar to dense embeddings.

@wxywb has some perf results that can be shared. If anyone has done a successful POC with ColBERT embeddings, please share your results.

zc277584121 commented 6 months ago

I have tested the original ColBERT v2. It shows that when used for reranking, it is relatively fast compared with a cross-encoder and gives a slight improvement over the embedding model. But it is indeed more expensive in vector storage and distance calculation.

agoriwmt commented 6 months ago

@xiaofan-luan what about this? https://superlinked.com/vectorhub/articles/evaluation-rag-retrieval-chunking-methods

They claim ColBERT is better on all datasets they tested, by a 10% margin. This is consistent with my tests, but I have nothing particularly robust to share at the moment.

Also, as a further benefit, consider that ColBERT is less of a black box than other solutions and could even, from a UX perspective, offer a highlighting feature (like we used to have with lexical solutions).

oedemis commented 3 weeks ago

What's the actual status here? Any updates or showcases for both retrieval and re-ranking?

xiaofan-luan commented 2 weeks ago

ColBERT and reranking are on our plan for 3.0.

xiaofan-luan commented 2 weeks ago

@liliu-z is actually working on ColBERT support and integration. Yes, ColBERT is usually good on accuracy, but on the other side it takes more memory, which generally means cost.

We are still working on ways to reduce the cost of ColBERT.

liliu-z commented 2 weeks ago

/assign

liliu-z commented 2 weeks ago

/reopen

sre-ci-robot commented 2 weeks ago

@liliu-z: Reopened this issue.

In response to [this](https://github.com/milvus-io/milvus/issues/31581#issuecomment-2421517446):

> /reopen

sskserk commented 1 week ago

Hi Guys,

That's a highly desired feature. Do you have any predictions on the release date? ...let me volunteer and test it for you using real production data. Could you share a pre-release? I would test it for you.

sskserk commented 1 week ago

> ColBERT and reranking are on our plan for 3.0.

Would it be possible to obtain a dev branch for testing?

liliu-z commented 1 week ago

> Hi Guys,
>
> That's a highly desired feature. Do you have any predictions on the release date? ...let me volunteer and test it for you using real production data. Could you share a pre-release? I would test it for you.

Can I ask how you are using ColBERT now, and how well it works in your scenarios? This can help us better hear the voice of the community and reprioritize the work.

We are still in the investigation stage for now, since merging it with the current API is a big challenge. However, we can do something outside Milvus (on the client side) to mimic the behavior of ColBERT. We will share results once we have any, and any input about ColBERT is more than welcome.

gabor-one commented 1 week ago

Hey @liliu-z !

> Can I ask how you are using ColBERT now, and how well it works in your scenarios? This can help us better hear the voice of the community and reprioritize the work.

The biggest feature of ColBERT (in my opinion) is that the tokens are context-aware. Now, with the Jina AI model, we can late-chunk, which somewhat solves the cross-chunk reference problem (e.g., Chunk 1: "Joe is a pilot." Chunk 2: "He is the best." -> we will never know that Joe is the best pilot). ColBERT would be my choice of embedding method in any case where accuracy is more important than speed (e.g., any scientific application).

Despiko commented 1 week ago

> Hi Guys, that's a highly desired feature. Do you have any predictions on the release date? ...let me volunteer and test it for you using real production data. Could you share a pre-release? I would test it for you.
>
> Can I ask how you are using ColBERT now, and how well it works in your scenarios? This can help us better hear the voice of the community and reprioritize the work.
>
> We are still in the investigation stage for now, since merging it with the current API is a big challenge. However, we can do something outside Milvus (on the client side) to mimic the behavior of ColBERT. We will share results once we have any, and any input about ColBERT is more than welcome.

I am a colleague of [sskserk]. We use ColPali (the vision analog of ColBERT) in vision RAG. We want to skip table and image extraction and build a RAG that we can use to retrieve images and send them to vLLM. This approach significantly improved metrics compared to classic text RAG. We can use Milvus with external functions... but this approach is not very effective, so it would be great to have it inside Milvus.

sskserk commented 1 week ago

> Hi Guys, that's a highly desired feature. Do you have any predictions on the release date? ...let me volunteer and test it for you using real production data. Could you share a pre-release? I would test it for you.
>
> Can I ask how you are using ColBERT now, and how well it works in your scenarios? This can help us better hear the voice of the community and reprioritize the work. We are still in the investigation stage for now, since merging it with the current API is a big challenge. However, we can do something outside Milvus (on the client side) to mimic the behavior of ColBERT. We will share results once we have any, and any input about ColBERT is more than welcome.
>
> I am a colleague of [sskserk]. We use ColPali (the vision analog of ColBERT) in vision RAG. We want to skip table and image extraction and build a RAG that we can use to retrieve images and send them to vLLM. This approach significantly improved metrics compared to classic text RAG. We can use Milvus with external functions... but this approach is not very effective, so it would be great to have it inside Milvus.

@liliu-z here is what we are after.

liliu-z commented 1 week ago

> Hi Guys, that's a highly desired feature. Do you have any predictions on the release date? ...let me volunteer and test it for you using real production data. Could you share a pre-release? I would test it for you.
>
> Can I ask how you are using ColBERT now, and how well it works in your scenarios? This can help us better hear the voice of the community and reprioritize the work. We are still in the investigation stage for now, since merging it with the current API is a big challenge. However, we can do something outside Milvus (on the client side) to mimic the behavior of ColBERT. We will share results once we have any, and any input about ColBERT is more than welcome.
>
> I am a colleague of [sskserk]. We use ColPali (the vision analog of ColBERT) in vision RAG. We want to skip table and image extraction and build a RAG that we can use to retrieve images and send them to vLLM. This approach significantly improved metrics compared to classic text RAG. We can use Milvus with external functions... but this approach is not very effective, so it would be great to have it inside Milvus.
>
> @liliu-z here is what we are after.

Hi @sskserk @Despiko @gabor-one,

Thanks for this information. We have been investigating this for some time and have put it on our roadmap. Updates will keep being shared. Stay tuned!

sskserk commented 1 week ago

@liliu-z & @xiaofan-luan,

...as you are on the way, @Despiko and I are ready to test the feature as development progresses. Please rely on us... a big company with 50K+ employees is behind us ;-)

Thank you for the confirmation!

codingjaguar commented 1 week ago

Hi @sskserk @Despiko @gabor-one, thank you so much for your interest in Milvus! While we are working on supporting ColBERT/ColPali natively, we have just published a reference implementation for doing ColBERT-style retrieval on the client side with Milvus: https://milvus.io/docs/use_ColPali_with_milvus.md

Basically, it stores token/sequence vectors as individual rows in Milvus:

        # Each row holds one token vector, tagged with its position in the
        # sequence (seq_id) and its parent document (doc_id).
        self.client.insert(
            self.collection_name,
            [
                {
                    "vector": colbert_vecs[i],
                    "seq_id": seq_ids[i],
                    "doc_id": doc_ids[i],
                    "doc": docs[i],
                }
                for i in range(seq_length)
            ],
        )
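
For completeness, here is a sketch of the collection schema such an insert presumes (field names match the snippet above; the URI, collection name, and dim are illustrative, and the linked doc has the reference version):

    # Sketch: a collection whose rows are individual token vectors,
    # matching the field names used in the insert above.
    from pymilvus import DataType, MilvusClient

    client = MilvusClient(uri="http://localhost:19530")  # illustrative URI

    schema = client.create_schema(auto_id=True)
    schema.add_field(field_name="pk", datatype=DataType.INT64, is_primary=True)
    schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=1024)
    schema.add_field(field_name="seq_id", datatype=DataType.INT16)  # token position
    schema.add_field(field_name="doc_id", datatype=DataType.INT64)  # parent document
    schema.add_field(field_name="doc", datatype=DataType.VARCHAR, max_length=65535)

    client.create_collection(collection_name="colbert", schema=schema)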

and does a heuristic search over individual vectors to find potentially related docs, then fetches all vectors of those docs to compute MaxSim and re-rank them:

        def rerank_single_doc(doc_id, data, client, collection_name):
            # Rerank a single document by retrieving its embeddings and calculating the similarity with the query.
            doc_colbert_vecs = client.query(
                collection_name=collection_name,
                filter=f"doc_id in [{doc_id}, {doc_id + 1}]",
                output_fields=["seq_id", "vector", "doc"],
                limit=1000,
            )
            doc_vecs = np.vstack(
                [doc_colbert_vecs[i]["vector"] for i in range(len(doc_colbert_vecs))]
            )
            # MaxSim: for each query token vector, take the max similarity
            # over all doc token vectors, then sum across query tokens.
            score = np.dot(data, doc_vecs.T).max(1).sum()
            return (score, doc_id)
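
And a usage sketch following the same doc's pattern: fan the per-document rerank out over a thread pool and keep the top results (`data` is the query's token-vector matrix and `doc_ids` the candidate set from the first-stage search; both names are carried over from the snippets above):

    # Usage sketch: rerank candidate docs concurrently, then keep the top 5.
    import concurrent.futures

    with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
        futures = [
            pool.submit(rerank_single_doc, doc_id, data, client, "colbert")
            for doc_id in doc_ids
        ]
        scores = [f.result() for f in concurrent.futures.as_completed(futures)]

    top_k = sorted(scores, key=lambda x: x[0], reverse=True)[:5]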

Of course the performance won't be as good as a native implementation, but it should work for small-scale workloads like prototypes or experiments. If you have a large-scale production workload using ColBERT-style retrieval, we'd love to learn more; we can set up a chat to talk about that.