stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License
2.68k stars, 355 forks

ColBERT for handling multilingual passages and queries #252

Open KeshavSingh29 opened 9 months ago

KeshavSingh29 commented 9 months ago

Hello there,

First of all, thanks for the amazing work you guys have put out here. I have been playing around with ColBERT for a while and came across a problem.

I'm building a reranking pipeline where the passages (at most 15) and the input query can each be in any of several languages. I'm already using OpenAI embeddings for the first round of retrieval and would like to use ColBERT to rerank the results.
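
To make the reranking step concrete, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring over a small candidate set. It assumes token-level embeddings for the query and each passage are already available from some encoder; the toy 2-dimensional vectors and the names `maxsim_score` and `rerank` are illustrative and are not part of the ColBERT codebase.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs, passage_vecs):
    """Late interaction: for each query token, take the best-matching
    passage token, then sum those maxima over all query tokens.
    Assumes passage_vecs is non-empty."""
    q = [normalize(v) for v in query_vecs]
    p = [normalize(v) for v in passage_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, pv)) for pv in p)
        for qv in q
    )


def rerank(query_vecs, candidates, top_k=15):
    """candidates: list of (passage_id, passage_token_vectors) pairs,
    e.g. the first-stage retrieval results. Returns (id, score) pairs,
    best first."""
    scored = [(pid, maxsim_score(query_vecs, vecs)) for pid, vecs in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data (hypothetical): two query tokens, three candidate passages.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = [
    ("p2", [[1.0, 0.0]]),               # matches one query token
    ("p3", [[-1.0, 0.0]]),              # points away from the first token
    ("p1", [[1.0, 0.0], [0.0, 1.0]]),   # matches both query tokens
]
ranked = rerank(query, candidates)
print(ranked)
```

With at most 15 candidates per query, scoring every passage exhaustively like this is cheap, so no index is needed for the second stage; the open question is only which encoder produces the token embeddings.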

Is there a recommended way to use ColBERT for this purpose? Should I use a multilingual model for encoding, or should I fine-tune an existing checkpoint on my domain-specific data? I have 10,000 instances, with a 1:10 ratio of positive to negative passages per query.
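
For the fine-tuning option, data shaped like this (one positive and ten negatives per query) would typically be expanded into (query, positive, negative) triples. The sketch below shows one way to do that; the field names `query`, `positives`, and `negatives` and the helper `build_triples` are hypothetical, and the exact file format the trainer expects should be checked against the repository's README.

```python
import random


def build_triples(instances, negatives_per_positive=1, seed=0):
    """Expand labeled instances into (query, positive, negative) triples.

    Each instance is a dict with hypothetical keys:
      "query": str, "positives": list[str], "negatives": list[str].
    For every positive, sample up to `negatives_per_positive` negatives
    without replacement.
    """
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    triples = []
    for inst in instances:
        for pos in inst["positives"]:
            k = min(negatives_per_positive, len(inst["negatives"]))
            for neg in rng.sample(inst["negatives"], k):
                triples.append((inst["query"], pos, neg))
    return triples


# Toy instance (made up) with the 1:10 positive-to-negative ratio.
instances = [
    {
        "query": "example query",
        "positives": ["relevant passage"],
        "negatives": [f"irrelevant passage {i}" for i in range(10)],
    }
]

triples = build_triples(instances, negatives_per_positive=10)
print(len(triples))
```

Using all ten negatives per positive yields ten triples per query here; sampling fewer per positive is an option if training time matters more than coverage.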

I'm currently focusing on Japanese, English, Chinese, Korean, and German.

Thanks, any advice is much appreciated.