stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License
2.95k stars 377 forks

Parallel indexing operations #296

Open deter3 opened 8 months ago

deter3 commented 8 months ago

This is not an issue.

I am testing indexing of 25 million passages. The most time-consuming step is saving the embeddings, which happens sequentially (see the attached images). I have 1000 chunk-saving jobs, and each one takes roughly 1 minute to finish. I am just wondering:

a. Is there any discussion of how doc_maxlen affects retrieval effectiveness?

b. Is there any way to speed up this embedding/chunk saving job?

c. I will end up with a 200GB index file. Is there an existing distributed search framework for when the index grows even larger? (I am thinking we could run 5 indexing jobs, each with its own index file, then send one query to each index for its top 100 results, and rerank the combined 500 results. I guess this should give the same result as indexing in a single job. With this mechanism we could run the indexing operations in parallel.)
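To make the idea in (c) concrete, here is a minimal sketch of the merge step in plain Python. It does not call any ColBERT API. The function name `merge_shard_results`, the `offsets` mapping, and the toy scores are all hypothetical, and it assumes that scores from separately built indexes are directly comparable, which I have not verified for compressed indexes.

```python
import heapq
from typing import Dict, List, Tuple

# One list of (local passage id, score) pairs per shard.
ShardHits = List[Tuple[int, float]]


def merge_shard_results(
    per_shard: Dict[int, ShardHits],
    offsets: Dict[int, int],
    k: int,
) -> List[Tuple[int, float]]:
    """Merge per-shard hits into a global top-k list, best score first.

    `offsets[shard]` is the global id of that shard's first passage, so
    local ids from different shards do not collide (hypothetical scheme).
    """
    candidates = (
        (offsets[shard] + local_pid, score)
        for shard, hits in per_shard.items()
        for local_pid, score in hits
    )
    # Keep the k highest-scoring candidates across all shards.
    return heapq.nlargest(k, candidates, key=lambda hit: hit[1])


# Toy data: two shards, each already returned its own top results.
per_shard = {
    0: [(3, 27.5), (1, 21.0)],
    1: [(0, 30.2), (4, 19.8)],
}
offsets = {0: 0, 1: 5_000_000}

top = merge_shard_results(per_shard, offsets, k=3)
print(top)
```

If every shard scores passages exactly as a single index would, the global top k is always contained in the union of the per-shard top k lists, so merging by score loses nothing. Whether that holds when each index is built and compressed independently is the open part of the question.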

[attached image]