Parallel indexing operations

This is not an issue.

I am testing indexing of 25 million passages. The most time-consuming step is saving the embeddings sequentially (see the attached images). I have 1000 chunk-saving jobs and each one takes roughly 1 minute to finish, so about 1000 minutes in total. I am wondering:

a. Is there any discussion of how doc_maxlen affects retrieval effectiveness?

b. Is there any way to speed up this embedding/chunk saving job?

c. I expect to end up with a 200GB index file. Is there an existing distributed search framework in case the index gets even larger? What I have in mind is to run 5 indexing jobs with separate index files, use one query to search each index file for its top 100 results, and then rerank the combined 500 results. I guess this should give the same result as indexing everything in one job, and it would let us run the indexing operations in parallel.
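To make the idea in (c) concrete, here is a minimal sketch of the merge step only. It does not use the ColBERT API: `merge_shard_results`, `search_sharded`, and the per-shard search callables are hypothetical names for illustration. It assumes each shard returns `(passage_id, score)` pairs with higher scores being better, and that scores are comparable across shards. Under that assumption, and if each shard's top-k is exact, the merged top-k matches a single-index top-k. If each index is compressed or pruned independently, the results may differ somewhat.

```python
import heapq
from typing import Callable, Iterable, List, Sequence, Tuple

# (passage_id, score); higher score means a better match (assumption).
Hit = Tuple[str, float]


def merge_shard_results(shard_hits: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge per-shard top-k lists into one global top-k list by score."""
    all_hits = [hit for hits in shard_hits for hit in hits]
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])


def search_sharded(
    query: str,
    shard_searchers: Sequence[Callable[[str, int], List[Hit]]],
    k: int,
) -> List[Hit]:
    """Query every shard for its own top-k, then keep the best k overall.

    Each element of shard_searchers is a hypothetical callable
    (query, k) -> list of hits, standing in for a search over one index file.
    """
    per_shard = (search(query, k) for search in shard_searchers)
    return merge_shard_results(per_shard, k)
```

With 5 shards and k = 100, this is the "500 candidates, keep the best 100" step described above. The shard searches themselves are independent, so they could be dispatched to separate processes or machines before merging.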

stanford-futuredata / ColBERT

Parallel indexing operations #296