I am testing indexing 25 million passages. The most time-consuming job is saving the embeddings sequentially (see the attached images). I have 1000 chunk-saving jobs, and each one takes roughly 1 minute to finish. I am just wondering:
a. Is there any discussion of how doc_maxlen impacts retrieval effectiveness?
b. Is there any way to speed up this embedding/chunk-saving job?
c. I will end up with a ~200GB index file. Is there an existing distributed search framework in case the index grows even larger? (Actually, I am thinking we could run 5 indexing jobs producing separate index files, then send each query to every index file for its top 100 results and rerank the combined 500 results. I guess this should give the same results as indexing everything in one job. With this mechanism, we could also parallelize the indexing itself.)
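To make the idea in (c) concrete, here is a minimal sketch of the scatter-and-merge step. It assumes a hypothetical per-shard searcher interface (`search(query, k) -> [(passage_id, score)]`), not any actual ColBERT API; the merge is exact as long as scores are computed the same way on every shard, so the globally best results are guaranteed to appear in some shard's top-k.

```python
import heapq

class ShardSearcher:
    """Stand-in for a per-shard index searcher (hypothetical interface):
    search(query, k) -> [(passage_id, score), ...] sorted by score."""
    def __init__(self, results):
        self._results = results

    def search(self, query, k):
        return sorted(self._results, key=lambda r: r[1], reverse=True)[:k]

def search_shards(query, searchers, k_per_shard=100, k_final=100):
    """Query every shard independently, then keep the globally best
    k_final results across all shards (the 'rerank' of the 5x100 candidates)."""
    candidates = []
    for shard_id, searcher in enumerate(searchers):
        for pid, score in searcher.search(query, k_per_shard):
            candidates.append((score, shard_id, pid))
    # Merge by score across shards; heapq avoids fully sorting all candidates.
    top = heapq.nlargest(k_final, candidates)
    return [(shard_id, pid, score) for score, shard_id, pid in top]
```

The per-shard `search` calls are independent, so they can also be dispatched in parallel (threads, processes, or separate machines) before the single cheap merge.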
This is not an issue report, just questions.