stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License
2.95k stars 377 forks

How to index large corpus in mini batches? #211

Closed by sgowdaks 1 year ago

sgowdaks commented 1 year ago

Hi, I am trying to index a very large corpus, so I am splitting it into mini-batches and passing each one to indexer.index(name="msmarco.nbits-2", collection=batch, overwrite='resume')

Even though I am specifying resume, I do not see any change in the size of the files in the experiments directory. Is there a way to index in mini-batches? My workstation hangs if I pass the whole corpus; my CPU and GPU each have 48 GB of memory.

Thank You

EDIT: Sorry, it was a problem with my data. I was able to index the corpus. Thanks.
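
For readers who land here with the same question, below is a minimal sketch of the batching step only: splitting a corpus into fixed-size mini-batches in plain Python. The helper name mini_batches, the batch size, and the toy corpus are all hypothetical and not part of ColBERT. Whether indexer.index with overwrite='resume' adds each new batch to an existing index is not confirmed anywhere in this thread, so the sketch deliberately stops short of calling it.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def mini_batches(passages: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` passages.

    Hypothetical helper for illustration; it is not a ColBERT function.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(passages)
    while True:
        # Pull the next slice lazily so the full corpus is never copied at once.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Toy stand-in for a real collection (assumption: one passage per string).
corpus = [f"passage {i}" for i in range(10)]
batches = list(mini_batches(corpus, 4))
print([len(b) for b in batches])  # prints [4, 4, 2]
```

Because the helper accepts any iterable, it can also be fed a generator that reads passages line by line from disk, which keeps memory use bounded by the batch size rather than the corpus size.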