stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Indexing stuck at encoding passages #355

Open shubham526 opened 1 month ago

shubham526 commented 1 month ago

I have a huge collection of 116 million passages. I am trying to create a ColBERT index for them using the indexing code from the README. To manage the huge size, I am indexing the passages in batches of 1000. However, the indexing step seems to be stuck at the encoding stage:

2024-07-09 09:15:44,053 - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[Jul 09, 09:15:50] [1]           #> Encoding 999 passages..
[Jul 09, 09:15:50] [0]           # of sampled PIDs = 2000        sampled_pids[:3] = [853, 1500, 20]
[Jul 09, 09:15:50] [0]           #> Encoding 1001 passages..

Is it supposed to take this long? I am not sure whether I am doing something wrong.

shubham526 commented 1 month ago

@okhat

okhat commented 1 month ago

It shouldn't get stuck, so if it does, that's odd. But don't index in batches of 1,000 passages; index maybe 10 million at a time.
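
A minimal sketch of the suggested chunking, assuming the collection is processed as fixed-size slices; the helper name and offsets below are illustrative, not part of the ColBERT API:

```python
def batch_offsets(num_passages, batch_size):
    """Yield (start, end) slice bounds covering the whole collection."""
    for start in range(0, num_passages, batch_size):
        yield start, min(start + batch_size, num_passages)

# 116M passages in 10M-passage batches -> 12 slices,
# instead of 116,000 slices with 1,000-passage batches.
offsets = list(batch_offsets(116_000_000, 10_000_000))
```

Each (start, end) slice would then be written out as its own collection file and indexed in one `Indexer.index(...)` call, so per-batch startup costs (model loading, distributed barriers, PID sampling) are paid 12 times rather than 116,000 times.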