stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License

Indexing stuck at encoding passages #355

Open shubham526 opened 1 month ago

shubham526 commented 1 month ago

I have a huge collection of 116 million passages. I am trying to create a ColBERT index for them using the indexing code from the README. To manage the huge size, I am indexing the passages in batches of 1000. However, the indexing step seems to be stuck at the encoding stage:

2024-07-09 09:15:44,053 - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[Jul 09, 09:15:50] [1]           #> Encoding 999 passages..
[Jul 09, 09:15:50] [0]           # of sampled PIDs = 2000        sampled_pids[:3] = [853, 1500, 20]
[Jul 09, 09:15:50] [0]           #> Encoding 1001 passages..

Is it supposed to take this long? I am not sure whether I am doing something wrong.

shubham526 commented 1 month ago

@okhat

okhat commented 1 month ago

It shouldn't get stuck, so if it does, that's odd. But don't index in batches of 1,000 passages; index maybe 10 million at a time.
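
A minimal sketch of the suggested chunking, assuming the collection is processed as fixed-size slices; the helper name and offsets below are illustrative, not part of the ColBERT API:

```python
def batch_offsets(num_passages, batch_size):
    """Yield (start, end) slice bounds covering the whole collection."""
    for start in range(0, num_passages, batch_size):
        yield start, min(start + batch_size, num_passages)

# 116M passages in 10M-passage batches -> 12 slices,
# instead of 116,000 slices with 1,000-passage batches.
offsets = list(batch_offsets(116_000_000, 10_000_000))
```

Each (start, end) slice would then be written out as its own collection file and indexed in one `Indexer.index(...)` call, so per-batch startup costs (model loading, distributed barriers, PID sampling) are paid 12 times rather than 116,000 times.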