stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
MIT License
2.68k stars 355 forks

Indexing by using Faiss in ColBERTv1 #251

MarkLee131 closed 9 months ago

MarkLee131 commented 9 months ago

Hello @okhat,

I am working on a document retrieval project using ColBERTv1. While I was able to train the model successfully following the steps provided in the README, I encountered issues during the indexing phase. I would appreciate your guidance on resolving them.

Commands Used for Indexing: I tried indexing my dataset in two steps, as suggested in Issue #73:

  1. Preparing the collections and creating the indexes:
CUDA_VISIBLE_DEVICES="0,1,2,3" OMP_NUM_THREADS=6 \
python -m torch.distributed.launch --nproc_per_node=4 -m \
colbert.index --amp --doc_maxlen 512 --mask-punctuation --bsize 512 \
--checkpoint /mnt/local/Baselines_Bugs/ColBERT/commits_exp/commits_train/train.py/test.l2/checkpoints/colbert.dnn \
--collection /mnt/local/Baselines_Bugs/ColBERT/data/collection_all.tsv \
--index_root /mnt/local/Baselines_Bugs/ColBERT/commits_indexes --index_name train_index \
--root index_output --experiment commits_train

For this step, we successfully got the index files under the folder: commits_indexes/train_index.

  2. Then we tried end-to-end indexing using Faiss:

The command we used for it:

export RANK=0
export CUDA_VISIBLE_DEVICES="0,1,2,3"
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29501
python -m colbert.index_faiss \
--index_root /mnt/local/Baselines_Bugs/ColBERT/commits_indexes --index_name train_index \
--partitions 4715 --sample 0.3 \
--root index_output --experiment commits_train

However, the process appears to hang, and nvidia-smi shows no GPU usage:

Mon Sep 18 16:22:43 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1B:00.0 Off |                  N/A |
| 30%   31C    P8    22W / 350W |      2MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:1C:00.0 Off |                  N/A |
| 30%   29C    P8    21W / 350W |      2MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 30%   31C    P8    19W / 350W |      2MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:1E:00.0 Off |                  N/A |
| 30%   31C    P8    20W / 350W |      2MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

MarkLee131 commented 9 months ago

Well, I found the cause. The code in loaders.py searches for files with the .pt extension, but my collections had been saved as .tsv. So I need to switch the corresponding lines between the first step (.tsv) and the second step (.pt).
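
For anyone hitting the same hang, it may help to check which file extensions are actually present in the index directory before launching the Faiss step. The snippet below is only a minimal sketch, not ColBERT code: the helper names `count_extensions` and `has_pt_parts` are hypothetical, it does not reproduce the logic in loaders.py, and it assumes the index files sit directly inside the index directory.

```python
from collections import Counter
from pathlib import Path


def count_extensions(index_dir):
    """Count the regular files directly under index_dir by extension.

    Hypothetical helper for illustration; not part of ColBERT.
    """
    return Counter(p.suffix for p in Path(index_dir).iterdir() if p.is_file())


def has_pt_parts(index_dir):
    """Return True if at least one .pt file exists directly under index_dir."""
    return count_extensions(index_dir)[".pt"] > 0
```

Calling `count_extensions` on the `commits_indexes/train_index` folder from step 1 would show whether any `.pt` files exist there; if `has_pt_parts` returns False, a loader that looks only for `.pt` files would presumably find nothing to process.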