stanford-futuredata / ColBERT

ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)

colbert.train FAILED: running distributed training #213

liudan111 opened this issue 1 year ago. Status: Open.

liudan111 commented 1 year ago

```
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

colbert.train FAILED

Failures:
  [1]:
    time      : 2023-06-13_12:25:13
    host      : tu-c0r1n00.bullx
    rank      : 1 (local_rank: 1)
    exitcode  : 1 (pid: 182503)
    error_file: <N/A>
    traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
  [0]:
    time      : 2023-06-13_12:25:13
    host      : tu-c0r1n00.bullx
    rank      : 0 (local_rank: 0)
    exitcode  : 1 (pid: 182502)
    error_file: <N/A>
    traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

okhat commented 1 year ago

Hmm do you have the traceback? This isn't enough info to help understand what went wrong unfortunately
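(For what it's worth, when torch elastic only reports ChildFailedError with `error_file: <N/A>`, one way to surface the per-rank traceback is the `@record` decorator from torch.distributed.elastic, as the PyTorch docs linked in the log describe. A minimal sketch, with a hypothetical wrapper script name:)

```python
# Hypothetical wrapper (run_train.py) around the training entrypoint. The
# @record decorator makes torch.distributed.elastic write each failing rank's
# full Python traceback to the error_file shown in ChildFailedError, instead
# of the "<N/A>" placeholder in the log above.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # Placeholder: invoke the actual training code here, e.g. whatever
    # `python -m colbert.train ...` runs on each rank.
    ...

if __name__ == "__main__":
    main()
```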

okhat commented 1 year ago

also fwiw >95% of use cases don't need training; have you considered using the ColBERTv2 checkpoint we released? It's shown in the intro.ipynb notebook linked from the README.
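(For reference, using the released checkpoint looks roughly like the sketch below, loosely following the README and intro.ipynb; the index name and collection path are placeholders, and intro.ipynb remains the authoritative version.)

```python
# Sketch of indexing and searching with the released ColBERTv2 checkpoint,
# loosely following the README / intro.ipynb. The index name and collection
# path are placeholders; the collection is a TSV of `pid \t passage` lines.
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="notebook")):
        config = ColBERTConfig(nbits=2, doc_maxlen=300)

        # Encode and index the collection with the pretrained checkpoint.
        indexer = Indexer(checkpoint="colbert-ir/colbertv2.0", config=config)
        indexer.index(name="my_index.nbits=2", collection="collection.tsv", overwrite=True)

        # Query the index.
        searcher = Searcher(index="my_index.nbits=2", config=config)
        results = searcher.search("what is neural search?", k=10)
        for passage_id, rank, score in zip(*results):
            print(rank, passage_id, score)
```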

liudan111 commented 1 year ago

Hmm do you have the traceback? This isn't enough info to help understand what went wrong unfortunately

I am trying to train on my data with multiple nodes and multiple GPUs. Can I pass parameters like below:

```
OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK} \
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1) \
MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4)) \
WORLD_SIZE=${SLURM_NTASKS} \
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 \
    -m colbert.train --amp --query_maxlen 128 --doc_maxlen 256 --mask-punctuation \
    --bsize 32 --accum 1 \
    --triples ~/ColBERT/data/triples.train.small.tsv \
    --root model/colesm_cosine_vh --experiment psg --similarity cosine
```

okhat commented 1 year ago

@liudan111 I don't think we support multi-node unfortunately. Also, with batch size 32 you won't benefit from having 8 GPUs anyway. Four GPUs will be very fast!

liudan111 commented 1 year ago

@liudan111 I don't think we support multi-node unfortunately. Also, with batch size 32 you won't benefit from having 8 GPUs anyway. Four GPUs will be very fast!

Thanks for your reply. So for multiple nodes and GPUs, should I set a larger batch size to make use of all the nodes and GPUs?

okhat commented 1 year ago

Batch size 32 is fine. But please just use one node, with 4 gpus. That will work well
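(For reference, single-node training on 4 GPUs can also be driven from Python with the Trainer API described in the README, instead of the older `colbert.train` CLI. A rough sketch with placeholder file paths and starting checkpoint:)

```python
# Rough sketch of single-node, 4-GPU training with the Trainer API from the
# README (rather than `python -m colbert.train`). All file paths and the
# starting checkpoint below are placeholders.
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Trainer

if __name__ == "__main__":
    # nranks=4 spawns one training process per GPU on this single node.
    with Run().context(RunConfig(nranks=4, experiment="psg")):
        config = ColBERTConfig(
            bsize=32,            # global batch size, sharded across the 4 GPUs
            accumsteps=1,
            query_maxlen=128,
            doc_maxlen=256,
            similarity="cosine",
            root="experiments/",
        )
        trainer = Trainer(
            triples="triples.train.small.tsv",   # placeholder paths
            queries="queries.train.tsv",
            collection="collection.tsv",
            config=config,
        )
        trainer.train(checkpoint="bert-base-uncased")  # or a ColBERT checkpoint
```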

liudan111 commented 1 year ago

Batch size 32 is fine. But please just use one node, with 4 gpus. That will work well

That's great. I am still wondering whether I can adapt it to multiple nodes, since I have quite a large dataset and training is slow on one node.

okhat commented 1 year ago

Hmm how large is the data? I’ve never seen a meaningful benefit from training on more than 10M triples. With 4 gpus, you can train on this many in a couple of days, so it’s pretty fast.

We don’t support multi-node training unfortunately

liudan111 commented 1 year ago

I have 45M triples, so I am trying to change utils/distributed.py to make ColBERT work on 2 nodes with 8 GPUs. Does that mean training will not speed up, even if I use two nodes with 4 GPUs on each?
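(If you do go down the route of patching utils/distributed.py, the usual pattern for multi-node initialization with torch.distributed looks roughly like the sketch below. This is not ColBERT's actual code, just the generic `env://` recipe; it assumes the launcher, e.g. torchrun, exports MASTER_ADDR/MASTER_PORT and per-process RANK/WORLD_SIZE/LOCAL_RANK.)

```python
# Generic multi-node initialization sketch with torch.distributed; not
# ColBERT's utils/distributed.py, just the standard env:// recipe.
import os
import torch
import torch.distributed as dist

def init_distributed():
    # torchrun / torch.distributed.launch export these per process:
    #   RANK       = global rank across all nodes (0 .. WORLD_SIZE-1)
    #   WORLD_SIZE = total number of processes (nodes * gpus_per_node)
    #   LOCAL_RANK = rank within this node, used to pick the GPU
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://",
                            rank=rank, world_size=world_size)
    return rank, world_size, local_rank
```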