Open liudan111 opened 1 year ago
Hmm do you have the traceback? This isn't enough info to help understand what went wrong unfortunately
also fwiw >95% of use cases don't need training; have you considered using the ColBERTv2 checkpoint we released, as shown in the intro.ipynb notebook linked from the README?
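For reference, a minimal sketch of what using that released checkpoint looks like with the Indexer/Searcher API; the checkpoint path, collection file, index name, and query string below are placeholders, and intro.ipynb has the authoritative version:

```python
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig, ColBERTConfig

if __name__ == "__main__":
    with Run().context(RunConfig(nranks=1, experiment="notebook")):
        config = ColBERTConfig(doc_maxlen=256, nbits=2)

        # Index a collection with the released ColBERTv2 checkpoint (no training needed).
        indexer = Indexer(checkpoint="/path/to/colbertv2.0", config=config)
        indexer.index(name="my_index.nbits=2", collection="/path/to/collection.tsv", overwrite=True)

        # Search the index directly with the same checkpoint.
        searcher = Searcher(index="my_index.nbits=2", config=config)
        pids, ranks, scores = searcher.search("example query text", k=5)
        for pid, rank, score in zip(pids, ranks, scores):
            print(rank, score, pid)
```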
> Hmm do you have the traceback? This isn't enough info to help understand what went wrong unfortunately
I'm trying to train on my data on multiple nodes and multiple GPUs. Can I pass parameters like below?

```bash
OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK} \
MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4)) \
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1) \
WORLD_SIZE=${SLURM_NTASKS} \
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 \
    -m colbert.train --amp --query_maxlen 128 --doc_maxlen 256 --mask-punctuation \
    --bsize 32 --accum 1 --triples ~/ColBERT/data/triples.train.small.tsv \
    --root model/colesm_cosine_vh --experiment psg --similarity cosine
```
@liudan111 I don't think we support multi-node unfortunately. Also, with batch size 32 you won't benefit from having 8 GPUs anyway. Four GPUs will be very fast!
> @liudan111 I don't think we support multi-node unfortunately. Also, with batch size 32 you won't benefit from having 8 GPUs anyway. Four GPUs will be very fast!
Thanks for your reply. So for multiple nodes and GPUs, should I set a larger batch size in order to use all the nodes and GPUs?
Batch size 32 is fine. But please just use one node, with 4 gpus. That will work well
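For reference, a single-node, 4-GPU version of the command above might look roughly like this; it is a sketch that just reuses the flags already shown, and CUDA_VISIBLE_DEVICES is an assumption about which GPUs to use:

```bash
CUDA_VISIBLE_DEVICES="0,1,2,3" \
python -m torch.distributed.launch --nproc_per_node=4 \
    -m colbert.train --amp --query_maxlen 128 --doc_maxlen 256 --mask-punctuation \
    --bsize 32 --accum 1 --triples ~/ColBERT/data/triples.train.small.tsv \
    --root model/colesm_cosine_vh --experiment psg --similarity cosine
```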
> Batch size 32 is fine. But please just use one node, with 4 gpus. That will work well
That's great. I am wondering if I can adapt it to multiple nodes, as I have quite a large dataset and training is very slow on one node.
Hmm, how large is the data? I’ve never seen a meaningful benefit from training on more than 10M triples. With 4 GPUs, you can train on that many in a couple of days, so it’s pretty fast.
We don’t support multi-node training unfortunately
I have 45M triples, so I am trying to change utils/distributed.py to make ColBERT work on 2 nodes with 8 GPUs. Does that mean training will not speed up, even though I use two nodes with 4 GPUs on each node?
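For context, a minimal illustration of the kind of multi-node setup change being described; this is not the actual contents of utils/distributed.py, and it assumes the launcher (torch.distributed.launch / SLURM) exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT in the environment:

```python
import os
import torch
import torch.distributed as dist

def init_distributed():
    # With multiple nodes, the global rank and world size must come from the
    # launcher's environment rather than being derived from the local GPU count.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    is_distributed = world_size > 1
    if is_distributed:
        # Each process drives one GPU on its own node.
        torch.cuda.set_device(local_rank)
        # MASTER_ADDR / MASTER_PORT must point at node 0 and be reachable from all nodes.
        dist.init_process_group(backend="nccl", init_method="env://",
                                rank=rank, world_size=world_size)
    return rank, world_size, is_distributed
```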
```
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
colbert.train FAILED
Failures:
[1]:
  time       : 2023-06-13_12:25:13
  host       : tu-c0r1n00.bullx
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 182503)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time       : 2023-06-13_12:25:13
  host       : tu-c0r1n00.bullx
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 182502)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
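As an aside, the missing per-rank traceback (error_file: <N/A>) in output like this can usually be surfaced by decorating the program's entry point with the record decorator, per the PyTorch elastic errors page linked in the message above. A rough sketch (the main function below is a placeholder for whatever entry point actually runs the training):

```python
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # Placeholder: call into the colbert.train entry point here so that any
    # exception it raises is captured and written to the per-rank error file.
    pass

if __name__ == "__main__":
    main()
```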