Closed marceljahnke closed 1 year ago
Hi @marceljahnke, tevatron uses DDP (DistributedDataParallel) for multi-GPU training, so it has to be started through the distributed launcher, i.e. run with
python -m torch.distributed.launch --nproc_per_node=4 -m tevatron.driver.train
For an example, see https://github.com/texttron/tevatron/tree/main/examples/dpr#2-train
Hi @MXueguang, thank you for the answer. It worked.
Unfortunately, another error occurred during the search step:
python -m tevatron.faiss_retriever.reducer --score_dir ranking/intermediate --query encoding/qry.pt --save_ranking_to ranking/rank.txt
0%| | 0/10 [00:00<?, ?it/s]
Initializing Heap. Assuming 6980 queries.
Traceback (most recent call last):
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/faiss_retriever/reducer.py", line 48, in <module>
    main()
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/faiss_retriever/reducer.py", line 41, in main
    corpus_scores, corpus_indices = combine_faiss_results(map(torch.load, tqdm(partitions)))
  File "/dstore/home/jahnke/master-dense-retrieval/tevatron/src/tevatron/faiss_retriever/reducer.py", line 16, in combine_faiss_results
    rh.add_result(-scores, indices)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/faiss/__init__.py", line 1622, in add_result
    swig_ptr(I), self.k)
  File "/home/jahnke/miniconda3/envs/tevatron/lib/python3.8/site-packages/faiss/swigfaiss.py", line 5700, in swig_ptr
    return _swigfaiss.swig_ptr(a)
ValueError: did not recognize array type
0%| | 0/10 [00:00<?, ?it/s]
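A possible workaround, offered only as a sketch: the traceback shows `swig_ptr` rejecting one of the arrays passed to `rh.add_result(-scores, indices)`. One plausible cause, not confirmed anywhere in this thread, is that the loaded scores or indices have a dtype the faiss wrapper does not handle. The helper below (`to_faiss_array` is a hypothetical name, not part of tevatron or faiss) casts the inputs to C-contiguous float32 / int64 NumPy arrays, the layout a faiss result heap usually works with.

```python
import numpy as np


def to_faiss_array(x, dtype):
    """Return x as a C-contiguous NumPy array of the requested dtype.

    Hypothetical helper for illustration; not part of tevatron or faiss.
    """
    # If x looks like a torch.Tensor, move it to host memory first.
    if hasattr(x, "detach"):
        x = x.detach().cpu().numpy()
    # ascontiguousarray copies only when a cast or re-layout is required.
    return np.ascontiguousarray(x, dtype=dtype)


# Small demo with made-up data: int32 indices and a non-contiguous score matrix.
demo_scores = np.arange(6, dtype=np.float64).reshape(2, 3).T
demo_indices = np.array([[0, 1], [2, 3], [4, 5]], dtype=np.int32)

scores = to_faiss_array(demo_scores, np.float32)
indices = to_faiss_array(demo_indices, np.int64)
print(scores.dtype, indices.dtype, scores.flags["C_CONTIGUOUS"])
```

Under that assumption, the call in `combine_faiss_results` could be changed to `rh.add_result(-to_faiss_array(scores, np.float32), to_faiss_array(indices, np.int64))`. If the error persists, printing `scores.dtype` and `indices.dtype` right after `torch.load` should show what is actually stored in the intermediate files.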
@MXueguang This is the reducer problem. Have we decided how to deal with https://github.com/texttron/tevatron/pull/13?
Bug
When following the MS MARCO passage ranking example, a RuntimeError is raised when multiple GPUs are used for training.
Starting the training via
produces:
Note: When running the training with the above command and only one visible GPU, the training starts and runs correctly.
Full Error Message
Environment
CUDA Version: 10.1
Operating System: Debian GNU/Linux 10 (buster)
Kernel: Linux 4.19.0-18-amd64
GPUs: 4x GTX 1080Ti 11GB
CPU: Intel E5-2620v4