microsoft / ANCE

A novel embedding training algorithm that leverages ANN search and achieves SOTA retrieval on the TREC DL 2019 and OpenQA benchmarks
MIT License

CUDA nccl library issue #15

Open francomarianardini opened 3 years ago

francomarianardini commented 3 years ago

Hello,

I cloned this repository because I want to run the run_inference.sh script. I followed the steps listed in the README. However, when I run run_inference.sh, I get the following error:

RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:155, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

My system has NCCL v2.7.8 correctly installed with the corresponding CUDA toolkit.

What am I missing here?

Thanks in advance for the help.

Best,

Franco Maria
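
One way to narrow down an error like this is sketched below. It is a generic diagnostic, not something taken from the ANCE README. The first line of the trace is a device-side assert, which is often caused by an out-of-range index inside a kernel, so the NCCL failure may be a downstream symptom rather than the root cause. The helper names `debug_env` and `describe_torch_build` are hypothetical. `CUDA_LAUNCH_BLOCKING` and `NCCL_DEBUG` are standard CUDA and NCCL environment variables. The commented-out invocation assumes run_inference.sh is in the current directory, which may not match the repository layout.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with CUDA/NCCL debugging enabled."""
    env = dict(os.environ if base is None else base)
    # Make kernel launches synchronous so the Python traceback points at
    # the operation that actually triggered the device-side assert.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Ask NCCL to log its initialization and any errors in detail.
    env["NCCL_DEBUG"] = "INFO"
    return env


def describe_torch_build():
    """Report the CUDA and NCCL versions that PyTorch itself was built with.

    These can differ from the system-wide installation, because prebuilt
    PyTorch packages typically ship their own CUDA and NCCL libraries.
    """
    info = {"torch": None, "cuda": None, "nccl": None, "gpu_visible": False}
    try:
        import torch
    except Exception:
        # torch is not installed or failed to load; report what we can.
        return info
    info["torch"] = str(torch.__version__)
    info["cuda"] = torch.version.cuda
    info["gpu_visible"] = torch.cuda.is_available()
    try:
        info["nccl"] = torch.cuda.nccl.version()
    except Exception:
        # Builds without NCCL support do not expose a version.
        pass
    return info


if __name__ == "__main__":
    print(describe_torch_build())
    # Assumed invocation; adjust the script path to your checkout:
    # import subprocess
    # subprocess.run(["bash", "run_inference.sh"], env=debug_env(), check=False)
```

If the versions reported here differ from the system-wide NCCL and CUDA installation, the system libraries are probably not the ones in use. Rerunning with the debugging environment should also show which operation raised the assert.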