Closed javier-alvarez closed 2 years ago
Hi @javier-alvarez,
Do you have interactive access to the machine you're running on here? If so, can you show me the results of `ds_report`?
[stderr]Exception: Installed CUDA version 10.2 does not match the version torch was compiled with 11.1, unable to compile cuda/cpp extensions without a matching cuda version.
The above error is thrown as a safety precaution. What DeepSpeed is observing is that `nvcc` is reporting CUDA 10.2, but the installed version of `torch` was compiled with CUDA 11.1. DeepSpeed uses `nvcc` to compile some of our custom C++/CUDA ops at runtime; if the versions of `nvcc` and `torch` don't align, the ops will not run properly.
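As a rough illustration of the mismatch described above, the comparison can be sketched in plain Python. This is not DeepSpeed's actual check: the function names are hypothetical, the sample `nvcc` text is illustrative, and the sketch assumes the torch side is a string like the one in `torch.version.cuda` (for example `"11.1"`).

```python
import re
from typing import Optional, Tuple


def parse_nvcc_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_torch_cuda(version: str) -> Tuple[int, int]:
    """Turn a string such as "11.1" into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def cuda_versions_match(nvcc_output: str, torch_cuda: str) -> bool:
    """True only when nvcc and torch report the same major.minor version."""
    return parse_nvcc_release(nvcc_output) == parse_torch_cuda(torch_cuda)


# Illustrative line in the shape nvcc prints for a 10.2 toolkit.
SAMPLE_NVCC_OUTPUT = "Cuda compilation tools, release 10.2, V10.2.89"

if not cuda_versions_match(SAMPLE_NVCC_OUTPUT, "11.1"):
    print("nvcc and torch disagree on the CUDA version")
```

With the versions from the error message (10.2 from `nvcc`, 11.1 from `torch`), the sketch reports a mismatch, which is the situation the exception guards against.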
We pick up the `nvcc` path from `torch.utils.cpp_extension.CUDA_HOME`; if this path isn't the correct path for your environment, then there might be issues.
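To see which `nvcc` would be picked up, something like the following sketch may help. The helper names are hypothetical, and falling back to the `CUDA_HOME` environment variable when `torch` is not importable is an assumption made only so the snippet runs anywhere; it is not necessarily how DeepSpeed behaves.

```python
import os
from typing import Optional


def resolve_cuda_home() -> Optional[str]:
    """Return the CUDA root torch detected, or the CUDA_HOME env var as a fallback."""
    try:
        from torch.utils.cpp_extension import CUDA_HOME
        return CUDA_HOME
    except ImportError:
        # Assumption for this sketch only: use the environment variable.
        return os.environ.get("CUDA_HOME")


def nvcc_path_from(cuda_home: Optional[str]) -> Optional[str]:
    """Build the expected nvcc location under a CUDA root, or None if unknown."""
    if not cuda_home:
        return None
    return os.path.join(cuda_home, "bin", "nvcc")


if __name__ == "__main__":
    print("CUDA_HOME:", resolve_cuda_home())
    print("nvcc path:", nvcc_path_from(resolve_cuda_home()))
```

If the printed path points at a 10.2 toolkit while `torch` was built against 11.1, that is the mismatch reported in the error above.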
I was able to reproduce the issue with your conda environment. After adding `cudatoolkit-dev=11.1.1` to your conda dependencies, the issue seems to be resolved on my side.
This fixed the issue. I have changed the Azure ML image to:
"mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.1-cudnn8-ubuntu18.04"
It looks like having both CUDA 10.2 and 11.1 does not work.
Discussed in https://github.com/microsoft/DeepSpeed/discussions/1619