Closed CZT0 closed 1 year ago
Hi, I encountered the same issue when using ustc-scc. My environment is very similar to yours: python==3.11.3, pytorch==2.0.1 (py3.11_cuda11.7_cudnn8.5.0_0), transformers==4.29.1, cuda==11.7
Try setting the following two environment variables:
1. export PATH=/usr/local/cuda-12.0/bin:$PATH
2. export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH
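Note that /usr/local/cuda-12.0 is just the path on the original poster's machine; clusters often install the toolkit elsewhere. A minimal sketch of applying and sanity-checking the exports (substitute your own install prefix):

```shell
# Sketch: make a specific CUDA toolkit win over whatever is already on PATH.
# /usr/local/cuda-12.0 is the path from the advice above -- substitute your
# cluster's actual install prefix (e.g. /opt/cuda/11.7.1_515.65.01).
CUDA_HOME=/usr/local/cuda-12.0
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"

# Prepending means this toolkit shadows any CUDA already on PATH:
case "$PATH" in
  "$CUDA_HOME/bin:"*) echo "toolkit is first on PATH" ;;
esac

# If nvcc lives under that prefix, it should now resolve to the matching version:
if command -v nvcc >/dev/null 2>&1; then nvcc --version; fi
```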
I'm also hitting the same exact error. Did anyone find a solution to this?
> Try setting the following two environment variables: 1. export PATH=/usr/local/cuda-12.0/bin:$PATH 2. export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH
I tried setting them, but I still get the error.
> Try setting the following two environment variables: 1. export PATH=/usr/local/cuda-12.0/bin:$PATH 2. export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH
The CUDA installation path on my machine is /opt/cuda, so I checked PATH and LD_LIBRARY_PATH:
>echo $PATH
/opt/cuda/11.7.1_515.65.01/bin:/opt/cuda/11.7.1_515.65.01/nvvm:/home/user/.conda/envs/user/bin:/opt/Anaconda3/2022.05/condabin:/usr/lpp/mmfs/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
>echo $LD_LIBRARY_PATH
/opt/cuda/11.7.1_515.65.01/lib64/stubs:/opt/cuda/11.7.1_515.65.01/lib64
>ls /opt/cuda/11.7.1_515.65.01/lib64/stubs|grep libnvidia
libnvidia-ml.so
libnvidia-ml.so.1
I noticed two warnings that may be related to the error:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/home/user/.conda/envs/user/lib/python3.11/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
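The warning suggests the stubs directory on LD_LIBRARY_PATH is shadowing the real libnvidia-ml.so that ships with the display driver. A sketch of stripping any ".../stubs" entries so the loader falls back to the driver's copy (the paths mirror the log above; adjust for your own cluster):

```shell
# Sketch: rebuild LD_LIBRARY_PATH without any directories ending in /stubs,
# so the driver-installed libnvidia-ml.so (usually under /usr/lib64) is used
# instead of the build-time stub.
strip_stubs() {
  old_IFS=$IFS; IFS=:
  new=
  for dir in $1; do
    case "$dir" in
      */stubs) ;;                     # skip stub directories
      *) new="${new:+$new:}$dir" ;;   # keep everything else
    esac
  done
  IFS=$old_IFS
  printf '%s\n' "$new"
}

LD_LIBRARY_PATH=$(strip_stubs "$LD_LIBRARY_PATH")
export LD_LIBRARY_PATH
```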
> Try setting the following two environment variables: 1. export PATH=/usr/local/cuda-12.0/bin:$PATH 2. export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH
I tried setting them, but I still get the error.
Hi everyone,
I have managed to resolve the issue I was experiencing when trying to initialize TorchBackend with DeepSpeed using the nccl backend, where the operation was failing with return code = -11.
The solution was to disable the InfiniBand (IB) transport in NCCL by setting the NCCL_IB_DISABLE environment variable to 1. To do this, simply run the following command before executing your script:
export NCCL_IB_DISABLE=1
This command tells NCCL not to use InfiniBand as the transport method for communication. Disabling InfiniBand might be necessary in certain system configurations or when InfiniBand is not available.
After applying this solution, I was able to successfully run my script without any issues. If you are encountering a similar problem, I hope this solution helps you as well.
Best regards
BTW, using only
export NCCL_IB_DISABLE=1
did not work in my situation (2x8 V100 on a cluster machine); it cannot fully disable IB. So I used the following commands:
export NCCL_IB_DISABLE=1
export NCCL_IBEXT_DISABLE=1
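On a multi-node run, exports on the launch node are not automatically propagated to the other nodes. DeepSpeed's launcher can read a .deepspeed_env file (one NAME=VALUE per line) and forward those variables to every worker; a sketch, with the filename and location per the DeepSpeed docs (verify against your installed version):

```shell
# Sketch: persist the NCCL flags so DeepSpeed's multi-node launcher forwards
# them to all workers, not just the launch node.
cat > "$HOME/.deepspeed_env" <<'EOF'
NCCL_IB_DISABLE=1
NCCL_IBEXT_DISABLE=1
EOF
```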
Hello,
I am experiencing an issue when trying to initialize TorchBackend with DeepSpeed using the nccl backend. The operation fails with return code = -11.
Here are the steps to reproduce the issue:
I have successfully run this on other machines without any issues, but for some reason, it's not working in my current environment.
Environment details:
Here are the relevant parts of the log: