microsoft / DeepSpeedExamples

Example models using DeepSpeed

Initializing TorchBackend in DeepSpeed with backend nccl exits with return code = -11 #542

Closed CZT0 closed 1 year ago

CZT0 commented 1 year ago

Hello,

I am experiencing an issue when trying to initialize TorchBackend with DeepSpeed using the nccl backend. The operation fails with return code = -11 (i.e., the worker processes are killed by signal 11, a segmentation fault).

Here are the steps to reproduce the issue:

  1. bash /home/sds/ustcllm/jellow/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_node/run_13b.sh

I have successfully run this on other machines without any issues, but for some reason, it's not working in my current environment.

Environment details:

Here are the relevant parts of the log:

[2023-05-22 13:46:52,710] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1,2,3: setting --include=localhost:0,1,2,3
[2023-05-22 13:46:52,744] [INFO] [runner.py:541:main] cmd = /home/sds/ustcllm/jellow/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /home/sds/ustcllm/jellow/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets --data_split 2,4,4 --model_name_or_path facebook/opt-13b --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --max_seq_len 512 --learning_rate 1e-4 --weight_decay 0. --num_train_epochs 16 --gradient_accumulation_steps 1 --lr_scheduler_type cosine --num_warmup_steps 0 --seed 1234 --gradient_checkpointing --zero_stage 3 --lora_dim 128 --lora_module_name decoder.layers. --deepspeed --output_dir /home/sds/ustcllm/jellow/DeepSpeedExamples/applications/DeepSpeed-Chat/run/13B/step1
[2023-05-22 13:46:55,272] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-05-22 13:46:55,272] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-05-22 13:46:55,272] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-05-22 13:46:55,272] [INFO] [launch.py:247:main] dist_world_size=4
[2023-05-22 13:46:55,272] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-05-22 13:46:59,328] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-05-22 13:47:02,284] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2430401
[2023-05-22 13:47:02,300] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2430402
[2023-05-22 13:47:02,301] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2430403
[2023-05-22 13:47:02,406] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2430404
[2023-05-22 13:47:02,513] [ERROR] [launch.py:434:sigkill_handler] ['/home/sds/ustcllm/jellow/venv/bin/python', '-u', '/home/sds/ustcllm/jellow/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py', '--local_rank=3', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-13b', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--max_seq_len', '512', '--learning_rate', '1e-4', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--gradient_checkpointing', '--zero_stage', '3', '--lora_dim', '128', '--lora_module_name', 'decoder.layers.', '--deepspeed', '--output_dir', '/home/sds/ustcllm/jellow/DeepSpeedExamples/applications/DeepSpeed-Chat/run/13B/step1'] exits with return code = -11
Jingkustc commented 1 year ago

Hi, I encountered the same issue when using ustc-scc. My environment is very similar to yours: python==3.11.3, pytorch==2.0.1 (py3.11_cuda11.7_cudnn8.5.0_0), transformers==4.29.1, cuda==11.7

xiaotingyun commented 1 year ago

Try setting the following two environment variables:

export PATH=/usr/local/cuda-12.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH
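
After setting these, a quick sanity check that the exported toolkit is the one actually being picked up (the cuda-12.0 paths above are just an example; substitute your actual installation path):

which nvcc             # should point into the CUDA bin directory you exported
nvcc --version         # toolkit version the shell now resolves
echo $LD_LIBRARY_PATH  # the CUDA lib64 directory should appear first
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"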

abdulvirta commented 1 year ago

I'm also hitting the exact same error. Did anyone find a solution to this?

CZT0 commented 1 year ago

Try setting the following two environment variables:

export PATH=/usr/local/cuda-12.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH

I tried setting them, but it still fails with the same error.

Jingkustc commented 1 year ago

Try setting the following two environment variables:

export PATH=/usr/local/cuda-12.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH

In my case the CUDA installation path is /opt/cuda. I checked PATH and LD_LIBRARY_PATH:

>echo $PATH
/opt/cuda/11.7.1_515.65.01/bin:/opt/cuda/11.7.1_515.65.01/nvvm:/home/user/.conda/envs/user/bin:/opt/Anaconda3/2022.05/condabin:/usr/lpp/mmfs/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

>echo $LD_LIBRARY_PATH
/opt/cuda/11.7.1_515.65.01/lib64/stubs:/opt/cuda/11.7.1_515.65.01/lib64

>ls /opt/cuda/11.7.1_515.65.01/lib64/stubs|grep libnvidia
libnvidia-ml.so
libnvidia-ml.so.1

I noticed two warnings that may be related to the error:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/home/user/.conda/envs/user/lib/python3.11/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
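
A possible fix suggested by that warning (a sketch based on the paths above, untested): remove the stubs directory from LD_LIBRARY_PATH so the loader resolves the driver's real libnvidia-ml.so instead of the stub:

# keep only the real lib64 dir; the driver's libnvidia-ml.so normally
# lives in /usr/lib or /usr/lib64 and is found without any stubs entry
export LD_LIBRARY_PATH=/opt/cuda/11.7.1_515.65.01/lib64
python -c "import torch; print(torch.cuda.is_available())"  # NVML warning should be gone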
xiaotingyun commented 1 year ago

Try setting the following two environment variables:

export PATH=/usr/local/cuda-12.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH

I tried setting them, but it still fails with the same error.

  1. If the process still exits with code -11, then I am not sure why it occurs.
  2. But if the error "RuntimeError: CUDA error: OS call failed or operation not supported on this OS" appears after setting them, the RAM is likely insufficient; try setting "offload_optimizer_device" to none (see the sketch below).
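
For reference, a minimal sketch of what that looks like in the DeepSpeed config (the surrounding keys are illustrative; in DeepSpeed-Chat the ds_config is generated by the training utilities, so adjust wherever your config is built):

"zero_optimization": {
  "stage": 3,
  "offload_optimizer": {
    "device": "none"
  }
}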
CZT0 commented 1 year ago

Hi everyone,

I have managed to resolve the issue I was experiencing when trying to initialize TorchBackend with DeepSpeed using the nccl backend. The operation was failing with return code = -11.

The solution was to disable the InfiniBand (IB) transport in NCCL by setting the NCCL_IB_DISABLE environment variable to 1. To do this, simply run the following command before executing your script:

export NCCL_IB_DISABLE=1

This command tells NCCL not to use InfiniBand as the transport method for communication. Disabling InfiniBand might be necessary in certain system configurations or when InfiniBand is not available.
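
If you want to confirm which transport NCCL actually selects (useful both before and after setting the variable), NCCL's debug logging prints it; a minimal run sketch:

# disable the InfiniBand transport, then run with NCCL debug logging
# to confirm that sockets are used instead of IB
export NCCL_IB_DISABLE=1
export NCCL_DEBUG=INFO
bash training_scripts/single_node/run_13b.sh  # same script as in the original report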

After applying this solution, I was able to successfully run my script without any issues. If you are encountering a similar problem, I hope this solution helps you as well.

Best regards

PoloWitty commented 1 year ago

BTW, using only export NCCL_IB_DISABLE=1 did not work in my situation (2x8 V100 on a cluster machine); it could not fully disable IB. So I used the following commands:

export NCCL_IB_DISABLE=1
export NCCL_IBEXT_DISABLE=1
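
Also note that on multi-node jobs, exports made in your login shell are not necessarily propagated to the other nodes. One way to propagate them with the deepspeed launcher (a sketch; see the DeepSpeed docs on multi-node environment variables) is a .deepspeed_env file:

# .deepspeed_env (in your home or launch directory), one VAR=VALUE per line;
# the deepspeed launcher exports these on every node before starting workers
NCCL_IB_DISABLE=1
NCCL_IBEXT_DISABLE=1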