Closed: yguo33 closed this issue 2 months ago.
@yguo33 I wonder if you run into the same issue with a slurm script as well?
I ran on a single node with 16x A100-40GB and I'm having the same issue:
torch.distributed.DistBackendError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
@HamidShojanazeri, I run with slurm on a compute cluster with 4 nodes (8x A100). I face the same issue. Note that it happens for the 70B model with low_cpu_fsdp; it does not happen for the smaller 7B and 13B models (with low_cpu_fsdp).
Apparently, I needed to export the following NCCL env variable in the slurm submission script:
export NCCL_ASYNC_ERROR_HANDLING=1
This fixed the NCCL socket timeout issue in my case.
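For reference, my submission script looks roughly like the sketch below; the job name, partition, and the exact launch line are illustrative placeholders rather than my real setup. The point is simply that the variable is exported before srun/torchrun so every rank inherits it:

#!/bin/bash
#SBATCH --job-name=llama70b-finetune   # placeholder name
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1            # one torchrun launcher per node
#SBATCH --gpus-per-node=8
#SBATCH --partition=gpu                # placeholder partition

# The fix: export before launching the training processes so all ranks inherit it.
export NCCL_ASYNC_ERROR_HANDLING=1

# Rendezvous endpoint on the first host of the allocation.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=29500

srun torchrun \
  --nnodes=$SLURM_NNODES \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  examples/finetuning.py --enable_fsdp --low_cpu_fsdp --model_name <path-to-70B-model>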
@avanindra thanks for the update, @giaosudau @yguo33 does that work for you too?
@HamidShojanazeri, this still happens to me even after I use export NCCL_ASYNC_ERROR_HANDLING=1
Hi @HamidShojanazeri
I am also seeing this issue. I have tried both export NCCL_ASYNC_ERROR_HANDLING=1
and export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
but I still get the error:
torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Any thoughts?
@tginart can you please share a repro: your command, your env (some specifications), and GPU type?
@yguo33 hello, same issue here. Has your issue been solved?
set "CUDA_DEVICE_MAX_CONNECTIONS" to 32 maybe you need in environment. pls have a try @yguo33 @gonggaohan @tginart
set "CUDA_DEVICE_MAX_CONNECTIONS" to 32 maybe you need in environment. pls have a try @yguo33 @gonggaohan @tginart
When I set CUDA_DEVICE_MAX_CONNECTIONS to 32, it raises a new error:
RuntimeError: Using sequence parallelism requires setting the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 1
@tginart please let us know if you would still be interested in sharing some more details for us to repro. Thanks!
When fine-tuning the 70B model, I always run into an error while loading the model. Usually, after loading 4 to 10 shards (out of 15 in total), the following error occurs (see the error message below). I'm using two nodes, and on the first GPU of the first node, the memory usage is always a bit lower, as shown in the image below.
Error Message:
Warning: unknown parameter local_rank
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Warning: unknown parameter l
[rank14]:[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
Traceback (most recent call last):
  File "examples/finetuning.py", line 8, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/data/user/llama-recipes/src/llama_recipes/finetuning.py", line 325, in main
    model = FSDP(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 476, in __init__
    _auto_wrap(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
    _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 502, in __init__
    _init_param_handle_from_module(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 587, in _init_param_handle_from_module
    _sync_module_params_and_buffers(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 1068, in _sync_module_params_and_buffers
    _sync_params_and_buffers(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/utils.py", line 303, in _sync_params_and_buffers
    dist._broadcast_coalesced(
torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
GPU usage: every time, the first GPU of the master node uses only about 3 MiB of memory until the crash.
Env
cuda-python 11.7.0+0.g95a2041.dirty
cupy-cuda118 11.0.0
dask-cuda 22.10.0a0+23.g62a1ee8
nvidia-dali-cuda110 1.20.0
pytorch-quantization 2.1.2
pytorch-triton 2.1.0+6e4932cda8
torch 2.2.0.dev20231116+cu118
torch-tensorrt 1.3.0a0
torchaudio 2.2.0.dev20231116+cu118
torchdata 0.6.1
torchtext 0.15.2+cpu
torchvision 0.17.0.dev20231116+cu118
transformers 4.35.0
Training script
export NCCL_IB_HCA=mlx5
export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=bond0
export NCCL_DEBUG=INFO
...
cd /llama-recipes
torchrun --nproc_per_node=${KUBERNETES_CONTAINER_RESOURCE_GPU} \
  --master_addr=${MASTER_ADDR} \
  --master_port=${MASTER_PORT} \
  --nnodes=${WORLD_SIZE} \
  --node_rank=${RANK} \
  examples/finetuning.py \
  --enable_fsdp \
  --low_cpu_fsdp \
  --fsdp_config.pure_bf16 \
  --model_name /airoboros-l2-70b-2.1 \
  --batch_size_training 1 \
  --dist_checkpoint_root_folder /checkpoints \
  --dist_checkpoint_folder fine-tuned \
  --dataset "alpaca_dataset" 2>&1 | tee t44_lr.log
Has anyone else encountered a similar problem? Do you know what might be causing this? Thanks.