meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods to cover single/multi-node GPUs. Supports default & custom datasets for applications such as summarization and Q&A. Supports a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Demo apps to showcase Meta Llama for WhatsApp & Messenger.

NCCL communicator error: Socket Timeout when finetuning 70B model on 2 * (8* A100(80G)) #303

Closed. yguo33 closed this issue 2 months ago.

yguo33 commented 12 months ago

When fine-tuning the 70B model, I always run into an error while loading the model. Usually, after loading 4 to 10 shards (15 shards in total), the following error occurs (see Error Message). I'm using two nodes, and on the first GPU of the first node, the memory usage is always a bit lower, as shown in the image below.

Error Message:

Warning: unknown parameter local_rank
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
Warning: unknown parameter local_rank
[rank14]:[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
Traceback (most recent call last):
  File "examples/finetuning.py", line 8, in <module>
    fire.Fire(main)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/data/user/llama-recipes/src/llama_recipes/finetuning.py", line 325, in main
    model = FSDP(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 476, in __init__
    _auto_wrap(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
    _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 502, in __init__
    _init_param_handle_from_module(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 587, in _init_param_handle_from_module
    _sync_module_params_and_buffers(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/_init_utils.py", line 1068, in _sync_module_params_and_buffers
    _sync_params_and_buffers(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/utils.py", line 303, in _sync_params_and_buffers
    dist._broadcast_coalesced(
torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout

GPU usage: every time, the first GPU of the master node shows only about 3M of memory in use until the crash (screenshot omitted).

Env

cuda-python 11.7.0+0.g95a2041.dirty
cupy-cuda118 11.0.0
dask-cuda 22.10.0a0+23.g62a1ee8
nvidia-dali-cuda110 1.20.0
pytorch-quantization 2.1.2
pytorch-triton 2.1.0+6e4932cda8
torch 2.2.0.dev20231116+cu118
torch-tensorrt 1.3.0a0
torchaudio 2.2.0.dev20231116+cu118
torchdata 0.6.1
torchtext 0.15.2+cpu
torchvision 0.17.0.dev20231116+cu118
transformers 4.35.0

Training script

export NCCL_IB_HCA=mlx5
export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=bond0
export NCCL_DEBUG=INFO
...
cd /llama-recipes

torchrun --nproc_per_node=${KUBERNETES_CONTAINER_RESOURCE_GPU} \
  --master_addr=${MASTER_ADDR} \
  --master_port=${MASTER_PORT} \
  --nnodes=${WORLD_SIZE} \
  --node_rank=${RANK} \
  examples/finetuning.py \
  --enable_fsdp \
  --low_cpu_fsdp \
  --fsdp_config.pure_bf16 \
  --model_name /airoboros-l2-70b-2.1 \
  --batch_size_training 1 \
  --dist_checkpoint_root_folder /checkpoints \
  --dist_checkpoint_folder fine-tuned \
  --dataset "alpaca_dataset" 2>&1 | tee t44_lr.log

Has anyone else encountered a similar problem? Do you know what might be causing this? Thanks.

HamidShojanazeri commented 11 months ago

@yguo33 I wonder if you run into the same issue with a slurm script as well?

giaosudau commented 11 months ago

I ran on a single node with 16x A100-40GB and hit the same issue:

torch.distributed.DistBackendError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout

avanindra commented 11 months ago

@HamidShojanazeri, I run with Slurm on a compute cluster with 4 nodes (8x A100) and face the same issue. Note that it happens for the 70B model with low_cpu_fsdp; it does not happen for the smaller 7B and 13B models (with low_cpu_fsdp).

avanindra commented 11 months ago

Apparently, I needed to export the following NCCL env variable in the Slurm submission script:

export NCCL_ASYNC_ERROR_HANDLING=1

This fixed the NCCL socket timeout issue in my case.
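
For reference, a minimal sketch of how that export might sit alongside the NCCL settings from the training script above. The placement is an assumption, not a verified configuration; the TORCH_-prefixed variable name is the one mentioned later in this thread for newer PyTorch builds.

# Hypothetical excerpt of a Slurm submission script; only the async-error-handling
# flags are new relative to the original training script.
export NCCL_ASYNC_ERROR_HANDLING=1         # variable reported to fix the timeout in this thread
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1   # assumption: newer PyTorch builds read the TORCH_-prefixed name
export NCCL_SOCKET_IFNAME=bond0            # keep the interface selection from the original script
export NCCL_DEBUG=INFO

These exports would go before the torchrun launch line in the submission script.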

HamidShojanazeri commented 11 months ago

@avanindra thanks for the update. @giaosudau @yguo33 does that work for you too?

xuefeicao commented 10 months ago

@HamidShojanazeri, this still happens to me even after I use export NCCL_ASYNC_ERROR_HANDLING=1.

tginart commented 7 months ago

Hi @HamidShojanazeri

I am also seeing this issue. I have tried both export NCCL_ASYNC_ERROR_HANDLING=1 and export TORCH_NCCL_ASYNC_ERROR_HANDLING=1 but I still get the error:

torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout

Any thoughts?

HamidShojanazeri commented 6 months ago

@tginart can you please share a repro: your command, your env (some specifications), and GPU type?
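
For anyone assembling such a repro, a hedged suggestion: capture the NCCL and torch.distributed debug output along with the command and environment, since the extra detail shows where communicator setup stalls. NCCL_DEBUG and the detailed torch dist debug mode already appear in the original poster's script and log; NCCL_DEBUG_SUBSYS is an addition not mentioned in this thread.

# Debug settings for a more informative repro (sketch; adjust to your launcher).
export NCCL_DEBUG=INFO                  # already present in the original training script
export NCCL_DEBUG_SUBSYS=INIT,NET       # assumption: restrict NCCL logging to init/network phases, where the timeout occurs
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # matches "torch dist debug set to detail" in the original log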

gonggaohan commented 6 months ago

@yguo33 hello, I have the same issue. Has yours been solved?

congcongke commented 3 months ago

set "CUDA_DEVICE_MAX_CONNECTIONS" to 32 maybe you need in environment. pls have a try @yguo33 @gonggaohan @tginart

terminator123 commented 2 months ago

set "CUDA_DEVICE_MAX_CONNECTIONS" to 32 maybe you need in environment. pls have a try @yguo33 @gonggaohan @tginart

When I set CUDA_DEVICE_MAX_CONNECTIONS to 32 as suggested, it raises a new error: RuntimeError: Using sequence parallelism requires setting the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 1.

init27 commented 2 months ago

@tginart please let us know if you would still be interested in sharing some more details for us to repro. Thanks!