huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Communication/NCCL failures training FSDP in multi-node environment with SLURM #2457

Closed: jpgard closed this issue 7 months ago

jpgard commented 9 months ago

System Info

Output of `accelerate env` (note: as shown below, this prints the DEFAULT accelerate config, not the exact config used for this job):
- `Accelerate` version: 0.27.2
- Platform: Linux-5.15.0-1037-aws-x86_64-with-glibc2.17
- Python version: 3.8.18
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.2.0+cu121 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 123.82 GB
- `Accelerate` default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: FSDP
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 2
    - main_process_ip: 
    - main_process_port: 1234
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': 'LlamaDecoderLayer', 'fsdp_use_orig_params': False}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []

Reproduction

I'm attempting multi-node training with the SLURM scheduler, launching the job on 2 nodes with 8 GPUs each. My training script runs fine in a single-node environment with FSDP, and it also starts fine in the multi-node setting -- until actual communication between the nodes is required.

Once the script reaches the point where multi-node training is actually initialized, the processes appear unable to communicate across nodes. I can see logging output from all 16 processes, the data is loaded, etc., but the script then fails at accelerator.prepare().

Specifically I see the stack trace containing these lines (complete stack trace is below):


torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 

Note that it is possible I have misconfigured the accelerate config or the SLURM settings (tasks/node counts, etc.), but based on the example here, with the corresponding FSDP config here, the setup looks correct to me.

Any thoughts would be appreciated. I've tried lots of different configurations and have also tinkered with the environment to make sure the versions of PyTorch/NCCL/Accelerate are all compatible.
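For reference, this is roughly the kind of check I've been running inside the conda env on each node to confirm the versions line up (a minimal sketch; the env name is the one used in the sbatch script below):

conda activate rtfm
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"
python -c "import accelerate; print(accelerate.__version__)"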

Contents of fsdp_config_base.yaml I am using:

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
mixed_precision: bf16
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
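
(The num_machines, machine_rank, and main-process values in this file are overridden per node at launch time by CLI flags; a minimal sketch of the effective launch command, with the values taken from the sbatch script below:)

accelerate launch \
    --config_file fsdp_config_base.yaml \
    --num_processes 16 \
    --num_machines 2 \
    --machine_rank $SLURM_PROCID \
    --main_process_ip $MASTER_ADDR \
    --main_process_port 6999 \
    scripts/train.py --bf16 True --use_amp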

Relevant chunks of the sbatch script I am launching the job with:

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=a40x
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --time=04-00:00:00
#SBATCH --account=nextgends
#SBATCH --chdir=/admin/home-jpgard/rtfm
#SBATCH --output=/admin/home-jpgard/rtfm/slurm-out/%j.out
#SBATCH --err=/admin/home-jpgard/rtfm/slurm-out/%j.err
#SBATCH --exclude=ip-10-0-201-106,ip-10-0-202-154

################# code block adapted from https://gist.github.com/pacman100/1cb1f17b2f1b3139a63b764263e70b25
set -x -e

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_DEBUG=INFO

GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)

# Parse and expand SLURM_JOB_NODELIST when SLURM uses 'bracketed' node notation
# (a simpler scontrol-based alternative is sketched after this script)
expand_nodes() {
    # The input is something like "ip-10-0-231-[1,86]"
    local nodelist=$1

    # Replace '[' and ']' with space and split the string
    local base=$(echo $nodelist | sed -E 's/\[([0-9]+),([0-9]+)\]/ \1 \2 /')

    # Read into array
    read -a parts <<< "$base"

    # Check if we have three parts: prefix, start, end
    if [ ${#parts[@]} -eq 3 ]; then
        local prefix=${parts[0]}
        local start=${parts[1]}
        local end=${parts[2]}

        # Generate sequence
        for i in $(seq $start $end); do
            echo "${prefix}${i}"
            return # Return after first IP to mimic head node behavior
        done
    else
        # If the format does not include a range, just echo the input
        echo $nodelist
    fi
}

# Extract the first node name from SLURM_JOB_NODELIST
# This assumes the format "node-list: ip-10-0-209-157,ip-10-0-231-1" and extracts the first node name
echo "SLURM_JOB_NODELIST is $SLURM_JOB_NODELIST"
node_name=$(echo $SLURM_JOB_NODELIST | sed 's/node-list: //' | cut -d, -f1)

# Now, resolve this node name to an IP address
# Using getent ahosts (You can also use nslookup if getent does not work as expected)
MASTER_ADDR=$(getent ahosts $node_name | head -n 1 | awk '{print $1}')

# Check if we got an IP
if [ ! -z "$MASTER_ADDR" ]; then
    echo "Head node IP: $MASTER_ADDR"
else
    echo "Failed to resolve head node IP address"
    # Extract the first node name from SLURM_JOB_NODELIST and expand if needed
    node_name=$(expand_nodes $SLURM_JOB_NODELIST)
    # Now, resolve this node name to an IP address using getent ahosts
    MASTER_ADDR=$(getent ahosts $node_name | head -n 1 | awk '{print $1}')
    echo "Head node IP after parsing: $MASTER_ADDR"
fi

MASTER_PORT=6999

# OTHER LAUNCHERS CAN BE USED HERE
export LAUNCHER="/admin/home-jpgard/miniconda3/envs/rtfm/bin/accelerate launch \
    --config_file /admin/home-jpgard/rtfm/fsdp_config_base.yaml \
    --num_processes $NUM_PROCESSES \
    --main_process_ip $MASTER_ADDR \
    --num_machines $NNODES \
    --main_process_port $MASTER_PORT \
    --machine_rank \$SLURM_PROCID \
    "

echo "SLURM_JOB_ID is ${SLURM_JOB_ID}"

echo 'activating conda environment'
source /admin/home-jpgard/.bashrc
source /admin/home-jpgard/miniconda3/etc/profile.d/conda.sh
which conda
conda activate rtfm
which python

export PROGRAM="\
  scripts/train.py \
  --more-args-here
  --bf16 True \
  --use_amp \
"

export CMD="$LAUNCHER $PROGRAM"
echo "about to run ${CMD}"
/opt/slurm/bin/srun --jobid $SLURM_JOBID /usr/bin/bash -c "$CMD"
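
As an aside on the node-list parsing above: a simpler sketch that avoids hand-parsing the bracketed notation (assuming scontrol is available inside the job environment) would be:

# scontrol expands compressed nodelists like "ip-10-0-231-[1,86]" into one hostname per line
node_name=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_ADDR=$(getent ahosts "$node_name" | head -n 1 | awk '{print $1}')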

Full stack trace:

Traceback (most recent call last):
  File "scripts/train.py", line 582, in <module>
    main(
  File "scripts/train.py", line 206, in main
    model = accelerator.prepare(model)
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in prepare
    result = tuple(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1229, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1105, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1387, in prepare_model
    model = FSDP(model, **kwargs)
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
    _auto_wrap(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
    _recursive_wrap(**recursive_wrap_kwargs, **root_kwargs)  # type: ignore[arg-type]
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
    _init_param_handle_from_module(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 587, in _init_param_handle_from_module
    _sync_module_params_and_buffers(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 1068, in _sync_module_params_and_buffers
    _sync_params_and_buffers(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/utils.py", line 303, in _sync_params_and_buffers
    dist._broadcast_coalesced(
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
socketStartConnect: Connect to fe80::a849:bbff:fe73:19bd%veth8299cd8<54785> failed : Software caused connection abort
[The same traceback is repeated by a second failing rank on the other node, followed by an empty `Last error:` field.]

/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/utils/launch.py:192: FutureWarning: `fsdp_backward_prefetch_policy` is deprecated and will be removed in version 0.27.0 of 🤗 Accelerate. Use `fsdp_backward_prefetch` instead
  warnings.warn(
[2024-02-16 02:42:00,874] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025500 closing signal SIGTERM
[2024-02-16 02:42:00,875] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025502 closing signal SIGTERM
[2024-02-16 02:42:00,875] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025503 closing signal SIGTERM
[2024-02-16 02:42:00,875] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025504 closing signal SIGTERM
[2024-02-16 02:42:00,876] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025505 closing signal SIGTERM
[2024-02-16 02:42:00,876] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025506 closing signal SIGTERM
[2024-02-16 02:42:00,876] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025507 closing signal SIGTERM
/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/utils/launch.py:192: FutureWarning: `fsdp_backward_prefetch_policy` is deprecated and will be removed in version 0.27.0 of 🤗 Accelerate. Use `fsdp_backward_prefetch` instead
  warnings.warn(
[2024-02-16 02:42:00,885] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267868 closing signal SIGTERM
[2024-02-16 02:42:00,885] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267870 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267872 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267873 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267874 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267875 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267876 closing signal SIGTERM
[2024-02-16 02:42:03,298] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1267869) of binary: /admin/home-jpgard/miniconda3/envs/rtfm/bin/python
Traceback (most recent call last):
  File "/admin/home-jpgard/miniconda3/envs/rtfm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1010, in launch_command
    multi_gpu_launcher(args)
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
    distrib_run.run(args)
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-16_02:42:00
  host      : ip-10-0-209-157.us-west-2.compute.internal
  rank      : 9 (local_rank: 1)
  exitcode  : 1 (pid: 1267869)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Expected behavior

I expect training to work in the distributed setting just as it does in the single-node setting.
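
One detail that stands out in the full trace above: the failing connection targets a link-local address on a veth interface (veth8299cd8), which suggests NCCL may be selecting a virtual/container network interface rather than the real inter-node interface. A sketch of the kind of interface restriction that is often tried for this class of error (the interface names here are assumptions and should be checked against `ip addr` on the compute nodes):

# exclude loopback, the docker bridge, and veth pairs from NCCL's interface selection
export NCCL_SOCKET_IFNAME=^lo,docker0,veth
# or pin NCCL to the actual inter-node interface explicitly (hypothetical name):
# export NCCL_SOCKET_IFNAME=ens32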

jpgard commented 9 months ago

A couple of additional comments:

muellerzr commented 9 months ago

What kind of GPUs are these?

jpgard commented 9 months ago

They are 40GB A100s, in nodes of 8

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

TuyetHan commented 8 months ago

Hi, I have the same problem. Did you find any way to fix it?

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.