huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

The network connection seems to be abnormal and tries to use IPv6. #2249

Closed tingxueronghua closed 9 months ago

tingxueronghua commented 9 months ago

System Info

accelerate version: 0.23.0
OS: a slightly modified CentOS 7.2.1511
python version: 3.9.7
numpy version: 1.25.0
torch version: 2.0.1
accelerate's configuration (`accelerate env` output):
- `Accelerate` version: 0.23.0
- Platform: slightly modified CentOS 7.2.1511
- Python version: 3.9.7
- Numpy version: 1.25.0
- PyTorch version (GPU?): 2.0.1 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 375.57 GB
- GPU type: Tesla V100-SXM2-32GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: fp16
        - use_cpu: False
        - debug: False
        - num_processes: 16
        - machine_rank: 0
        - num_machines: 2
        - main_process_ip: 9.91.4.251
        - main_process_port: 45459
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'deepspeed_hostfile': 'hostfile', 'deepspeed_multinode_launcher': 'pdsh', 'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
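
For context on the `deepspeed_hostfile: 'hostfile'` entry: as far as I understand, a DeepSpeed hostfile lists one `<host> slots=<num_gpus>` line per node, and the pdsh launcher logs into each listed host over ssh. A minimal sketch (the hosts and slot counts below are just what I assume my file contains) that writes such a file for the two machines:

```python
# Sketch only: write a DeepSpeed-style hostfile with one
# "<host> slots=<num_gpus>" line per node (my understanding of the format).
hosts = {"9.91.4.251": 8, "9.206.63.59": 8}

with open("hostfile", "w") as f:
    for host, slots in hosts.items():
        f.write(f"{host} slots={slots}\n")
```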

Reproduction

I believe this is a network-configuration problem rather than a problem in my training code, so I have not listed my own scripts here.

I have two machines, with IPs 9.91.4.251 (host) and 9.206.63.59. When I use `accelerate launch`, it returns:

[2023-12-14 11:07:40,281] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-12-14 11:07:43,183] [INFO] [runner.py:452:main] Using IP address of /root/cuda11.8.bashrc for node 9.91.4.251
[2023-12-14 11:07:43,187] [INFO] [multinode_runner.py:70:get_cmd] Running on the following workers: 9.91.4.251,9.206.63.59
[2023-12-14 11:07:43,187] [INFO] [runner.py:555:main] cmd = pdsh -S -f 1024 -w 9.91.4.251,9.206.63.59 export PYTHONUNBUFFERED=0; export NCCL_IB_DISABLE=1; export NCCL_IB_CUDA_SUPPORT=0; export PYTHONPATH=/xxx/ultimate_mllm/; export NVM_INC=/usr/local/app/.nvm/versions/node/v14.16.1/include/node; export HOSTNAME=yuc-ebe634-b6b45db8cabb42d4abc61e40ae99a853-worker-0; export NVM_CD_FLAGS=; export K8S_POD_NAME=yuc-ebe634-b6b45db8cabb42d4abc61e40ae99a853-worker-0; export $
UBERNETES_PORT=tcp://10.96.0.1:443; export KUBERNETES_PORT_443_TCP_PORT=443; export SHELL=/bin/bash; export TERM=screen; export HISTSIZE=3000; export CONDA_SHLVL=1; export KUBERNETES_SERVICE_PORT=443; export PROJ_TAG_VER=venus_official_image-v0.2.0.zip; 
export PRAJNA_SSH=true; export KUBERNETES_SERVICE_HOST=10.96.0.1; export SSH_TTY=/dev/pts/0; export LC_ALL=en_US.UTF-8; export NVM_DIR=/usr/local/app/.nvm; export USER=root; export LD_LIBRARY_PATH=:/usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/extras/$
UPTI/lib64/:/usr/local/cuda-11.8/compat/:/usr/local/cuda/compat/:/usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/extras/CUPTI/lib64/:/usr/local/cuda-11.8/compat/:/usr/local/cuda/compat/; export NVIDIA_VISIBLE_DEVICES=GPU-8c2ac87d-b8a6-2e71-a1a5-3a610501$
f41,GPU-286af66e-1fe5-e8dc-8e7f-0c321bef18d9,GPU-76946b52-deda-f519-0e80-0ee4e041463c,GPU-a9841bdb-a50e-99f0-077b-9692f48ce402,GPU-4fcc27bb-0fc1-0540-9594-581ae964157b,GPU-4cfc7961-2ef0-fed0-2b65-eb673c7eac12,GPU-f4cc4d76-e5b5-183a-323b-c1761302ce05,GPU$
e43526cd-7081-cd6a-5439-74dcbfd4515c; export K8S_POD_UID=de896cb1-2c57-48f2-a5cb-161e767a4e18; export CONDA_EXE=/data/miniconda3/bin/conda; export list_workers=list_workers; export NVIDIA_DRIVER_CAPABILITIES=video,compute,utility,graphics; export TMOUT=$
59200; export gpu_min=8; export TMUX=/tmp/tmux-0/default,2110,0; export _CE_CONDA=; export platform_type=venusprivate; export MAIL=/var/spool/mail/root; export PATH=/usr/local/bin:/usr/local/vim9/bin/:/root/.yarn/bin:/root/.config/yarn/global/node_modul$
s/.bin:/usr/local/app/.nvm/versions/node/v14.16.1/bin:/data/miniconda3/bin:/data/miniconda3/condabin:/data/go/bin:/data/jdk1.8.0_144/bin:/usr/local/nginx/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/cuda-11.8/bin:/root/.f$
:/data/apache-maven-3.6.3/bin:/usr/local/bin/light_agent:/usr/local/bin/list_workers:/root/bin; export _=/data/miniconda3/bin/accelerate; export GO_HOME=/data/go; export K8S_POD_IP=9.206.63.59; export CONDA_PREFIX=/data/miniconda3; export PWD=/xxx/ultimate_mll; export JAVA_HOME=/data/jdk1.8.0_144; export TST_HACK_BASH_SESSION_ID=7049743797740; export K8S_IS_GPU_CONTAINER=true; export LANG=zh_CN.UTF-8; export VC_FROM=Venus; export TMUX_PANE=%1; export ENV_CONTAINER_
MIN_RUN_SECONDS=10; export gpu_max=8; export HISTCONTROL=ignoredups; export _CE_M=; export HOME=/root; export M2_HOME=/data/apache-maven-3.6.3; export SHLVL=3; export KUBERNETES_PORT_443_TCP_PROTO=tcp; export KUBERNETES_SERVICE_PORT_HTTPS=443; export USE
R_PREPARE_SCRIPT=; export CONDA_PYTHON_EXE=/data/miniconda3/bin/python; export LOGNAME=root; export PRAJNA_HTTP_SERVER=cluster-state-server-gpu-cq-3:65000; export CVS_RSH=ssh; export CLASSPATH=.:/data/jdk1.8.0_144/lib; export NVM_BIN=/usr/local/app/.nvm/
versions/node/v14.16.1/bin; export CONDA_DEFAULT_ENV=base; export GENERIC_REPO_URL=https://mirrors.tencent.com/repository/generic/shadow_cv_ai/light_training; export KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1; export PRAJNA_HTTP_SERVER2=; export KUBERNETES_P
ORT_443_TCP=tcp://10.96.0.1:443; export TF_CPP_MIN_LOG_LEVEL=1; export CUDA_MODULE_LOADING=LAZY; export ACCELERATE_MIXED_PRECISION=fp16; export ACCELERATE_CONFIG_DS_FIELDS=deepspeed_hostfile,deepspeed_multinode_launcher,gradient_accumulation_steps,offloa
d_optimizer_device,offload_param_device,zero3_init_flag,zero_stage; export ACCELERATE_USE_DEEPSPEED=true; export ACCELERATE_DEEPSPEED_ZERO_STAGE=2; export ACCELERATE_GRADIENT_ACCUMULATION_STEPS=1; export ACCELERATE_DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE=none
; export ACCELERATE_DEEPSPEED_OFFLOAD_PARAM_DEVICE=none; export ACCELERATE_DEEPSPEED_ZERO3_INIT=false;  cd /xxx; /data/miniconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyI5LjkxLjQuMjUx
IjogWzAsIDEsIDIsIDMsIDQsIDUsIDYsIDddLCAiOS4yMDYuNjMuNTkiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --node_rank=%n --master_addr=/root/cuda11.8.bashrc --master_port=45459 --argsxxx --save_steps=5000 --logging_dir=log --report_to=tensorboard
9.206.63.59: /root/cuda11.8.bashrc
9.206.63.59: /root/custom.bashrc
9.91.4.251: /root/cuda11.8.bashrc
9.91.4.251: /root/custom.bashrc
9.206.63.59: Now using node v12.18.3 (npm v6.14.6)
9.91.4.251: Now using node v12.18.3 (npm v6.14.6)
9.206.63.59: Now using node v14.16.1 (npm v6.14.12)
9.91.4.251: Now using node v14.16.1 (npm v6.14.12)
9.206.63.59: [2023-12-14 11:07:45,660] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.91.4.251: [2023-12-14 11:07:45,750] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.206.63.59: [2023-12-14 11:07:46,636] [INFO] [launch.py:138:main] 1 NCCL_IB_DISABLE=1
9.206.63.59: [2023-12-14 11:07:46,636] [INFO] [launch.py:138:main] 1 NCCL_IB_CUDA_SUPPORT=0
9.206.63.59: [2023-12-14 11:07:46,636] [INFO] [launch.py:145:main] WORLD INFO DICT: {'9.91.4.251': [0, 1, 2, 3, 4, 5, 6, 7], '9.206.63.59': [0, 1, 2, 3, 4, 5, 6, 7]}
9.206.63.59: [2023-12-14 11:07:46,636] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=8, node_rank=1
9.206.63.59: [2023-12-14 11:07:46,636] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'9.91.4.251': [0, 1, 2, 3, 4, 5, 6, 7], '9.206.63.59': [8, 9, 10, 11, 12, 13, 14, 15]})
9.206.63.59: [2023-12-14 11:07:46,636] [INFO] [launch.py:163:main] dist_world_size=16
9.206.63.59: [2023-12-14 11:07:46,636] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
9.91.4.251: [2023-12-14 11:07:46,756] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=1
9.91.4.251: [2023-12-14 11:07:46,756] [INFO] [launch.py:138:main] 0 NCCL_IB_CUDA_SUPPORT=0
9.91.4.251: [2023-12-14 11:07:46,757] [INFO] [launch.py:145:main] WORLD INFO DICT: {'9.91.4.251': [0, 1, 2, 3, 4, 5, 6, 7], '9.206.63.59': [0, 1, 2, 3, 4, 5, 6, 7]}
9.91.4.251: [2023-12-14 11:07:46,757] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=8, node_rank=0
9.91.4.251: [2023-12-14 11:07:46,757] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'9.91.4.251': [0, 1, 2, 3, 4, 5, 6, 7], '9.206.63.59': [8, 9, 10, 11, 12, 13, 14, 15]})
9.91.4.251: [2023-12-14 11:07:46,757] [INFO] [launch.py:163:main] dist_world_size=16
9.91.4.251: [2023-12-14 11:07:46,757] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
9.206.63.59: [2023-12-14 11:07:50,632] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.91.4.251: [2023-12-14 11:07:50,772] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.206.63.59: [2023-12-14 11:07:50,771] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.206.63.59: [2023-12-14 11:07:50,771] [INFO] [comm.py:594:init_distributed] cdb=None
9.206.63.59: [2023-12-14 11:07:50,852] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.91.4.251: [2023-12-14 11:07:50,864] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.91.4.251: [2023-12-14 11:07:50,915] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.91.4.251: [2023-12-14 11:07:50,916] [INFO] [comm.py:594:init_distributed] cdb=None
9.206.63.59: [2023-12-14 11:07:50,919] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.206.63.59: [2023-12-14 11:07:50,994] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.206.63.59: [2023-12-14 11:07:50,994] [INFO] [comm.py:594:init_distributed] cdb=None
9.91.4.251: [2023-12-14 11:07:51,010] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.91.4.251: [2023-12-14 11:07:51,010] [INFO] [comm.py:594:init_distributed] cdb=None
9.206.63.59: [2023-12-14 11:07:51,029] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.206.63.59: [2023-12-14 11:07:51,032] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.206.63.59: [2023-12-14 11:07:51,036] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.206.63.59: [2023-12-14 11:07:51,042] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.206.63.59: [2023-12-14 11:07:51,058] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.206.63.59: [2023-12-14 11:07:51,058] [INFO] [comm.py:594:init_distributed] cdb=None
9.206.63.59: [2023-12-14 11:07:51,078] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.91.4.251: [2023-12-14 11:07:51,114] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.91.4.251: [2023-12-14 11:07:51,135] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.206.63.59: [2023-12-14 11:07:51,172] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.206.63.59: [2023-12-14 11:07:51,172] [INFO] [comm.py:594:init_distributed] cdb=None
9.206.63.59: [2023-12-14 11:07:51,174] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.206.63.59: [2023-12-14 11:07:51,174] [INFO] [comm.py:594:init_distributed] cdb=None
9.206.63.59: [2023-12-14 11:07:51,177] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.206.63.59: [2023-12-14 11:07:51,178] [INFO] [comm.py:594:init_distributed] cdb=None
9.206.63.59: [2023-12-14 11:07:51,181] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.206.63.59: [2023-12-14 11:07:51,181] [INFO] [comm.py:594:init_distributed] cdb=None
9.91.4.251: [2023-12-14 11:07:51,212] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.91.4.251: [2023-12-14 11:07:51,215] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.206.63.59: [2023-12-14 11:07:51,219] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.206.63.59: [2023-12-14 11:07:51,220] [INFO] [comm.py:594:init_distributed] cdb=None
9.91.4.251: [2023-12-14 11:07:51,237] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.91.4.251: [2023-12-14 11:07:51,262] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.91.4.251: [2023-12-14 11:07:51,262] [INFO] [comm.py:594:init_distributed] cdb=None
9.91.4.251: [2023-12-14 11:07:51,270] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
9.91.4.251: [2023-12-14 11:07:51,275] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.91.4.251: [2023-12-14 11:07:51,275] [INFO] [comm.py:594:init_distributed] cdb=None
9.91.4.251: [2023-12-14 11:07:51,351] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.91.4.251: [2023-12-14 11:07:51,351] [INFO] [comm.py:594:init_distributed] cdb=None
9.91.4.251: [2023-12-14 11:07:51,354] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.91.4.251: [2023-12-14 11:07:51,354] [INFO] [comm.py:594:init_distributed] cdb=None
9.91.4.251: [2023-12-14 11:07:51,354] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
9.91.4.251: [2023-12-14 11:07:51,380] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.91.4.251: [2023-12-14 11:07:51,380] [INFO] [comm.py:594:init_distributed] cdb=None
9.91.4.251: [2023-12-14 11:07:51,412] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
9.91.4.251: [2023-12-14 11:07:51,412] [INFO] [comm.py:594:init_distributed] cdb=None
9.206.63.59: [W socket.cpp:601] [c10d] The IPv6 network addresses of (/root/cuda11.8.bashrc, 45459) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
9.206.63.59: [W socket.cpp:601] [c10d] The IPv6 network addresses of (/root/cuda11.8.bashrc, 45459) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
9.206.63.59: [W socket.cpp:601] [c10d] The IPv6 network addresses of (/root/cuda11.8.bashrc, 45459) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
9.206.63.59: [W socket.cpp:601] [c10d] The IPv6 network addresses of (/root/cuda11.8.bashrc, 45459) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
9.206.63.59: [W socket.cpp:601] [c10d] The IPv6 network addresses of (/root/cuda11.8.bashrc, 45459) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
9.91.4.251: [W socket.cpp:601] [c10d] The IPv6 network addresses of (/root/cuda11.8.bashrc, 45459) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
9.206.63.59: [W socket.cpp:601] [c10d] The IPv6 network addresses of (/root/cuda11.8.bashrc, 45459) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
9.91.4.251: [W socket.cpp:601] [c10d] The IPv6 network addresses of (/root/cuda11.8.bashrc, 45459) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).
9.91.4.251: [W socket.cpp:601] [c10d] The IPv6 network addresses of (/root/cuda11.8.bashrc, 45459) cannot be retrieved (gai error: -3 - Temporary failure in name resolution).

Then the logs repeat the IPv6-related error.
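
These warnings show c10d trying to resolve `/root/cuda11.8.bashrc`, the value that ended up being passed as `--master_addr`, as if it were a hostname; since it is a file path rather than an address, name resolution is bound to fail. A small sanity check (not from the original run, just an illustration) of why that value cannot work:

```python
import socket

# Illustration: check whether the string used as --master_addr resolves
# to something c10d could actually connect to.
master_addr = "/root/cuda11.8.bashrc"  # taken from the log above
master_port = 45459

try:
    infos = socket.getaddrinfo(master_addr, master_port)
    print("resolved to:", [info[4] for info in infos])
except socket.gaierror as exc:
    # A file path is not a hostname, so this fails, matching the
    # "IPv6 network addresses ... cannot be retrieved" warnings above.
    print("name resolution failed:", exc)
```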

Expected behavior

I am sure this code runs correctly on a single machine.

I know the problem is unlikely to be in the accelerate package itself, but I have no idea how to debug it, let alone fix it. Could you give me some suggestions?

tingxueronghua commented 9 months ago

I noticed that the second line of the log is quite abnormal:

[2023-12-14 11:07:43,183] [INFO] [runner.py:452:main] Using IP address of /root/cuda11.8.bashrc for node 9.91.4.251

But I have no idea where this runner.py is located.
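
If I understand correctly, the runner.py in that log line belongs to DeepSpeed's launcher (deepspeed/launcher/runner.py), not to accelerate. From what I remember, when no explicit master address is supplied it asks the first host for its IP over ssh and takes the first whitespace-separated token of the output; the sketch below is a paraphrase from memory, not the actual source, so details may differ. If the remote login shell prints anything (as my /root/cuda11.8.bashrc apparently does), that text becomes the "address", which would explain the log line above.

```python
import subprocess

# Paraphrased sketch (from memory) of how the DeepSpeed launcher can end up
# picking the master address; the real logic lives in
# deepspeed/launcher/runner.py and may differ in detail.
first_host = "9.91.4.251"
output = subprocess.check_output(["ssh", first_host, "hostname -I"]).decode()

# If the remote shell's init files echo file names such as
# /root/cuda11.8.bashrc, they land in `output` before the real IP,
# and the first token is no longer an IP address.
master_addr = output.split()[0]
print(f"Using IP address of {master_addr} for node {first_host}")
```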

tingxueronghua commented 9 months ago

I have verified that the network issue only occurs when DeepSpeed is used: the model runs successfully when I drop the DeepSpeed ZeRO-2 setting.
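
As far as I can tell, without DeepSpeed accelerate passes the `main_process_ip` from the config directly to the launcher, whereas the DeepSpeed/pdsh path appears to detect the main node's IP over ssh (see the sketch in my previous comment), so a chatty login shell would only break the DeepSpeed case. A quick check (hedged sketch; the exact command the launcher runs is my assumption) of whether a non-interactive ssh login on these nodes prints anything:

```python
import subprocess

# Sketch: a non-interactive ssh session should ideally print nothing.
# Anything shown here comes from shell init files and can confuse
# launchers (pdsh/DeepSpeed) that parse ssh output.
host = "9.206.63.59"
output = subprocess.check_output(["ssh", host, "true"]).decode()

if output.strip():
    print("unexpected output from shell init files:", repr(output))
else:
    print("clean non-interactive login")
```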

SunMarc commented 9 months ago

cc @pacman100

tingxueronghua commented 9 months ago

@pacman100 Is there any more information I should provide? This should not be a problem with the network configuration itself, because I can run my program on multiple nodes without DeepSpeed.

tingxueronghua commented 9 months ago

I am quite confused: I checked the documentation but could not find detailed instructions on running accelerate on multiple nodes with DeepSpeed. I would like to ask whether this functionality can be expected to run reliably.

tingxueronghua commented 9 months ago

Sorry for the noise. This is indeed a network issue, caused by PyTorch:

https://github.com/pytorch/pytorch/issues/74824