huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Multi-node training on 2 A100 machines. #609

Closed Aaryan369 closed 2 years ago

Aaryan369 commented 2 years ago

Hi, I am trying to pretrain a wav2vec2 model on a custom dataset and am trying to run it on multiple Azure A100 virtual machines. Each machine has 8 GPUs (16 GPUs in total). Both machines only have private IPs and are on the same subnet. I have opened port 5000 for communication.

1) This is how I am setting up accelerate config on the first machine (IP: 10.0.0.6):

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2
How many different machines will you use (use more than 1 for multi-node training)? [1]: 2
What is the rank of this machine (from 0 to the number of machines - 1 )? [0]: 0
What is the IP address of the machine that will host the main process? 10.0.0.6
What is the port you will use to communicate with the main process? 5000
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]:
What should be your DeepSpeed's ZeRO optimization stage (0, 1, 2, 3)? [2]: 2
Where to offload optimizer states? [none/cpu/nvme]: cpu
Where to offload parameters? [none/cpu/nvme]: cpu
How many gradient accumulation steps you're passing in your script? [1]: 4
Do you want to use gradient clipping? [yes/NO]: no
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: no
Which Type of launcher do you want to use [0] pdsh, [1] standard, [2] openmpi, [3] mvapich)? [0]: 2
DeepSpeed configures multi-node compute resources with hostfile. Each row is of the format `hostname slots=[num_gpus]`, e.g., `localhost slots=2`; for more information please refer official [documentation](https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node). Please specify the location of hostfile: /job/hostfile
Do you want to specify exclusion filter string? [yes/NO]: no
Do you want to specify inclusion filter string? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:-1
Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]:

For the accelerate config on the second machine, the only thing I change is the rank, which I set to 1.
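
(The answers above correspond to a config file roughly like the following; this is a reconstruction, so the exact keys and values written to disk may differ.)

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_hostfile: /job/hostfile
  deepspeed_multinode_launcher: openmpi
  gradient_accumulation_steps: 4
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0 # 1 on the second machine
main_process_ip: 10.0.0.6
main_process_port: 5000
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: -1
use_cpu: false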

The /job/hostfile contains:

genca1001 slots=8
genca1002 slots=8

genca1001 and genca1002 are the SSH aliases that I defined in ~/.ssh/config.
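
(For context, the entries look roughly like this; the second VM's IP and the key path shown here are placeholders.)

Host genca1001
    HostName 10.0.0.6
    User azureuser
    IdentityFile ~/.ssh/id_rsa

Host genca1002
    HostName 10.0.0.7        # placeholder IP for the second VM
    User azureuser
    IdentityFile ~/.ssh/id_rsa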

After this setup when I run the command accelerate launch w2v2_pretrain.py I get this error:

(w2v2) azureuser@Genc-A100-VM:/media/disk_1/aaryan/w2v2/pretrain/w2v2$ accelerate launch w2v2_pretrain.py
[2022-08-08 15:28:51,480] [INFO] [runner.py:378:main] Using IP address of 10.0.0.6 for node genca1001
[2022-08-08 15:28:51,482] [INFO] [runner.py:457:main] cmd = mpirun -n 16 -hostfile /job/hostfile --mca btl ^openib --mca btl_tcp_if_include eth0 -x UCX_TLS=tcp -x PYTHONPATH=/anaconda/envs/w2v2/bin/python -x CONDA_BACKUP_RANLIB=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-ranlib -x CONDA_SHLVL=2 -x LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018.1.163/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64/ -x CONDA_EXE=/anaconda/bin/conda -x CONDA_BACKUP_CXX_FOR_BUILD=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-c++ -x CONDA_BACKUP_OBJCOPY=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-objcopy -x CONDA_BACKUP_AR=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-ar -x CONDA_BACKUP_AS=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-as -x LANG=C.UTF-8 -x AZURE_EXTENSION_DIR=/opt/az/extensions -x DISPLAY=localhost:11.0 -x JULIA_DEPOT_PATH=/opt/julia/latest/packages/ -x CONDA_BACKUP_host_alias=x86_64-conda-linux-gnu -x NODE_PATH=/usr/lib/node_modules -x CONDA_BACKUP_CC=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-cc -x CONDA_PREFIX=/anaconda/envs/w2v2 -x _CE_M= -x SCALA_HOME=/usr/share/scala -x XDG_SESSION_ID=9117 -x CONDA_BACKUP_STRIP=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-strip -x USER=azureuser -x CONDA_PREFIX_1=/anaconda/envs/py38_default -x PWD=/media/disk_1/aaryan/w2v2/pretrain/w2v2 -x HOME=/home/azureuser -x CONDA_PYTHON_EXE=/anaconda/bin/python -x CONDA_BACKUP_GCC_RANLIB=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-gcc-ranlib -x CPATH=/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/include: -x CUPIT_LIB_PATH=/usr/local/cuda/extras/CUPTI/lib64/ -x XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop -x CONDA_BACKUP_STRINGS=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-strings -x CONDA_BACKUP_CXXFILT=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-c++filt -x CONDA_BACKUP_SIZE=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-size -x CONDA_BACKUP_HOST=x86_64-conda-linux-gnu -x _CE_CONDA= -x NLSPATH=/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/locale/%l_%t/%N -x CUDA_ROOT=/usr/local/cuda -x SPARK_HOME=/dsvm/tools/spark/current -x LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018.1.163/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2018.1.163/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64/: -x CONDA_BACKUP_READELF=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-readelf -x CONDA_BACKUP_CPP=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-cpp -x CONDA_BACKUP_LD=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-ld -x SSH_TTY=/dev/pts/2 -x CONDA_BACKUP_CXX=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-c++ -x MAIL=/var/mail/azureuser -x TERM=xterm -x SHELL=/bin/bash -x CONDA_BACKUP_GPROF=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-gprof -x CONDA_BACKUP_ADDR2LINE=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-addr2line -x CONDA_BACKUP_BUILD=x86_64-conda-linux-gnu -x CONDA_BACKUP_ELFEDIT=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-elfedit -x SHLVL=1 -x CONDA_BACKUP_build_alias=x86_64-conda-linux-gnu -x 
CONDA_BACKUP_CMAKE_PREFIX_PATH=/anaconda/envs/py38_default:/anaconda/envs/py38_default/x86_64-conda-linux-gnu/sysroot/usr -x LOGNAME=azureuser -x DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus -x XDG_RUNTIME_DIR=/run/user/1000 -x CONDA_BACKUP_GXX=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-g++ -x CONDA_BACKUP_GCC_NM=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-gcc-nm -x PYSPARK_PYTHON=/anaconda/envs/py38_default/bin/python -x PATH=/anaconda/envs/w2v2/bin:/anaconda/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/home/azureuser/.dotnet/tools:/dsvm/tools/spark/current/bin -x CONDA_BACKUP_CC_FOR_BUILD=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-cc -x CONDA_BACKUP__CONDA_PYTHON_SYSCONFIGDATA_NAME=_sysconfigdata_x86_64_conda_linux_gnu -x CONDA_BACKUP_GCC=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-gcc -x PKG_CONFIG_PATH=/opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/bin/pkgconfig: -x CONDA_DEFAULT_ENV=w2v2 -x CONDA_BACKUP_CONDA_BUILD_SYSROOT=/anaconda/envs/py38_default/x86_64-conda-linux-gnu/sysroot -x CONDA_BACKUP_OBJDUMP=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-objdump -x CONDA_BACKUP_GCC_AR=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-gcc-ar -x CONDA_BACKUP_NM=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-nm -x CONDA_BACKUP_LD_GOLD=/anaconda/envs/py38_default/bin/x86_64-conda-linux-gnu-ld.gold -x _=/anaconda/envs/w2v2/bin/accelerate -x MIXED_PRECISION=no -x USE_DEEPSPEED=true -x DEEPSPEED_ZERO_STAGE=2 -x GRADIENT_ACCUMULATION_STEPS=4 -x GRADIENT_CLIPPING=none -x DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE=cpu -x DEEPSPEED_OFFLOAD_PARAM_DEVICE=cpu -x DEEPSPEED_ZERO3_INIT=false -x DEEPSPEED_ZERO3_SAVE_16BIT_MODEL=none -x DEEPSPEED_CONFIG_FILE=none -x OLDPWD=/home/azureuser /anaconda/envs/w2v2/bin/python -u w2v2_pretrain.py
/anaconda/envs/w2v2/bin/python: can't open file 'w2v2_pretrain.py': [Errno 2] No such file or directory
/anaconda/envs/w2v2/bin/python: can't open file 'w2v2_pretrain.py': [Errno 2] No such file or directory
/anaconda/envs/w2v2/bin/python: can't open file 'w2v2_pretrain.py': [Errno 2] No such file or directory
/anaconda/envs/w2v2/bin/python: can't open file 'w2v2_pretrain.py': [Errno 2] No such file or directory
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
/anaconda/envs/w2v2/bin/python: can't open file 'w2v2_pretrain.py': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/anaconda/envs/w2v2/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/anaconda/envs/w2v2/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/anaconda/envs/w2v2/lib/python3.8/site-packages/accelerate/commands/launch.py", line 674, in launch_command
    deepspeed_launcher(args)
  File "/anaconda/envs/w2v2/lib/python3.8/site-packages/accelerate/commands/launch.py", line 444, in deepspeed_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['deepspeed', '--no_local_rank', '--hostfile', '/job/hostfile', '--launcher', 'openmpi', '--num_gpus', '-1', 'w2v2_pretrain.py']' returned non-zero exit status 2.

Using the pdsh and mvapich launchers also results in a similar file-not-found error.

The w2v2_pretrain.py file is present in the same directory from which I am running the accelerate launch command. I have also tried passing the complete path to the Python file when calling accelerate launch, but it still results in the same error.

2) I have tried the same process using the standard launcher provided in accelerate config: Which Type of launcher do you want to use [0] pdsh, [1] standard, [2] openmpi, [3] mvapich)? [0]: 1

This gave the error:

[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

After this, the processes on both machines get killed.

I would like to know how I could solve any of these errors. Thanks in advance.

pacman100 commented 2 years ago

Hello @Aaryan369, when using the standard launcher, I hope you are launching the script on both nodes as you would typically do when using torchrun in a multi-node setting. I found this comment mentioning the same error you got when using the standard launcher: https://github.com/huggingface/accelerate/issues/412#issuecomment-1180127412. In the following comment, they also explain in detail how that issue got solved. Can you please try that out and let us know? They ran the below command:

NCCL_IB_GID_INDEX=3 NCCL_DEBUG=INFO accelerate launch script.py
pacman100 commented 2 years ago

For other launchers, could you try a hostfile with the content below? Also, are you able to ssh from one machine to the other (successfully running ssh genca1002 from the genca1001 node)?

localhost slots=8
genca1002 slots=8

After setting up the above hostfile, run the sample code below (sample.py) to check whether the multi-node setup is working:

import os
import torch
import deepspeed
import accelerate

# The imports above simply verify that the environment resolves on every node;
# writing a file confirms the launcher actually started a process on that node.
with open('sample_file.txt', 'w') as f:
    f.write("using pdsh for distributed setup is a success!")

The command to run (to check that sample_file.txt is created on both nodes):

deepspeed --hostfile hostfile sample.py
Aaryan369 commented 2 years ago

Yes, I have been able to ssh from one machine to another using the aliases ssh genca1001 and ssh genca1002.

I tried running sample.py using the command deepspeed --hostfile /job/hostfile sample.py after changing the hostfile, as per your recommendation, to

localhost slots=8
genca1002 slots=8

This was the error I received:

Traceback (most recent call last):
  File "/anaconda/envs/w2v2/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 362, in main
    subprocess.check_call(
  File "/anaconda/envs/w2v2/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'ssh -o PasswordAuthentication=no localhost hostname' returned non-zero exit status 255.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda/envs/w2v2/bin/deepspeed", line 6, in <module>
    main()
  File "/anaconda/envs/w2v2/lib/python3.8/site-packages/deepspeed/launcher/runner.py", line 368, in main
    raise RuntimeError(
RuntimeError: Using hostfile at /job/hostfile but host=localhost was not reachable via ssh. If you are running with a single node please remove /job/hostfile or setup passwordless ssh.

I changed the hostfile back and ran the command deepspeed --hostfile /job/hostfile sample.py. This worked and generated sample_file.txt on both machines.
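
(As an aside, the usual way to make ssh localhost work without a password, which is what the error above asks for, is roughly the following; I did not need it here since reverting the hostfile worked.)

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa              # only if no key exists yet
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh -o PasswordAuthentication=no localhost hostname   # should now print the hostname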

Aaryan369 commented 2 years ago

The initial error I pointed out when running with pdsh (/anaconda/envs/w2v2/bin/python: can't open file 'w2v2_pretrain.py': [Errno 2] No such file or directory) has been resolved. It occurred because the file was at different paths on the two machines; placing w2v2_pretrain.py at the same path on both machines fixed it.
Now, when I run the command accelerate launch w2v2_pretrain.py on my main machine using the hostfile

genca1001 slots=8
genca1002 slots=8

the code runs without any error until the line model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader), after which no further output or logs are displayed on the screen. Some memory is also allocated on each of the 16 GPUs.

This is the last output on the screen:

genca1002: Loading extension module utils...
genca1002: Time to load utils op: 0.20236802101135254 seconds
genca1001: Using /home/azureuser/.cache/torch_extensions as PyTorch extensions root...
genca1001: Using /home/azureuser/.cache/torch_extensions as PyTorch extensions root...
genca1002: Loading extension module utils...
genca1002: Time to load utils op: 0.30210161209106445 seconds

I have waited for nearly 30 minutes but there has been no change in the output or the GPU utilization.

pacman100 commented 2 years ago

I have waited for nearly 30 minutes but there has been no change in the output or the GPU utilization.

Is there GPU utilization along with GPU memory usage? This is weird as there is no error. Do the two nodes use slow interconnects? If so, it might be training but infeasibly slowly. I can't help much without errors, logs, or information on GPU and CPU usage. Also, you seem to be doing CPU offloading, so can you please check CPU usage? CPU offloading is slow by itself, and across nodes it might become too slow.

Try printing the loss at each step so that you get a log if training is happening. For example, in the setup I was using for development, the interconnects between nodes were slow, so I added print statements to make sure training had started (a minimal sketch of this kind of per-step printing follows the log below):

localhost: Using /home/sourab/.cache/torch_extensions/py38_cu102 as PyTorch extensions root...
localhost: No modifications detected for re-loaded extension module utils, skipping build step...
localhost: Loading extension module utils...
localhost: Time to load utils op: 0.0002067089080810547 seconds 
localhost: after prepare
localhost: Distributed environment: DEEPSPEED  Backend: nccl
localhost: Num processes: 4
localhost: Process index: 0
localhost: Local process index: 0
localhost: Device: cuda:0
localhost: ds_config: {'train_batch_size': 64, 'train_micro_batch_size_per_gpu': 16, 'gradient_accumulation_steps': 1, 'zero_op
timization': {'stage': 2, 'offload_optimizer': {'device': 'none'}, 'offload_param': {'device': 'none'}, 'stage3_gather_16bit_we
ights_on_model_save': False}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'zero_allow_untested_optimizer': True}
localhost: 
brutasse: loss: 0.44637179374694824, step: 0
localhost: loss: 0.6673623919487, step: 0
brutasse: loss: 0.5470665693283081, step: 0
localhost: loss: 0.47117069363594055, step: 0
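
A minimal sketch of this kind of per-step loss printing (the toy model, data, and hyperparameters below are placeholders, not the actual wav2vec2 training script):

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder model and data just to show where the prints go.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
train_dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)

model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
print("after prepare")  # confirms prepare() returned on this process

for step, (x, y) in enumerate(train_dataloader):
    loss = F.mse_loss(model(x), y)
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    # print from every rank so each node shows up in the launcher's aggregated logs
    print(f"loss: {loss.item()}, step: {step}")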
Aaryan369 commented 2 years ago

No, I just ran the code with the command NCCL_IB_GID_INDEX=3 NCCL_DEBUG=INFO accelerate launch script.py

This is the output on first machine:

Genc-A100-VM:36759:36970 [0] include/socket.h:421 NCCL WARN Call to recv failed : Connection reset by peer
Genc-A100-VM:36759:36970 [0] NCCL INFO transport/net_ib.cc:593 -> 2
Genc-A100-VM:36759:36970 [0] NCCL INFO transport/net_ib.cc:734 -> 2
Genc-A100-VM:36759:36970 [0] NCCL INFO include/net.h:26 -> 2
Genc-A100-VM:36759:36970 [0] NCCL INFO transport/net.cc:348 -> 2
Genc-A100-VM:36759:36970 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

Genc-A100-VM:36755:36969 [0] include/socket.h:421 NCCL WARN Call to recv failed : Connection reset by peer
Genc-A100-VM:36755:36969 [0] NCCL INFO transport/net_ib.cc:593 -> 2
Genc-A100-VM:36755:36969 [0] NCCL INFO transport/net_ib.cc:734 -> 2
Genc-A100-VM:36755:36969 [0] NCCL INFO include/net.h:26 -> 2
Genc-A100-VM:36755:36969 [0] NCCL INFO transport/net.cc:348 -> 2
Genc-A100-VM:36755:36969 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

and on the second machine:

Genc-A100-2:79451:79672 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : Got completion with error 12, opcode 32657, len 32630, vendor err 129
Genc-A100-2:79451:79672 [0] NCCL INFO include/net.h:28 -> 2
Genc-A100-2:79451:79672 [0] NCCL INFO transport/net.cc:357 -> 2
Genc-A100-2:79451:79672 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]

Genc-A100-2:79449:79675 [0] transport/net_ib.cc:818 NCCL WARN NET/IB : Got completion with error 12, opcode 32682, len 32655, vendor err 129
Genc-A100-2:79449:79675 [0] NCCL INFO include/net.h:28 -> 2
Genc-A100-2:79449:79675 [0] NCCL INFO transport/net.cc:357 -> 2
Genc-A100-2:79449:79675 [0] NCCL INFO proxy.cc:198 -> 2 [Proxy Thread]
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
pacman100 commented 2 years ago

Strange that the standard launcher is throwing NCCL errors but the other DeepSpeed launchers are working fine 😅

pacman100 commented 2 years ago

(Repeating the questions above: is there GPU utilization along with GPU memory usage, how fast are the interconnects between the nodes, what is the CPU usage with offloading enabled, and does printing the loss at each step show any progress?)

Please let us know once you have more details as per these points

Aaryan369 commented 2 years ago

Is there GPU utilization along with GPU memory usage?

Yes, the GPU utilization is 100%.

This is weird as there is no error. Do the 2 nodes use slow interconnects, then it might be training but is infeasibly slow due to it. Can't help much without any errors, logs or info on the usage of GPUs and CPUs.

I don't think the interconnect is slow, but I will leave it running a bit longer and check whether that's the case.

CPU offloading would be slow by itself and across nodes might become too slow.

Let me remove CPU offloading and check

Try printing loss at each step so that you get a log if training is happening.

We are using tqdm to see the progress, but the progress bar itself never appears.

pacman100 commented 2 years ago

Yes, I also observed that no progress bar was shown in the multi-node setup, so I resorted to printing the loss every n steps. When I killed the process on the second node, followed by the parent node, it showed the progress bar (with progress) before exiting. Based on your info, training is happening as expected, and I believe slow interconnects plus CPU offloading are the cause here.

pacman100 commented 2 years ago

I am unable to reproduce this issue. I retested the setup that I have with the below config and everything is working as expected:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0 # 1 in the second node
main_process_ip: 192.xxx.x.xx
main_process_port: 29500
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 4
use_cpu: false
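
With a config like this saved on each node (the file and script names below are placeholders, and the second node's config differs only in machine_rank), the script is then launched on both nodes:

accelerate launch --config_file node0_config.yaml script.py   # on the main node (machine_rank: 0)
accelerate launch --config_file node1_config.yaml script.py   # on the second node (machine_rank: 1)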
Aaryan369 commented 2 years ago

Based on your info, training is happening as expected, and I believe slow interconnects plus CPU offloading are the cause here.

Hi, so I left the code to run overnight (without CPU offloading). These are the wandb logs; nothing else was printed on the console. I made sure to print the loss every step, but even that didn't print anything. Do you think the model is running? If so, is there any other way to check?

You can see the console output in the wandb logs (pasted here for your reference).

pacman100 commented 2 years ago

Hello @Aaryan369, in wandb the GPU utilisation is 100%, which should mean training is happening. Did you add print statements as I did to debug? I feel the wandb metrics aren't being updated.

Could you share a minimal script that I can try on the multi-node setup I have access to, in order to reproduce the issue? Currently I am unable to reproduce it, and since there is no error it is even harder to debug.

Could you also try running on a single node (8 GPUs in your case, i.e. a single-node multi-GPU setup) to make sure everything runs as expected?

Aaryan369 commented 2 years ago

Did you have print statements as I did to debug as I feel wandb metrics aren't being updated.

I added a print statement above and below this line

print('first print')
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
print('second print')

The first print statement executes, but the second one never does.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

zhangvia commented 1 year ago

@Aaryan369 you can try running the script on a single node with multiple GPUs. If the script runs on a single node with a single GPU but hangs in NCCL when you use multiple GPUs (multi-node or single node), you can try export NCCL_P2P_DISABLE=1. If that works, the error may be caused by some hardware settings.
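
For example:

export NCCL_P2P_DISABLE=1                            # disable NCCL peer-to-peer transport
NCCL_DEBUG=INFO accelerate launch w2v2_pretrain.py   # keep NCCL logging on to confirm the setting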