microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

[BUG] Failure when using CPU for pipeline-based training across multiple machines (2 machines actually) #5313

Open xuanhua opened 3 months ago

xuanhua commented 3 months ago

Describe the bug

I have two Ubuntu machines connected by a 10 Gb/s Ethernet cable. I want to use DeepSpeed to run model training across these two machines with pipeline parallelism, using only the CPU for training.

Both machines have PyTorch 1.13.1 installed, and the DeepSpeed CPU-accelerator dependencies were installed as follows (see the consolidated commands below): 1) pip install intel-extension-for-pytorch==1.13.100 2) python -m pip install oneccl_bind_pt==1.13 -f https://developer.intel.com/ipex-whl-stable-cpu 3) git clone https://github.com/oneapi-src/oneCCL and build/install it with CMake. The DeepSpeed version is 0.14.0.
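
A consolidated sketch of that setup, assuming a fresh environment; the oneCCL build follows the standard CMake flow from its README, and installing DeepSpeed 0.14.0 via pip is an assumption:

pip install intel-extension-for-pytorch==1.13.100
python -m pip install oneccl_bind_pt==1.13 -f https://developer.intel.com/ipex-whl-stable-cpu
pip install deepspeed==0.14.0           # assumed; any install that yields 0.14.0 works
git clone https://github.com/oneapi-src/oneCCL
cd oneCCL && mkdir build && cd build
cmake .. && make -j install             # standard oneCCL CMake build; installs under build/_install by default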

The command line for launching the training:

deepspeed --master_addr=192.168.23.110  --num_nodes=2  --hostfile=./hostfile_linux  --master_port=29555  pipeline_model.py

The content of ./hostfile_linux:

worker-alienbook-wl slots=1
worker-3090ti-wl slots=1

After changing various arguments, creating soft links with 'ln -s' so that files and directories have the same paths on both machines, and even modifying some installed library code (torch and deepspeed), I thought it would finally work. But I am still stuck on one line in my own pipeline_model.py:

deepspeed.init_distributed(dist_backend="ccl")

Both of my machines enter this function, but they never return from it.

I tried my best to investigate the issue, but made no breakthrough. Any help is appreciated. Maybe I should report this issue to the intel-extension-for-pytorch project instead, but it looks like there are not many people there :(. After being stuck for more than 120 seconds, some backtraces are displayed.

Here are the related log messages:

----- omit some unrelated messages -----
worker-alienbook-wl: Using /home/axu/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
worker-alienbook-wl: Emitting ninja build file /home/axu/.cache/torch_extensions/py39_cu116/deepspeed_ccl_comm/build.ninja...
worker-alienbook-wl: Building extension module deepspeed_ccl_comm...
worker-alienbook-wl: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
worker-alienbook-wl: ninja: no work to do.
worker-alienbook-wl: Loading extension module deepspeed_ccl_comm...
worker-alienbook-wl: Time to load deepspeed_ccl_comm op: 0.09088420867919922 seconds
worker-alienbook-wl: DeepSpeed deepspeed.ops.comm.deepspeed_ccl_comm_op built successfully
worker-alienbook-wl: 2024-03-26 18:28:02,528 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
worker-alienbook-wl: 2024-03-26 18:28:02,530 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
worker-alienbook-wl: 2024:03:26-18:28:02:(43479) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
worker-3090ti-wl: 2024-03-26 18:28:02,591 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
worker-3090ti-wl: 2024-03-26 18:28:02,598 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
worker-3090ti-wl: 2024:03:26-18:28:02:(75330) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
worker-3090ti-wl: write: error: buf 0x558f0a2e4310, size 394, shift 0
worker-3090ti-wl: 2024:03:26-18:30:13:(75330) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (131) >= limit (120)
worker-3090ti-wl: 2024:03:26-18:30:13:(75330) |CCL_ERROR| internal_kvs_server.hpp:66 put: read/write error: Broken pipe
worker-3090ti-wl: 2024:03:26-18:30:13:(75330) |CCL_ERROR| internal_kvs.cpp:108 kvs_get_value_by_name_key: client: get_value
worker-3090ti-wl: 2024:03:26-18:30:13:(75330) |CCL_ERROR| pmi_resizable_simple_internal.cpp:319 get_local_kvs_id: failed to get local kvs id
worker-3090ti-wl: 2024:03:26-18:30:13:(75330) |CCL_ERROR| pmi_resizable_simple_internal.cpp:65 pmrt_init: failed to get local id
worker-3090ti-wl: 2024:03:26-18:30:13:(75330) |CCL_ERROR| atl_ofi_comm.cpp:274 init_transport: pmi init failed
worker-3090ti-wl: 2024:03:26-18:30:13:(75330) |CCL_ERROR| atl_ofi_comm.cpp:79 atl_ofi_comm: condition init_transport(true) == ATL_STATUS_SUCCESS failed
worker-3090ti-wl: init transport failed
worker-3090ti-wl: Traceback (most recent call last):
worker-3090ti-wl:   File "/home/ubuntu/proj/minGPT/pipeline_model.py", line 367, in <module>
worker-3090ti-wl:     main()
worker-3090ti-wl:   File "/home/ubuntu/proj/minGPT/pipeline_model.py", line 309, in main
worker-3090ti-wl:     deepspeed.init_distributed(dist_backend="ccl")
worker-3090ti-wl:   File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 635, in init_distributed
worker-3090ti-wl:     init_deepspeed_backend(get_accelerator().communication_backend_name(), timeout, init_method)
worker-3090ti-wl:   File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 160, in init_deepspeed_backend
worker-3090ti-wl:     ccl_backend = CCLBackend(rank=rank, world_size=size, timeout=timeout, init_method=init_method)
worker-3090ti-wl:   File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/ccl.py", line 53, in __init__
worker-3090ti-wl:     super(CCLBackend, self).broadcast(main_kvs, 0)
worker-3090ti-wl:   File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 205, in broadcast
worker-3090ti-wl:     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
worker-3090ti-wl:   File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1400, in broadcast
worker-3090ti-wl:     work = default_pg.broadcast([tensor], opts)
worker-3090ti-wl: RuntimeError: oneCCL: atl_ofi_comm.cpp:79 atl_ofi_comm: EXCEPTION: init transport failed
worker-3090ti-wl: [2024-03-26 18:30:13,683] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 75330
worker-3090ti-wl: [2024-03-26 18:30:13,684] [ERROR] [launch.py:322:sigkill_handler] ['/home/axu/anaconda3/bin/python', '-u', 'pipeline_model.py', '--local_rank=0'] exits with return code = 1
pdsh@axu-Alienware-15-R4: worker-3090ti-wl: ssh exited with exit code 1

To Reproduce Steps to reproduce the behavior: See previous section.

Expected behavior Pipeline-parallel training should start normally.

ds_report output Please run ds_report to give us details about your setup.

ds_report result for both machines (the INFO line shown is caused by my modification of the DeepSpeed library code):

[2024-03-26 21:12:52,143] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cpu (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented  [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/axu/anaconda3/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/home/axu/anaconda3/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13 
shared memory (/dev/shm) size .... 15.60 GB
[2024-03-26 21:28:52,933] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cpu (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented  [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13 
shared memory (/dev/shm) size .... 62.78 GB

Screenshots If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

Launcher context Are you launching your experiment with the deepspeed launcher, MPI, or something else?

No, only the pdsh launcher is used.

Docker context Are you using a specific docker image that you can share?

No Docker, just physical machines.

Additional context Any other information you need, I will provide it.

delock commented 3 months ago

Hi @xuanhua, from this line it looks like the default launcher is used. Can you try the impi launcher with the following?

deepspeed --launcher impi --num_nodes=2  --hostfile=./hostfile_linux pipeline_model.py
xuanhua commented 3 months ago

Hi @xuanhua, from this line it looks like the default launcher is used. Can you try the impi launcher with the following?

deepspeed --launcher impi --num_nodes=2  --hostfile=./hostfile_linux pipeline_model.py

Hi @delock, it looks like impi mode is incompatible with the --num_nodes setting in deepspeed, so I dropped the --num_nodes option. After a few trivial fixes I got the following errors:

cd /home/axu/proj/minGPT ; /usr/bin/env /home/axu/anaconda3/bin/python /home/axu/.vscode-server/extensions/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 56327 -- /home/axu/anaconda3/bin/deepspeed --launcher impi --master_addr=192.168.23.110 --hostfile=./hostfile_linux --master_port=29555 pipeline_model.py 
[2024-03-27 22:21:49,113] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cpu (override)
['mpirun', '-ppn', '1', '-genv', 'PYTHONIOENCODING', 'UTF-8', '-genv', 'PYTHONUNBUFFERED', '1', '-genv', 'PYTHONPATH', '/home/axu/proj/minGPT', '-genv', 'MASTER_ADDR', '192.168.23.110', '-genv', 'MASTER_PORT', '29555', '-genv', 'WORLD_SIZE', '2', '-genv', 'LOCAL_SIZE', '1', '-genv', 'I_MPI_PIN', '0', '-hosts', 'worker-alienbook-wl,worker-3090ti-wl', '-n', '1', '-env', 'RANK', '0', '-env', 'LOCAL_RANK', '0', '/home/axu/anaconda3/bin/python', '-u', 'pipeline_model.py', ':', '-n', '1', '-env', 'RANK', '1', '-env', 'LOCAL_RANK', '0', '/home/axu/anaconda3/bin/python', '-u', 'pipeline_model.py']
[2024-03-27 22:22:03,115] [INFO] [runner.py:568:main] cmd = mpirun -ppn 1 -genv PYTHONIOENCODING UTF-8 -genv PYTHONUNBUFFERED 1 -genv PYTHONPATH /home/axu/proj/minGPT -genv MASTER_ADDR 192.168.23.110 -genv MASTER_PORT 29555 -genv WORLD_SIZE 2 -genv LOCAL_SIZE 1 -genv I_MPI_PIN 0 -hosts worker-alienbook-wl,worker-3090ti-wl -n 1 -env RANK 0 -env LOCAL_RANK 0 /home/axu/anaconda3/bin/python -u pipeline_model.py : -n 1 -env RANK 1 -env LOCAL_RANK 0 /home/axu/anaconda3/bin/python -u pipeline_model.py
[2024-03-27 22:22:04,661] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cpu (override)
[2024-03-27 22:22:05,287] [INFO] [pipeline_model.py:297:main] use ninjia_installed_dir=/home/axu/anaconda3/bin
[2024-03-27 22:22:05,287] [INFO] [pipeline_model.py:308:main] before deepspeed.init_distributed()
[2024-03-27 22:22:05,287] [INFO] [pipeline_model.py:309:main] env_vars: environ({'SHELL': '/bin/bash', 'LC_ADDRESS': 'zh_CN.UTF-8', 'LC_NAME': 'zh_CN.UTF-8', 'LC_MONETARY': 'zh_CN.UTF-8', 'PWD': '/home/axu/proj/minGPT', 'LOGNAME': 'axu', 'XDG_SESSION_TYPE': 'tty', 'MOTD_SHOWN': 'pam', 'HOME': '/home/axu', 'LANG': 'zh_CN.UTF-8', 'LC_PAPER': 'zh_CN.UTF-8', 'SSH_CONNECTION': '192.168.8.8 55134 192.168.8.18 22', 'XDG_SESSION_CLASS': 'user', 'LC_IDENTIFICATION': 'zh_CN.UTF-8', 'USER': 'axu', 'SHLVL': '2', 'LC_TELEPHONE': 'zh_CN.UTF-8', 'LC_MEASUREMENT': 'zh_CN.UTF-8', 'XDG_SESSION_ID': '192', 'XDG_RUNTIME_DIR': '/run/user/1000', 'SSH_CLIENT': '192.168.8.8 55134 22', 'LC_TIME': 'zh_CN.UTF-8', 'PATH': '/home/axu/anaconda3/bin:/home/axu/bin/:/home/axu/bin/bin:/home/axu/anaconda3/bin:/home/axu/anaconda3/condabin:/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/bin/remote-cli:/home/axu/anaconda3/bin:/home/axu/bin:/home/axu/bin/bin:/home/axu/.cargo/bin:/home/axu/anaconda3/bin:/home/axu/bin:/home/axu/bin/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/bin/remote-cli:/home/axu/anaconda3/bin:/home/axu/bin:/home/axu/bin/bin:/home/axu/.cargo/bin:/home/axu/anaconda3/bin:/home/axu/bin:/home/axu/bin/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin/home/axu/anaconda3/bin:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/bin:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/bin', 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/1000/bus', 'LC_NUMERIC': 'zh_CN.UTF-8', '_': '/usr/bin/env', 'COLORTERM': 'truecolor', 'TERM_PROGRAM_VERSION': '1.87.2', 'CONDA_EXE': '/home/axu/anaconda3/bin/conda', '_CE_M': '', 'LANGUAGE': 'zh_CN:zh:en_US:en', 'CONDA_ROOT': '/home/axu/anaconda3', 'CONDA_PREFIX': '/home/axu/anaconda3', 'VSCODE_GIT_ASKPASS_NODE': '/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/node', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 
'CONDA_PROMPT_MODIFIER': '(base) ', 'GIT_ASKPASS': '/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/extensions/git/dist/askpass.sh', 'VSCODE_GIT_ASKPASS_EXTRA_ARGS': '', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'TERM': 'xterm-256color', 'VSCODE_ENV_REPLACE': 'CONDA_EXE=/home/axu/anaconda3/bin/conda:_CE_M=:CONDA_ROOT=/home/axu/anaconda3:CONDA_PREFIX=/home/axu/anaconda3:CONDA_PROMPT_MODIFIER=(base) :_CE_CONDA=:CONDA_SHLVL=1:CONDA_PYTHON_EXE=/home/axu/anaconda3/bin/python:CONDA_DEFAULT_ENV=base', '_CE_CONDA': '', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'VSCODE_GIT_IPC_HANDLE': '/run/user/1000/vscode-git-24bb1d76d9.sock', 'CONDA_SHLVL': '1', 'CONDA_PYTHON_EXE': '/home/axu/anaconda3/bin/python', 'PS1': '\\[\\033[01;33m\\][\\t]\\[\\033[01;31m\\][0][192.168.8.18 192.168.23.110] \\[\\033[01;32m\\]\\u:\\[\\033[01;36m\\]\\w\\[\\033[00m\\] $', 'CONDA_DEFAULT_ENV': 'base', 'VSCODE_GIT_ASKPASS_MAIN': '/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/extensions/git/dist/askpass-main.js', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'BROWSER': '/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/bin/helpers/browser.sh', 'VSCODE_ENV_PREPEND': 'PATH=/home/axu/anaconda3/bin\\x3a/home/axu/anaconda3/condabin\\x3a/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/bin/remote-cli\\x3a/home/axu/anaconda3/bin\\x3a/home/axu/bin\\x3a/home/axu/bin/bin\\x3a/home/axu/.cargo/bin\\x3a/home/axu/anaconda3/bin\\x3a/home/axu/bin\\x3a/home/axu/bin/bin\\x3a/usr/local/sbin\\x3a/usr/local/bin\\x3a/usr/sbin\\x3a/usr/bin\\x3a/sbin\\x3a/bin\\x3a/usr/games\\x3a/usr/local/games\\x3a/snap/bin\\x3a', 'OLDPWD': '/home/axu/proj/minGPT', 'TERM_PROGRAM': 'vscode', 'VSCODE_IPC_HOOK_CLI': '/run/user/1000/vscode-ipc-384eec4e-08d1-4605-b755-ddcae2b3be38.sock', 'DS_ACCELERATOR': 'cpu', 'PYTHONIOENCODING': 'UTF-8', 'PYTHONUNBUFFERED': '1', 'PYDEVD_USE_FRAME_EVAL': 'NO', 'CCL_ROOT': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install', 'FI_PROVIDER_PATH': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install/opt/mpi/libfabric/lib/prov:/usr/lib64/libfabric', 'PYTHONPATH': '/home/axu/proj/minGPT', 'MASTER_ADDR': '192.168.23.110', 'MASTER_PORT': '29555', 'WORLD_SIZE': '2', 'LOCAL_SIZE': '1', 'I_MPI_PIN': '0', 'RANK': '1', 'LOCAL_RANK': '0', 'GFORTRAN_UNBUFFERED_PRECONNECTED': 'y', 'MPIR_CVAR_CH3_INTERFACE_HOSTNAME': 'worker-3090ti-wl', 'MPI_LOCALNRANKS': '1', 'MPI_LOCALRANKID': '0', 'PMI_RANK': '1', 'PMI_FD': '5', 'PMI_SIZE': '2', 'I_MPI_ROOT': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install', 'CMAKE_PREFIX_PATH': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib/cmake/oneCCL:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib/cmake/oneCCL', 'LIBRARY_PATH': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib', 'LD_LIBRARY_PATH': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install/opt/mpi/libfabric/lib:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/opt/mpi/libfabric/lib:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib', 'CPATH': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install/include:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/include'})
Using /home/axu/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...
Creating extension directory /home/axu/.cache/torch_extensions/py39_cu117/deepspeed_ccl_comm...
Emitting ninja build file /home/axu/.cache/torch_extensions/py39_cu117/deepspeed_ccl_comm/build.ninja...
Building extension module deepspeed_ccl_comm...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[2024-03-27 22:22:05,430] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cpu (override)
[2024-03-27 22:22:06,548] [INFO] [pipeline_model.py:297:main] use ninjia_installed_dir=/home/axu/anaconda3/bin
[2024-03-27 22:22:06,548] [INFO] [pipeline_model.py:308:main] before deepspeed.init_distributed()
[2024-03-27 22:22:06,548] [INFO] [pipeline_model.py:309:main] env_vars: environ({'SHELL': '/bin/bash', 'LANGUAGE': 'zh_CN:zh:en_US:en', 'LC_ADDRESS': 'zh_CN.UTF-8', 'LC_NAME': 'zh_CN.UTF-8', 'LC_MONETARY': 'zh_CN.UTF-8', 'PWD': '/home/axu/proj/minGPT', 'LOGNAME': 'axu', 'XDG_SESSION_TYPE': 'tty', 'MOTD_SHOWN': 'pam', 'HOME': '/home/axu', 'LANG': 'zh_CN.UTF-8', 'LC_PAPER': 'zh_CN.UTF-8', 'SSH_CONNECTION': '192.168.8.8 55134 192.168.8.18 22', 'XDG_SESSION_CLASS': 'user', 'LC_IDENTIFICATION': 'zh_CN.UTF-8', 'USER': 'axu', 'SHLVL': '2', 'LC_TELEPHONE': 'zh_CN.UTF-8', 'LC_MEASUREMENT': 'zh_CN.UTF-8', 'XDG_SESSION_ID': '192', 'XDG_RUNTIME_DIR': '/run/user/1000', 'SSH_CLIENT': '192.168.8.8 55134 22', 'LC_TIME': 'zh_CN.UTF-8', 'PATH': '/home/axu/anaconda3/bin:/home/axu/bin/:/home/axu/bin/bin:/home/axu/anaconda3/bin:/home/axu/anaconda3/condabin:/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/bin/remote-cli:/home/axu/anaconda3/bin:/home/axu/bin:/home/axu/bin/bin:/home/axu/.cargo/bin:/home/axu/anaconda3/bin:/home/axu/bin:/home/axu/bin/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/bin/remote-cli:/home/axu/anaconda3/bin:/home/axu/bin:/home/axu/bin/bin:/home/axu/.cargo/bin:/home/axu/anaconda3/bin:/home/axu/bin:/home/axu/bin/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin/home/axu/anaconda3/bin:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/bin:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/bin', 'DBUS_SESSION_BUS_ADDRESS': 'unix:path=/run/user/1000/bus', 'LC_NUMERIC': 'zh_CN.UTF-8', '_': '/usr/bin/env', 'COLORTERM': 'truecolor', 'TERM_PROGRAM_VERSION': '1.87.2', 'CONDA_EXE': '/home/axu/anaconda3/bin/conda', '_CE_M': '', 'CONDA_ROOT': '/home/axu/anaconda3', 'CONDA_PREFIX': '/home/axu/anaconda3', 'VSCODE_GIT_ASKPASS_NODE': '/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/node', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 
'CONDA_PROMPT_MODIFIER': '(base) ', 'GIT_ASKPASS': '/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/extensions/git/dist/askpass.sh', 'VSCODE_GIT_ASKPASS_EXTRA_ARGS': '', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'TERM': 'xterm-256color', 'VSCODE_ENV_REPLACE': 'CONDA_EXE=/home/axu/anaconda3/bin/conda:_CE_M=:CONDA_ROOT=/home/axu/anaconda3:CONDA_PREFIX=/home/axu/anaconda3:CONDA_PROMPT_MODIFIER=(base) :_CE_CONDA=:CONDA_SHLVL=1:CONDA_PYTHON_EXE=/home/axu/anaconda3/bin/python:CONDA_DEFAULT_ENV=base', '_CE_CONDA': '', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'VSCODE_GIT_IPC_HANDLE': '/run/user/1000/vscode-git-24bb1d76d9.sock', 'CONDA_SHLVL': '1', 'CONDA_PYTHON_EXE': '/home/axu/anaconda3/bin/python', 'PS1': '\\[\\033[01;33m\\][\\t]\\[\\033[01;31m\\][0][192.168.8.18 192.168.23.110] \\[\\033[01;32m\\]\\u:\\[\\033[01;36m\\]\\w\\[\\033[00m\\] $', 'CONDA_DEFAULT_ENV': 'base', 'VSCODE_GIT_ASKPASS_MAIN': '/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/extensions/git/dist/askpass-main.js', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', 'BROWSER': '/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/bin/helpers/browser.sh', 'VSCODE_ENV_PREPEND': 'PATH=/home/axu/anaconda3/bin\\x3a/home/axu/anaconda3/condabin\\x3a/home/axu/.vscode-server/bin/863d2581ecda6849923a2118d93a088b0745d9d6/bin/remote-cli\\x3a/home/axu/anaconda3/bin\\x3a/home/axu/bin\\x3a/home/axu/bin/bin\\x3a/home/axu/.cargo/bin\\x3a/home/axu/anaconda3/bin\\x3a/home/axu/bin\\x3a/home/axu/bin/bin\\x3a/usr/local/sbin\\x3a/usr/local/bin\\x3a/usr/sbin\\x3a/usr/bin\\x3a/sbin\\x3a/bin\\x3a/usr/games\\x3a/usr/local/games\\x3a/snap/bin\\x3a', 'OLDPWD': '/home/axu/proj/minGPT', 'TERM_PROGRAM': 'vscode', 'VSCODE_IPC_HOOK_CLI': '/run/user/1000/vscode-ipc-384eec4e-08d1-4605-b755-ddcae2b3be38.sock', 'DS_ACCELERATOR': 'cpu', 'PYTHONIOENCODING': 'UTF-8', 'PYTHONUNBUFFERED': '1', 'PYDEVD_USE_FRAME_EVAL': 'NO', 'CCL_ROOT': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install', 'FI_PROVIDER_PATH': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install/opt/mpi/libfabric/lib/prov:/usr/lib64/libfabric', 'PYTHONPATH': '/home/axu/proj/minGPT', 'MASTER_ADDR': '192.168.23.110', 'MASTER_PORT': '29555', 'WORLD_SIZE': '2', 'LOCAL_SIZE': '1', 'I_MPI_PIN': '0', 'RANK': '0', 'LOCAL_RANK': '0', 'GFORTRAN_UNBUFFERED_PRECONNECTED': 'y', 'MPIR_CVAR_CH3_INTERFACE_HOSTNAME': 'worker-alienbook-wl', 'MPI_LOCALNRANKS': '1', 'MPI_LOCALRANKID': '0', 'PMI_RANK': '0', 'PMI_FD': '5', 'PMI_SIZE': '2', 'I_MPI_ROOT': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install', 'CMAKE_PREFIX_PATH': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib/cmake/oneCCL:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib/cmake/oneCCL', 'LIBRARY_PATH': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib', 'LD_LIBRARY_PATH': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install/opt/mpi/libfabric/lib:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/opt/mpi/libfabric/lib:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib', 'CPATH': '/home/axu/proj/minGPT/tmp/oneCCL/build/_install/include:/home/axu/proj/minGPT/tmp/oneCCL/build/_install/include'})
Using /home/axu/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
Emitting ninja build file /home/axu/.cache/torch_extensions/py39_cu116/deepspeed_ccl_comm/build.ninja...
Building extension module deepspeed_ccl_comm...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module deepspeed_ccl_comm...
Time to load deepspeed_ccl_comm op: 0.09026837348937988 seconds
DeepSpeed deepspeed.ops.comm.deepspeed_ccl_comm_op built successfully
[1/2] c++ -MMD -MF ccl.o.d -DTORCH_EXTENSION_NAME=deepspeed_ccl_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/cpu/includes -isystem /home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /home/ubuntu/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O2 -fopenmp -c /home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/cpu/comm/ccl.cpp -o ccl.o 
/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/cpu/comm/ccl.cpp: In function ‘void initialize(int, int, at::Tensor&)’:
/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/cpu/comm/ccl.cpp:336:46: warning: ISO C++ forbids converting a string constant to ‘char*’ [-Wwrite-strings]
  336 |     if (addr_string == NULL) { addr_string = ""; }
      |                                              ^~
/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/cpu/comm/ccl.cpp:338:46: warning: ISO C++ forbids converting a string constant to ‘char*’ [-Wwrite-strings]
  338 |     if (port_string == NULL) { port_string = ""; }
      |                                              ^~
[2/2] c++ ccl.o -shared -lccl -L/home/axu/proj/minGPT/tmp/oneCCL/build/_install/lib -L/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o deepspeed_ccl_comm.so
Loading extension module deepspeed_ccl_comm...
Time to load deepspeed_ccl_comm op: 9.185137510299683 seconds
DeepSpeed deepspeed.ops.comm.deepspeed_ccl_comm_op built successfully
2024-03-27 22:22:14,542 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
2024-03-27 22:22:14,554 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2024-03-27 22:22:14,547 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
2024-03-27 22:22:14,557 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2024:03:27-22:24:25:(86734) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (131) >= limit (120)
write: error: buf 0x55c25a179110, size 394, shift 0
2024:03:27-22:24:25:(86734) |CCL_ERROR| internal_kvs_server.hpp:66 put: read/write error: Broken pipe
2024:03:27-22:24:25:(86734) |CCL_ERROR| internal_kvs.cpp:108 kvs_get_value_by_name_key: client: get_value
2024:03:27-22:24:25:(86734) |CCL_ERROR| pmi_resizable_simple_internal.cpp:319 get_local_kvs_id: failed to get local kvs id
2024:03:27-22:24:25:(86734) |CCL_ERROR| pmi_resizable_simple_internal.cpp:65 pmrt_init: failed to get local id
2024:03:27-22:24:25:(86734) |CCL_ERROR| atl_mpi_comm.cpp:88 init_transport: pmi init failed

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 86734 RUNNING AT worker-3090ti-wl
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@axu-Alienware-15-R4] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:878): assert (!closed) failed
[proxy:0:0@axu-Alienware-15-R4] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@axu-Alienware-15-R4] main (pm/pmiserv/pmip.c:200): demux engine error waiting for event
[mpiexec@axu-Alienware-15-R4] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:75): one of the processes terminated badly; aborting
[mpiexec@axu-Alienware-15-R4] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:22): launcher returned error waiting for completion
[mpiexec@axu-Alienware-15-R4] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:215): launcher returned error waiting for completion
[mpiexec@axu-Alienware-15-R4] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
delock commented 3 months ago

Hi @xuanhua, this error indicates a connection timeout. Can you confirm whether you have set up passwordless SSH login? https://www.redhat.com/sysadmin/passwordless-ssh 2024:03:27-22:24:25:(86734) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (131) >= limit (120)
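
A minimal sketch of setting up passwordless SSH between the two workers, along the lines of the linked article; <user> is a placeholder and the hostnames are taken from the hostfile above:

ssh-keygen -t rsa                        # accept the default path, empty passphrase
ssh-copy-id <user>@worker-alienbook-wl   # repeat for every worker in the hostfile
ssh-copy-id <user>@worker-3090ti-wl
ssh worker-3090ti-wl hostname            # should print the hostname without a password prompt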

xuanhua commented 3 months ago

Hi @xuanhua, this error indicates a connection timeout. Can you confirm whether you have set up passwordless SSH login? https://www.redhat.com/sysadmin/passwordless-ssh 2024:03:27-22:24:25:(86734) |CCL_ERROR| internal_kvs.cpp:529 kvs_init: connection time (131) >= limit (120)

Hi @delock, yes, the timeout message is the key. I used torchrun to check whether the two machines could communicate with each other correctly in a very basic send/recv case. I finally found out that the cause was not explicitly telling DeepSpeed which network interface to use. One of the two machines has both a wireless and an Ethernet connection, so I think the network interface that DeepSpeed auto-detected is not the correct one, and that caused the final connection timeout. (A sketch of pinning the interface explicitly is below.)
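
A sketch of pinning the wired interface explicitly before launching; the interface name enp5s0 is hypothetical (check the real name with ip addr on each machine), and GLOO_SOCKET_IFNAME / FI_TCP_IFACE are the usual knobs for torch.distributed gloo tests and for the libfabric TCP path used by oneCCL ATL/OFI, respectively:

export GLOO_SOCKET_IFNAME=enp5s0   # hypothetical wired NIC name; used by torch.distributed / torchrun (gloo)
export FI_TCP_IFACE=enp5s0         # hypothetical; libfabric TCP provider used by oneCCL ATL/OFI
deepspeed --master_addr=192.168.23.110 --num_nodes=2 --hostfile=./hostfile_linux --master_port=29555 pipeline_model.py

With the pdsh launcher these exports may also need to be listed in a .deepspeed_env file so they get propagated to the remote ranks.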

xuanhua commented 3 months ago

Hi @delock, thank you so much for your patience. One more thing to double-check: DeepSpeed's pipeline parallelism can support model training across multiple nodes, right? And it can work together with ZeRO-1 optimization, so that the optimizer states corresponding to each stage's parameters also reside on the same node? Thank you so much for your reply.

delock commented 3 months ago

Hi @xuanhua, pipeline should work across multiple nodes. My understanding is that if you combine pipeline with ZeRO-1, you get two-dimensional parallelism. In the first dimension, weights and optimizer states are sharded between pipeline stages. In the second dimension, the optimizer states of each pipeline stage are sharded among the data-parallel ranks of ZeRO-1.

Say you have 8 nodes; then you could run pipeline parallelism with 4 stages and ZeRO-1 with dp=2, which makes each node hold 1/8 of the optimizer states (a launch sketch is below).
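
A rough sketch of that arithmetic using the same launcher flags as earlier in this thread; the hostfile contents, script name, and num_stages value are illustrative:

# hostfile: 8 workers with slots=1 each; train.py builds the model with
# deepspeed.pipe.PipelineModule(..., num_stages=4) and a ds_config containing "zero_optimization": {"stage": 1}
deepspeed --hostfile=./hostfile --num_nodes=8 train.py
# world_size = 8 ranks, 4 pipeline stages  =>  ZeRO-1 data-parallel degree = 8 / 4 = 2
# each rank holds 1/4 of the weights (its stage) and 1/8 of the optimizer states (1/4 per stage x 1/2 ZeRO-1 shard)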

xuanhua commented 3 months ago

Hi @delock, I used two Docker containers (built from the Dockerfile provided on DeepSpeed's master branch on GitHub), and they can now communicate with each other over the network. Their hostnames are tt1 and tt2 (as you will see in the following logs). But it still fails with: RuntimeError: ProcessGroupCCL does not support send. For more details, you can check the full logs below. (Previously you suggested using the impi launcher, but in the Docker case there was a "mpirun.real cannot be found" error; I did not dig into it, so I switched back to the pdsh launcher, which also seems to work.)

[2024-04-08 07:26:05,769] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (override)
[2024-04-08 07:26:06,518] [INFO] [multinode_runner.py:81:get_cmd] Running on the following workers: tt1,tt2
[2024-04-08 07:26:06,519] [INFO] [runner.py:568:main] cmd = pdsh -S -f 1024 -w tt1,tt2 export PYTHON_VERSION=3; export PYTHONPATH=/home/deepspeed/mount/minGPT;  cd /home/deepspeed/mount/minGPT; /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJ0dDEiOiBbMF0sICJ0dDIiOiBbMF19 --node_rank=%n --master_addr=tt1 --master_port=29555 pipeline_model.py
tt1: [2024-04-08 07:26:07,354] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.19.3-1
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.19.3-1
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.19.3-1+cuda12.2
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.19.3-1+cuda12.2
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.19.3-1
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:145:main] WORLD INFO DICT: {'tt1': [0], 'tt2': [0]}
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'tt1': [0], 'tt2': [1]})
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:163:main] dist_world_size=2
tt1: [2024-04-08 07:26:07,950] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
tt1: [2024-04-08 07:26:07,954] [INFO] [launch.py:253:main] process 4062 spawned with command: ['/usr/bin/python', '-u', 'pipeline_model.py', '--local_rank=0']
tt2: [2024-04-08 07:26:07,905] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
tt1: [2024-04-08 07:26:08,514] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
tt1: [2024-04-08 07:26:09,119] [INFO] [pipeline_model.py:307:main] use ninjia_installed_dir=/usr/local/bin
tt1: [2024-04-08 07:26:09,119] [INFO] [pipeline_model.py:318:main] before deepspeed.init_distributed()
tt1: [2024-04-08 07:26:09,119] [INFO] [pipeline_model.py:319:main] env_vars: environ({'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'HTTPS_PROXY': 'https://localhost:3213', 'NV_CUDA_COMPAT_PACKAGE': 'cuda-compat-12-2', 'NV_LIBCUBLAS_VERSION': '12.2.5.6-1', 'NV_NVPROF_DEV_PACKAGE': 'cuda-nvprof-12-2=12.2.142-1', 'NV_PEER_MEM_VERSION': '1.2', 'USER': 'deepspeed', 'no_proxy': '*.ubuntu.com,.example2.com,127.0.0.0/8', 'NV_CUDA_NSIGHT_COMPUTE_VERSION': '12.2.2-1', 'SSH_CLIENT': '10.0.1.2 54910 22', 'MLNX_OFED_VERSION': '4.9-7.1.0.0', 'LD_LIBRARY_PATH': '$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/libfabric/lib:$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/lib', 'NV_LIBNCCL_PACKAGE_VERSION': '2.19.3-1', 'HOME': '/home/deepspeed', 'MOTD_SHOWN': 'pam', 'OLDPWD': '/home/deepspeed', 'NO_PROXY': '*.ubuntu.com,.example2.com,127.0.0.0/8', 'NV_LIBCUBLAS_DEV_VERSION': '12.2.5.6-1', 'LC_CTYPE': 'C.UTF-8', 'NV_LIBNCCL_DEV_PACKAGE_VERSION': '2.19.3-1', 'NV_LIBNPP_PACKAGE': 'libnpp-12-2=12.2.1.4-1', 'CUDA_VERSION': '12.2.2', 'https_proxy': 'https://localhost:3213', 'NV_NVPROF_VERSION': '12.2.142-1', 'LOGNAME': 'deepspeed', 'NV_LIBCUBLAS_PACKAGE_NAME': 'libcublas-12-2', 'http_proxy': 'http://localhost:3213', 'NVIDIA_REQUIRE_CUDA': 'cuda>=12.2 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'NV_CUDA_LIB_VERSION': '12.2.2-1', 'NV_LIBCUSPARSE_VERSION': '12.1.2.141-1', 'NV_LIBNCCL_PACKAGE_NAME': 'libnccl2', 'NV_NVML_DEV_VERSION': '12.2.140-1', 'NV_LIBNPP_DEV_PACKAGE': 'libnpp-dev-12-2=12.2.1.4-1', 'NV_CUDA_CUDART_VERSION': '12.2.140-1', 'PATH': '$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/bin:/usr/local/bin:${PATH}$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/bin:/usr/local/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/sbin:/usr/bin:/sbin:/bin', 'NVARCH': 'x86_64', 'NV_LIBCUBLAS_PACKAGE': 'libcublas-12-2=12.2.5.6-1', 'NV_LIBCUBLAS_DEV_PACKAGE_NAME': 'libcublas-dev-12-2', 'NV_PEER_MEM_TAG': '1.2-0', 'NV_LIBNCCL_PACKAGE': 'libnccl2=2.19.3-1+cuda12.2', 'NV_LIBCUSPARSE_DEV_VERSION': '12.1.2.141-1', 'NV_LIBNCCL_DEV_PACKAGE_NAME': 'libnccl-dev', 'NVIDIA_PRODUCT_NAME': 'CUDA', 'NV_CUDA_CUDART_DEV_VERSION': '12.2.140-1', 'DEBIAN_FRONTEND': 'noninteractive', 'SHELL': '/bin/sh', 'NV_LIBCUBLAS_DEV_PACKAGE': 'libcublas-dev-12-2=12.2.5.6-1', 'OPENMPI_VERSION': '4.1.6', 'NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE': 'cuda-nsight-compute-12-2=12.2.2-1', 'PYTHON_VERSION': '3', 'NV_LIBNCCL_DEV_PACKAGE': 'libnccl-dev=2.19.3-1+cuda12.2', 'STAGE_DIR': '/tmp', 'NV_NVTX_VERSION': '12.2.140-1', 'NV_LIBNPP_VERSION': '12.2.1.4-1', 'PWD': '/home/deepspeed/mount/minGPT', 'SSH_CONNECTION': '10.0.1.2 54910 10.0.1.2 22', 'HTTP_PROXY': 'http://localhost:3213', 'PYTHONPATH': '/home/deepspeed/mount/minGPT', 
'NVIDIA_VISIBLE_DEVICES': 'all', 'NCCL_VERSION': '2.19.3-1', 'NV_LIBNPP_DEV_VERSION': '12.2.1.4-1', 'OPENMPI_BASEVERSION': '4.1', 'CCL_ROOT': '/home/deepspeed/installers/oneCCL/build/_install', 'FI_PROVIDER_PATH': '/home/deepspeed/.local/lib/python3.8/site-packages/oneccl_bindings_for_pytorch/lib/prov', 'CUDA_VISIBLE_DEVICES': '0', 'MASTER_ADDR': 'tt1', 'MASTER_PORT': '29555', 'WORLD_SIZE': '2', 'CROSS_RANK': '0', 'CROSS_SIZE': '2', 'LOCAL_SIZE': '1', 'RANK': '0', 'LOCAL_RANK': '0', 'I_MPI_ROOT': '/home/deepspeed/installers/oneCCL/build/_install', 'CPATH': '/home/deepspeed/installers/oneCCL/build/_install/include:$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/include', 'CMAKE_PREFIX_PATH': '$/home/deepspeed/installers/oneCCL/build/_install/lib/cmake/oneCCL', 'CPLUS_INCLUDE_PATH': '$/home/deepspeed/installers/oneCCL/build/_install/include:$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/include'})
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:138:main] 1 NV_LIBNCCL_PACKAGE_VERSION=2.19.3-1
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:138:main] 1 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.19.3-1
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:138:main] 1 NV_LIBNCCL_PACKAGE_NAME=libnccl2
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:138:main] 1 NV_LIBNCCL_PACKAGE=libnccl2=2.19.3-1+cuda12.2
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:138:main] 1 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:138:main] 1 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.19.3-1+cuda12.2
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:138:main] 1 NCCL_VERSION=2.19.3-1
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:145:main] WORLD INFO DICT: {'tt1': [0], 'tt2': [0]}
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'tt1': [0], 'tt2': [1]})
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:163:main] dist_world_size=2
tt2: [2024-04-08 07:26:09,180] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
tt2: [2024-04-08 07:26:09,186] [INFO] [launch.py:253:main] process 2667 spawned with command: ['/usr/bin/python', '-u', 'pipeline_model.py', '--local_rank=0']
tt1: No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
tt1: Using /home/deepspeed/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
tt1: Emitting ninja build file /home/deepspeed/.cache/torch_extensions/py38_cu117/deepspeed_ccl_comm/build.ninja...
tt1: Building extension module deepspeed_ccl_comm...
tt1: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
tt1: ninja: no work to do.
tt1: Loading extension module deepspeed_ccl_comm...
tt1: Time to load deepspeed_ccl_comm op: 0.04359579086303711 seconds
tt1: DeepSpeed deepspeed.ops.comm.deepspeed_ccl_comm_op built successfully
tt2: [2024-04-08 07:26:10,208] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
tt2: [2024-04-08 07:26:11,485] [INFO] [pipeline_model.py:307:main] use ninjia_installed_dir=/usr/local/bin
tt2: [2024-04-08 07:26:11,486] [INFO] [pipeline_model.py:318:main] before deepspeed.init_distributed()
tt2: [2024-04-08 07:26:11,486] [INFO] [pipeline_model.py:319:main] env_vars: environ({'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'HTTPS_PROXY': 'https://localhost:3213', 'NV_CUDA_COMPAT_PACKAGE': 'cuda-compat-12-2', 'NV_LIBCUBLAS_VERSION': '12.2.5.6-1', 'NV_NVPROF_DEV_PACKAGE': 'cuda-nvprof-12-2=12.2.142-1', 'NV_PEER_MEM_VERSION': '1.2', 'USER': 'deepspeed', 'no_proxy': '*.ubuntu.com,.example2.com,127.0.0.0/8', 'NV_CUDA_NSIGHT_COMPUTE_VERSION': '12.2.2-1', 'SSH_CLIENT': '10.0.1.2 35316 22', 'MLNX_OFED_VERSION': '4.9-7.1.0.0', 'LD_LIBRARY_PATH': '$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/libfabric/lib:$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/lib', 'NV_LIBNCCL_PACKAGE_VERSION': '2.19.3-1', 'HOME': '/home/deepspeed', 'MOTD_SHOWN': 'pam', 'OLDPWD': '/home/deepspeed', 'NO_PROXY': '*.ubuntu.com,.example2.com,127.0.0.0/8', 'NV_LIBCUBLAS_DEV_VERSION': '12.2.5.6-1', 'LC_CTYPE': 'C.UTF-8', 'NV_LIBNCCL_DEV_PACKAGE_VERSION': '2.19.3-1', 'NV_LIBNPP_PACKAGE': 'libnpp-12-2=12.2.1.4-1', 'CUDA_VERSION': '12.2.2', 'https_proxy': 'https://localhost:3213', 'NV_NVPROF_VERSION': '12.2.142-1', 'LOGNAME': 'deepspeed', 'NV_LIBCUBLAS_PACKAGE_NAME': 'libcublas-12-2', 'http_proxy': 'http://localhost:3213', 'NVIDIA_REQUIRE_CUDA': 'cuda>=12.2 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'NV_CUDA_LIB_VERSION': '12.2.2-1', 'NV_LIBCUSPARSE_VERSION': '12.1.2.141-1', 'NV_LIBNCCL_PACKAGE_NAME': 'libnccl2', 'NV_NVML_DEV_VERSION': '12.2.140-1', 'NV_LIBNPP_DEV_PACKAGE': 'libnpp-dev-12-2=12.2.1.4-1', 'NV_CUDA_CUDART_VERSION': '12.2.140-1', 'PATH': '$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/bin:/usr/local/bin:${PATH}$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/bin:/usr/local/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/sbin:/usr/bin:/sbin:/bin', 'NVARCH': 'x86_64', 'NV_LIBCUBLAS_PACKAGE': 'libcublas-12-2=12.2.5.6-1', 'NV_LIBCUBLAS_DEV_PACKAGE_NAME': 'libcublas-dev-12-2', 'NV_PEER_MEM_TAG': '1.2-0', 'NV_LIBNCCL_PACKAGE': 'libnccl2=2.19.3-1+cuda12.2', 'NV_LIBCUSPARSE_DEV_VERSION': '12.1.2.141-1', 'NV_LIBNCCL_DEV_PACKAGE_NAME': 'libnccl-dev', 'NVIDIA_PRODUCT_NAME': 'CUDA', 'NV_CUDA_CUDART_DEV_VERSION': '12.2.140-1', 'DEBIAN_FRONTEND': 'noninteractive', 'SHELL': '/bin/sh', 'NV_LIBCUBLAS_DEV_PACKAGE': 'libcublas-dev-12-2=12.2.5.6-1', 'OPENMPI_VERSION': '4.1.6', 'NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE': 'cuda-nsight-compute-12-2=12.2.2-1', 'PYTHON_VERSION': '3', 'NV_LIBNCCL_DEV_PACKAGE': 'libnccl-dev=2.19.3-1+cuda12.2', 'STAGE_DIR': '/tmp', 'NV_NVTX_VERSION': '12.2.140-1', 'NV_LIBNPP_VERSION': '12.2.1.4-1', 'PWD': '/home/deepspeed/mount/minGPT', 'SSH_CONNECTION': '10.0.1.2 35316 10.0.1.4 22', 'HTTP_PROXY': 'http://localhost:3213', 'PYTHONPATH': '/home/deepspeed/mount/minGPT', 
'NVIDIA_VISIBLE_DEVICES': 'all', 'NCCL_VERSION': '2.19.3-1', 'NV_LIBNPP_DEV_VERSION': '12.2.1.4-1', 'OPENMPI_BASEVERSION': '4.1', 'CCL_ROOT': '/home/deepspeed/installers/oneCCL/build/_install', 'FI_PROVIDER_PATH': '/home/deepspeed/.local/lib/python3.8/site-packages/oneccl_bindings_for_pytorch/lib/prov', 'CUDA_VISIBLE_DEVICES': '0', 'MASTER_ADDR': 'tt1', 'MASTER_PORT': '29555', 'WORLD_SIZE': '2', 'CROSS_RANK': '1', 'CROSS_SIZE': '2', 'LOCAL_SIZE': '1', 'RANK': '1', 'LOCAL_RANK': '0', 'I_MPI_ROOT': '/home/deepspeed/installers/oneCCL/build/_install', 'CPATH': '/home/deepspeed/installers/oneCCL/build/_install/include:$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/include', 'CMAKE_PREFIX_PATH': '$/home/deepspeed/installers/oneCCL/build/_install/lib/cmake/oneCCL', 'CPLUS_INCLUDE_PATH': '$/home/deepspeed/installers/oneCCL/build/_install/include:$/home/deepspeed/installers/oneCCL/build/_install/opt/mpi/include'})
tt2: No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
tt2: Using /home/deepspeed/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
tt2: Emitting ninja build file /home/deepspeed/.cache/torch_extensions/py38_cu117/deepspeed_ccl_comm/build.ninja...
tt2: Building extension module deepspeed_ccl_comm...
tt2: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
tt2: ninja: no work to do.
tt2: Loading extension module deepspeed_ccl_comm...
tt2: Time to load deepspeed_ccl_comm op: 0.08410215377807617 seconds
tt2: DeepSpeed deepspeed.ops.comm.deepspeed_ccl_comm_op built successfully
tt1: 2024-04-08 07:26:12,081 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 0
tt1: 2024-04-08 07:26:12,082 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
tt1: 2024:04:08-07:26:12:( 4062) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
tt2: 2024-04-08 07:26:11,999 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:1 to store for rank: 1
tt2: 2024-04-08 07:26:12,003 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
tt2: 2024:04:08-07:26:12:( 2667) |CCL_WARN| did not find MPI-launcher specific variables, switch to ATL/OFI, to force enable ATL/MPI set CCL_ATL_TRANSPORT=mpi
tt1: 2024:04:08-07:26:13:( 4062) |CCL_WARN| 
tt1: 
tt1: 
tt2: 2024:04:08-07:26:13:( 2667) |CCL_WARN| 
tt2: 
tt2: 
tt1: [2024-04-08 07:26:13,866] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
tt1: [2024-04-08 07:26:13,866] [INFO] [comm.py:637:init_distributed] cdb=<deepspeed.comm.ccl.CCLBackend object at 0x7fe5695d3280>
tt1: [2024-04-08 07:26:13,866] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
tt1: [2024-04-08 07:26:13,866] [INFO] [pipeline_model.py:321:main] after deepspeed.init_distributed()
tt1: SEED_LAYERS=False BASE_SEED=1234 SEED_FN=None
tt1: 2024-04-08 07:26:13,867 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:2 to store for rank: 0
tt2: [2024-04-08 07:26:13,785] [INFO] [comm.py:161:init_deepspeed_backend] Initialize ccl backend
tt2: [2024-04-08 07:26:13,785] [INFO] [comm.py:637:init_distributed] cdb=<deepspeed.comm.ccl.CCLBackend object at 0x7f645b5bc040>
tt2: [2024-04-08 07:26:13,785] [INFO] [pipeline_model.py:321:main] after deepspeed.init_distributed()
tt1: 2024-04-08 07:26:13,877 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
tt1: Using topology: {ProcessCoord(pipe=0, data=0): 0, ProcessCoord(pipe=1, data=0): 1}
tt1: 2024-04-08 07:26:13,878 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:3 to store for rank: 0
tt2: 2024-04-08 07:26:13,800 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:2 to store for rank: 1
tt2: 2024-04-08 07:26:13,806 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
tt2: 2024-04-08 07:26:13,812 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:3 to store for rank: 1
tt1: 2024-04-08 07:26:13,898 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:3 with 2 nodes.
tt1: 2024-04-08 07:26:13,899 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:4 to store for rank: 0
tt2: 2024-04-08 07:26:13,818 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:3 with 2 nodes.
tt1: 2024-04-08 07:26:13,909 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
tt1: 2024-04-08 07:26:13,910 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:5 to store for rank: 0
tt2: 2024-04-08 07:26:13,824 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:4 to store for rank: 1
tt2: 2024-04-08 07:26:13,832 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:4 with 2 nodes.
tt2: 2024-04-08 07:26:13,835 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:5 to store for rank: 1
tt1: 2024-04-08 07:26:13,921 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:5 with 2 nodes.
tt1: 2024-04-08 07:26:13,922 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:6 to store for rank: 0
tt2: 2024-04-08 07:26:13,839 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:5 with 2 nodes.
tt1: 2024-04-08 07:26:13,932 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:6 with 2 nodes.
tt1: 2024-04-08 07:26:13,933 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:7 to store for rank: 0
tt2: 2024-04-08 07:26:13,843 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:6 to store for rank: 1
tt2: 2024-04-08 07:26:13,853 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:6 with 2 nodes.
tt1: 2024-04-08 07:26:13,943 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:7 with 2 nodes.
tt1: 2024-04-08 07:26:13,944 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:8 to store for rank: 0
tt2: 2024-04-08 07:26:13,863 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:7 to store for rank: 1
tt2: 2024-04-08 07:26:13,874 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:7 with 2 nodes.
tt1: 2024-04-08 07:26:13,965 - torch.distributed.distributed_c10d - INFO - Rank 0: Completed store-based barrier for key:store_based_barrier_key:8 with 2 nodes.
tt1: [2024-04-08 07:26:13,965] [INFO] [module.py:375:_partition_layers] Partitioning pipeline stages with method parameters
tt2: 2024-04-08 07:26:13,879 - torch.distributed.distributed_c10d - INFO - Added key: store_based_barrier_key:8 to store for rank: 1
tt1: stage=0 layers=1
tt1:      0: EmbeddingPipe
tt1: stage=1 layers=3
tt1:      1: BlockPipe
tt1:      2: LmHeadPipe
tt1:      3: LossPipe
tt1: [2024-04-08 07:26:13,975] [INFO] [pipeline_model.py:327:main] before deepspeed.initialize()
tt1: [2024-04-08 07:26:13,975] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.1+9f0e2136, git-hash=9f0e2136, git-branch=master
tt2: 2024-04-08 07:26:13,902 - torch.distributed.distributed_c10d - INFO - Rank 1: Completed store-based barrier for key:store_based_barrier_key:8 with 2 nodes.
tt2: [2024-04-08 07:26:13,916] [INFO] [pipeline_model.py:327:main] before deepspeed.initialize()
tt1: [2024-04-08 07:26:14,418] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
tt1: Using /home/deepspeed/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
tt1: Emitting ninja build file /home/deepspeed/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
tt1: Building extension module cpu_adam...
tt1: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
tt2: Using /home/deepspeed/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
tt1: ninja: no work to do.
tt1: Loading extension module cpu_adam...
tt1: Time to load cpu_adam op: 0.04545783996582031 seconds
tt1: Adam Optimizer #0 is created with scalar arithmetic capability.
tt1: Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000500, adam_w=1
tt1: [2024-04-08 07:26:15,595] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
tt1: [2024-04-08 07:26:15,595] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
tt1: [2024-04-08 07:26:15,595] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
tt1: [2024-04-08 07:26:15,595] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
tt1: [2024-04-08 07:26:15,595] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float32 ZeRO stage 1 optimizer
tt1: [2024-04-08 07:26:15,595] [WARNING] [engine.py:1512:_configure_zero_optimizer] Pipeline parallelism does not support overlapped communication, will be disabled.
tt1: [2024-04-08 07:26:15,595] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 200000000
tt1: [2024-04-08 07:26:15,595] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 200000000
tt1: [2024-04-08 07:26:15,595] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: True
tt1: [2024-04-08 07:26:15,595] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
tt1: [2024-04-08 07:26:15,620] [INFO] [utils.py:809:see_memory_usage] Before initializing optimizer states
tt1: [2024-04-08 07:26:15,620] [INFO] [utils.py:810:see_memory_usage] MA 0.32 GB         Max_MA 0.32 GB         CA 0.32 GB         Max_CA 0 GB 
tt1: [2024-04-08 07:26:15,620] [INFO] [utils.py:817:see_memory_usage] CPU Virtual Memory:  used = 6.5 GB, percent = 5.2%
tt2: Emitting ninja build file /home/deepspeed/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
tt2: Building extension module cpu_adam...
tt2: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
tt1: [2024-04-08 07:26:15,637] [INFO] [utils.py:809:see_memory_usage] After initializing optimizer states
tt1: [2024-04-08 07:26:15,637] [INFO] [utils.py:810:see_memory_usage] MA 0.32 GB         Max_MA 0.32 GB         CA 0.32 GB         Max_CA 0 GB 
tt1: [2024-04-08 07:26:15,638] [INFO] [utils.py:817:see_memory_usage] CPU Virtual Memory:  used = 6.5 GB, percent = 5.2%
tt1: [2024-04-08 07:26:15,638] [INFO] [stage_1_and_2.py:538:__init__] optimizer state initialized
tt1: [2024-04-08 07:26:15,655] [INFO] [utils.py:809:see_memory_usage] After initializing ZeRO optimizer
tt1: [2024-04-08 07:26:15,655] [INFO] [utils.py:810:see_memory_usage] MA 0.32 GB         Max_MA 0.32 GB         CA 0.32 GB         Max_CA 0 GB 
tt1: [2024-04-08 07:26:15,655] [INFO] [utils.py:817:see_memory_usage] CPU Virtual Memory:  used = 6.5 GB, percent = 5.2%
tt1: [2024-04-08 07:26:15,655] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
tt1: [2024-04-08 07:26:15,655] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
tt1: [2024-04-08 07:26:15,655] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
tt1: [2024-04-08 07:26:15,655] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[2e-05], mom=[[0.9, 0.95]]
tt1: [2024-04-08 07:26:15,655] [INFO] [config.py:996:print] DeepSpeedEngine configuration:
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   activation_checkpointing_config  {
tt1:     "partition_activations": false, 
tt1:     "contiguous_memory_optimization": false, 
tt1:     "cpu_checkpointing": false, 
tt1:     "number_checkpoints": null, 
tt1:     "synchronize_checkpoint_boundary": false, 
tt1:     "profile": false
tt1: }
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   amp_enabled .................. False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   amp_params ................... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   autotuning_config ............ {
tt1:     "enabled": false, 
tt1:     "start_step": null, 
tt1:     "end_step": null, 
tt1:     "metric_path": null, 
tt1:     "arg_mappings": null, 
tt1:     "metric": "throughput", 
tt1:     "model_info": null, 
tt1:     "results_dir": "autotuning_results", 
tt1:     "exps_dir": "autotuning_exps", 
tt1:     "overwrite": true, 
tt1:     "fast": true, 
tt1:     "start_profile_step": 3, 
tt1:     "end_profile_step": 5, 
tt1:     "tuner_type": "gridsearch", 
tt1:     "tuner_early_stopping": 5, 
tt1:     "tuner_num_trials": 50, 
tt1:     "model_info_path": null, 
tt1:     "mp_size": 1, 
tt1:     "max_train_batch_size": null, 
tt1:     "min_train_batch_size": 1, 
tt1:     "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
tt1:     "min_train_micro_batch_size_per_gpu": 1, 
tt1:     "num_tuning_micro_batch_sizes": 3
tt1: }
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   bfloat16_enabled ............. False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   bfloat16_immediate_grad_update  False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   checkpoint_parallel_write_pipeline  False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   checkpoint_tag_validation_enabled  True
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   checkpoint_tag_validation_fail  False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fe564ab4760>
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   communication_data_type ...... None
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   compile_config ............... enabled=False backend='inductor' kwargs={}
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   curriculum_enabled_legacy .... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   curriculum_params_legacy ..... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   data_efficiency_enabled ...... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   dataloader_drop_last ......... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   disable_allgather ............ False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   dump_state ................... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   dynamic_loss_scale_args ...... None
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   eigenvalue_enabled ........... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   eigenvalue_gas_boundary_resolution  1
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   eigenvalue_layer_name ........ bert.encoder.layer
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   eigenvalue_layer_num ......... 0
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   eigenvalue_max_iter .......... 100
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   eigenvalue_stability ......... 1e-06
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   eigenvalue_tol ............... 0.01
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   eigenvalue_verbose ........... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   elasticity_enabled ........... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   flops_profiler_config ........ {
tt1:     "enabled": false, 
tt1:     "recompute_fwd_factor": 0.0, 
tt1:     "profile_step": 1, 
tt1:     "module_depth": -1, 
tt1:     "top_modules": 1, 
tt1:     "detailed": true, 
tt1:     "output_file": null
tt1: }
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   fp16_auto_cast ............... None
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   fp16_enabled ................. False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   fp16_master_weights_and_gradients  False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   global_rank .................. 0
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   grad_accum_dtype ............. None
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   gradient_accumulation_steps .. 4
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   gradient_clipping ............ 0.0
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   gradient_predivide_factor .... 1.0
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   graph_harvesting ............. False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   initial_dynamic_scale ........ 65536
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   load_universal_checkpoint .... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   loss_scale ................... 0
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   memory_breakdown ............. False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   mics_hierarchial_params_gather  False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   mics_shard_size .............. -1
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   nebula_config ................ {
tt1:     "enabled": false, 
tt1:     "persistent_storage_path": null, 
tt1:     "persistent_time_interval": 100, 
tt1:     "num_of_version_in_retention": 2, 
tt1:     "enable_nebula_load": true, 
tt1:     "load_path": null
tt1: }
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   optimizer_legacy_fusion ...... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   optimizer_name ............... adam
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   optimizer_params ............. {'lr': 2e-05, 'betas': [0.9, 0.95], 'eps': 1e-08, 'weight_decay': 0.0005}
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   pld_enabled .................. False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   pld_params ................... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   prescale_gradients ........... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   scheduler_name ............... None
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   scheduler_params ............. None
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   seq_parallel_communication_data_type  torch.float32
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   sparse_attention ............. None
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   sparse_gradients_enabled ..... False
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   steps_per_print .............. 5
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   train_batch_size ............. 8
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   train_micro_batch_size_per_gpu  2
tt1: [2024-04-08 07:26:15,656] [INFO] [config.py:1000:print]   use_data_before_expert_parallel_  False
tt1: [2024-04-08 07:26:15,657] [INFO] [config.py:1000:print]   use_node_local_storage ....... False
tt1: [2024-04-08 07:26:15,657] [INFO] [config.py:1000:print]   wall_clock_breakdown ......... False
tt1: [2024-04-08 07:26:15,657] [INFO] [config.py:1000:print]   weight_quantization_config ... None
tt1: [2024-04-08 07:26:15,657] [INFO] [config.py:1000:print]   world_size ................... 1
tt1: [2024-04-08 07:26:15,657] [INFO] [config.py:1000:print]   zero_allow_untested_optimizer  False
tt1: [2024-04-08 07:26:15,657] [INFO] [config.py:1000:print]   zero_config .................. stage=1 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=200000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=200000000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
tt1: [2024-04-08 07:26:15,657] [INFO] [config.py:1000:print]   zero_enabled ................. True
tt1: [2024-04-08 07:26:15,657] [INFO] [config.py:1000:print]   zero_force_ds_cpu_optimizer .. True
tt1: [2024-04-08 07:26:15,657] [INFO] [config.py:1000:print]   zero_optimization_stage ...... 1
tt1: [2024-04-08 07:26:15,657] [INFO] [config.py:986:print_user_config]   json = {
tt1:     "train_micro_batch_size_per_gpu": 2, 
tt1:     "gradient_accumulation_steps": 4, 
tt1:     "optimizer": {
tt1:         "type": "Adam", 
tt1:         "params": {
tt1:             "lr": 2e-05, 
tt1:             "betas": [0.9, 0.95], 
tt1:             "eps": 1e-08, 
tt1:             "weight_decay": 0.0005
tt1:         }
tt1:     }, 
tt1:     "fp16": {
tt1:         "enabled": false
tt1:     }, 
tt1:     "zero_optimization": {
tt1:         "stage": 1, 
tt1:         "offload_optimizer": {
tt1:             "device": "cpu", 
tt1:             "pin_memory": true
tt1:         }, 
tt1:         "allgather_partitions": true, 
tt1:         "allgather_bucket_size": 2.000000e+08, 
tt1:         "overlap_comm": true, 
tt1:         "reduce_scatter": true, 
tt1:         "reduce_bucket_size": 2.000000e+08, 
tt1:         "contiguous_gradients": true
tt1:     }, 
tt1:     "steps_per_print": 5
tt1: }
tt1: [2024-04-08 07:26:15,657] [INFO] [engine.py:101:__init__] CONFIG: micro_batches=4 micro_batch_size=2
tt1: [2024-04-08 07:26:15,657] [INFO] [engine.py:141:__init__] is_pipe_partitioned= False is_grad_partitioned= False
tt2: ninja: no work to do.
tt2: Loading extension module cpu_adam...
tt2: Time to load cpu_adam op: 0.08708024024963379 seconds
tt2: Adam Optimizer #0 is created with scalar arithmetic capability.
tt2: Config: alpha=0.000020, betas=(0.900000, 0.950000), weight_decay=0.000500, adam_w=1
tt2: [2024-04-08 07:26:15,590] [WARNING] [engine.py:1512:_configure_zero_optimizer] Pipeline parallelism does not support overlapped communication, will be disabled.
tt2: [2024-04-08 07:26:15,592] [INFO] [engine.py:141:__init__] is_pipe_partitioned= False is_grad_partitioned= False
tt1: [2024-04-08 07:26:15,679] [INFO] [engine.py:160:__init__] RANK=0 STAGE=0 LAYERS=1 [0, 1) STAGE_PARAMS=672 (0.001M) TOTAL_PARAMS=29184 (0.029M) UNIQUE_PARAMS=29184 (0.029M)
tt2: [2024-04-08 07:26:15,593] [INFO] [engine.py:160:__init__] RANK=1 STAGE=1 LAYERS=3 [1, 4) STAGE_PARAMS=28512 (0.029M) TOTAL_PARAMS=29184 (0.029M) UNIQUE_PARAMS=29184 (0.029M)
tt2: Traceback (most recent call last):
tt2:   File "pipeline_model.py", line 381, in <module>
tt2:     main()
tt2:   File "pipeline_model.py", line 329, in main
tt2:     engine, _, _, _ = deepspeed.initialize(model=model_pipe, 
tt2:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 196, in initialize
tt2:     engine = PipelineEngine(args=args,
tt2:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 225, in __init__
tt2:     p2p.recv(self.loss, self.prev_stage)
tt2:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/p2p.py", line 85, in recv
tt2:     return dist.recv(tensor, src_rank)
tt2:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
tt2:     return func(*args, **kwargs)
tt2:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 364, in recv
tt1: Traceback (most recent call last):
tt1:   File "pipeline_model.py", line 381, in <module>
tt1:     main()
tt1:   File "pipeline_model.py", line 329, in main
tt1:     engine, _, _, _ = deepspeed.initialize(model=model_pipe, 
tt1:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/__init__.py", line 196, in initialize
tt1:     engine = PipelineEngine(args=args,
tt1:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/engine.py", line 220, in __init__
tt1:     p2p.send(self.loss, self.next_stage)
tt1:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/pipe/p2p.py", line 64, in send
tt1:     return dist.send(tensor, dest_rank)
tt1:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
tt1:     return func(*args, **kwargs)
tt1:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/comm.py", line 358, in send
tt1:     return cdb.send(tensor=tensor, dst=dst, group=group, tag=tag)
tt1:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/ccl.py", line 140, in send
tt1:     return self.run_collective(name="send", tensor=tensor, dst=dst, group=group, tag=tag)
tt1:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/ccl.py", line 75, in run_collective
tt1:     eval(func)(*(kwargs.values()))
tt1:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/torch.py", line 301, in send
tt1:     return torch.distributed.send(tensor=tensor, dst=dst, group=group, tag=tag)
tt1:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1159, in send
tt1:     default_pg.send([tensor], dst, tag).wait()
tt1: RuntimeError: ProcessGroupCCL does not support send
tt2:     return cdb.recv(tensor=tensor, src=src, group=group, tag=tag)
tt2:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/ccl.py", line 143, in recv
tt2:     return self.run_collective(name="recv", tensor=tensor, src=src, group=group, tag=tag)
tt2:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/ccl.py", line 75, in run_collective
tt2:     eval(func)(*(kwargs.values()))
tt2:   File "/usr/local/lib/python3.8/dist-packages/deepspeed/comm/torch.py", line 305, in recv
tt2:     return torch.distributed.recv(tensor=tensor, src=src, group=group, tag=tag)
tt2:   File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 1202, in recv
tt2:     pg.recv([tensor], src, tag).wait()
tt2: RuntimeError: ProcessGroupCCL does not support recv
tt1: [2024-04-08 07:26:15,962] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 4062
tt1: [2024-04-08 07:26:15,963] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python', '-u', 'pipeline_model.py', '--local_rank=0'] exits with return code = 1
pdsh@3bba6b9a240e: tt1: ssh exited with exit code 1
tt2: [2024-04-08 07:26:16,196] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2667
tt2: [2024-04-08 07:26:16,197] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python', '-u', 'pipeline_model.py', '--local_rank=0'] exits with return code = 1
pdsh@3bba6b9a240e: tt2: ssh exited with exit code 1
delock commented 3 months ago

@xuanhua I think this error is because the oneCCL binding for PyTorch does not support send/recv yet. I think there are two ways around this:

  1. Switch to the gloo backend by replacing ccl with gloo in deepspeed.init_distributed(dist_backend="ccl") and see if the gloo backend works (see the sketch after this list).
  2. Use a different parallelism strategy, e.g. ZeRO stage 2 across the two nodes, so send/recv won't be needed.
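A minimal sketch of both options, assuming the rest of pipeline_model.py stays as in the report; the ds_config values below are illustrative placeholders, not the reporter's actual config:

```python
import deepspeed

# Option 1 (sketch): replace ccl with gloo; gloo supports CPU send/recv,
# which the oneCCL binding does not.
deepspeed.init_distributed(dist_backend="gloo")

# Option 2 (sketch): drop pipeline parallelism and shard optimizer state and
# gradients with ZeRO stage 2 instead, so point-to-point send/recv is never
# needed. "ds_config" is a hypothetical config dict; adapt it to your setup.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
# engine, _, _, _ = deepspeed.initialize(model=model,
#                                        model_parameters=model.parameters(),
#                                        config=ds_config)
```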
Armarella commented 2 weeks ago

Hi! I faced the same problem with pipeline parallelism in the CPU-only case before the DeepSpeed v0.14.1 release, when the gloo communication backend was added as an alternative to oneCCL. After upgrading DeepSpeed to v0.14.1 (or newer) and switching the communication backend to 'gloo', pipeline parallelism for the CPU-only case works.

Also, I found that training with 'gloo' has better performance (training is faster than with oneCCL).
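For what it's worth, a small hedged sketch of that switch, guarding on the installed DeepSpeed version (the 0.14.1 cutoff follows the comment above; packaging is an assumed extra dependency):

```python
import deepspeed
from packaging import version

# Use gloo when the installed DeepSpeed is new enough to offer it as a CPU
# communication backend; otherwise fall back to ccl.
backend = ("gloo"
           if version.parse(deepspeed.__version__) >= version.parse("0.14.1")
           else "ccl")
deepspeed.init_distributed(dist_backend=backend)
```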

xuanhua commented 1 week ago

@Armarella Glad to hear that it works on DeepSpeed v0.14.1; I will try this later. And with Docker containers, you could have a scalable training infrastructure :)