microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] PDSH_SSH_ARGS_APPEND environment variable is replaced (instead of appended to) #4370

Closed asolano closed 12 months ago

asolano commented 12 months ago

Describe the bug When using the pdsh launcher, the runner replaces any existing content of the PDSH_SSH_ARGS_APPEND environment variable instead of appending to it. As a result, any value the user has exported in the shell is ignored.

Source:

https://github.com/microsoft/DeepSpeed/blob/78c3b148a8a8b6e60ab77a5c75849961f52b143d/deepspeed/launcher/multinode_runner.py#L69

To Reproduce

  1. Set some value for PDSH_SSH_ARGS_APPEND before launching DeepSpeed, e.g. export PDSH_SSH_ARGS_APPEND="-vv" for extra-verbose ssh logs.
  2. Launch DeepSpeed normally.
  3. Inspect the logs to confirm the verbosity option was ignored:
PDSH_SSH_ARGS_APPEND=-vv
[2023-09-19 18:56:10,419] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-19 18:56:13,044] [INFO] [multinode_runner.py:72:get_cmd] Running on the following workers: g0153,g0168
[2023-09-19 18:56:13,044] [INFO] [runner.py:570:main] cmd = pdsh -S -f 1024 -w g0153,g0168 export NCCL_ROOT_DIR_modshare=/apps/nccl/2.12.12-1/cuda11.7:1; export NCCL_HOME_modshare=/apps/nccl/2.12.12-1/cuda11.7:1; export NCCL_HOME=/apps/nccl/2.12.12-1/cuda11.7; export NCCL_ROOT_DIR=/apps/nccl/2.12.12-1/cuda11.7; export NCCL_DEBUG=WARNING; export NCCL_SOCKET_IFNAME=eno; export PYTHONPATH=/home/acb11899xv/stanford_alpaca_gptneox;  cd /home/acb11899xv/stanford_alpaca_gptneox; /home/acb11899xv/miniconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJnMDE1MyI6IFswLCAxLCAyLCAzXSwgImcwMTY4IjogWzAsIDEsIDIsIDNdfQ== --node_rank=%n --master_addr=g0153 --master_port=29500 train_v10.py --model_name_or_path 'matsuo-lab/weblab-10b' --data_path 'alpaca_data.json' --bf16 'False' --output_dir '/home/acb11899xv/shared/temp_output' --num_train_epochs '1' --per_device_train_batch_size '1' --per_device_eval_batch_size '1' --gradient_accumulation_steps '1' --gradient_checkpointing --evaluation_strategy 'no' --save_strategy 'steps' --save_steps '5000' --save_total_limit '4' --learning_rate '4e-6' --weight_decay '0.' --warmup_ratio '0.03' --logging_steps '1' --deepspeed './configs/default_offload_opt_param_v7.json' --cache_dir '/home/acb11899xv/shared/hf_cache/' --tf32 'False' --model_max_length '1024'
g0153: [2023-09-19 18:56:15,252] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

Expected behavior The existing value of PDSH_SSH_ARGS_APPEND is preserved (and appended to if necessary). With the example setup above, the ssh commands should execute with extra verbosity.
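The difference between the two behaviors can be illustrated with a minimal sketch (the function names and argument are hypothetical, not DeepSpeed's actual code; the real assignment lives in multinode_runner.py as linked above):

```python
def build_env_buggy(extra_ssh_args: str, env: dict) -> dict:
    # Current behavior: plain assignment discards any value the user exported,
    # e.g. a pre-set "-vv" is lost.
    env["PDSH_SSH_ARGS_APPEND"] = extra_ssh_args
    return env


def build_env_fixed(extra_ssh_args: str, env: dict) -> dict:
    # Expected behavior: append to the existing value, preserving user settings.
    existing = env.get("PDSH_SSH_ARGS_APPEND", "")
    env["PDSH_SSH_ARGS_APPEND"] = f"{existing} {extra_ssh_args}".strip()
    return env
```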

After manually changing the = to a += in the source code:

PDSH_SSH_ARGS_APPEND=-vv
[2023-09-20 16:16:41,048] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 16:16:43,417] [INFO] [multinode_runner.py:72:get_cmd] Running on the following workers: g0379,g0380
[2023-09-20 16:16:43,417] [INFO] [runner.py:570:main] cmd = pdsh -S -f 1024 -w g0379,g0380 export NCCL_ROOT_DIR_modshare=/apps/nccl/2.12.12-1/cuda11.7:1; export NCCL_HOME_modshare=/apps/nccl/2.12.12-1/cuda11.7:1; export NCCL_HOME=/apps/nccl/2.12.12-1/cuda11.7; export NCCL_ROOT_DIR=/apps/nccl/2.12.12-1/cuda11.7; export NCCL_DEBUG=WARNING; export NCCL_SOCKET_IFNAME=eno; export PYTHONPATH=/home/acb11899xv/stanford_alpaca_gptneox;  cd /home/acb11899xv/stanford_alpaca_gptneox; /home/acb11899xv/miniconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJnMDM3OSI6IFswLCAxLCAyLCAzXSwgImcwMzgwIjogWzAsIDEsIDIsIDNdfQ== --node_rank=%n --master_addr=g0379 --master_port=29500 train_v10.py --model_name_or_path 'matsuo-lab/weblab-10b' --data_path 'alpaca_data.json' --bf16 'False' --output_dir '/home/acb11899xv/shared/temp_output' --num_train_epochs '1' --per_device_train_batch_size '1' --per_device_eval_batch_size '1' --gradient_accumulation_steps '1' --gradient_checkpointing --evaluation_strategy 'no' --save_strategy 'steps' --save_steps '5000' --save_total_limit '4' --learning_rate '4e-6' --weight_decay '0.' --warmup_ratio '0.03' --logging_steps '1' --deepspeed './configs/default_offload_opt_param_v7.json' --cache_dir '/home/acb11899xv/shared/hf_cache/' --tf32 'False' --model_max_length '1024'
g0379: OpenSSH_8.0p1, OpenSSL 1.1.1k  FIPS 25 Mar 2021
g0380: OpenSSH_8.0p1, OpenSSL 1.1.1k  FIPS 25 Mar 2021
g0379: debug1: Reading configuration data /etc/ssh/ssh_config
g0380: debug1: Reading configuration data /etc/ssh/ssh_config
g0379: debug1: Reading configuration data /etc/ssh/ssh_config.d/01-enable_keysign.conf
g0380: debug1: Reading configuration data /etc/ssh/ssh_config.d/01-enable_keysign.conf
g0379: debug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf
g0379: debug2: checking match for 'final all' host g0379 originally g0379
g0379: debug2: match not found
...

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/acb11899xv/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/home/acb11899xv/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 188.13 GB

Screenshots N/A

System info (please complete the following information):

Launcher context Launching with the DeepSpeed launcher:

deepspeed \
    --hostfile ./hostfile \
    --master_addr=$MASTER_ADDR \
    --launcher=pdsh \
    --ssh_port=2299 \
train_v10.py ...

Docker context N/A

Additional context

In the original pull request (https://github.com/microsoft/DeepSpeed/pull/4117), the code used a "+=" instead of a "=", so appending was likely the original intent. Alternatively, if environment variables set in the shell that launches DeepSpeed are not supposed to be preserved, that should be stated in the documentation.
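For reference, the appending pattern would look roughly like this sketch, which merges launcher-derived ssh arguments with whatever the user already exported (the launcher_args value here is only illustrative, e.g. derived from --ssh_port):

```python
import os

# Merge launcher-supplied ssh args with any user-exported value instead
# of overwriting it, so settings like "-vv" survive.
user_args = os.environ.get("PDSH_SSH_ARGS_APPEND", "")
launcher_args = "-p 2299"  # hypothetical, e.g. from --ssh_port=2299
os.environ["PDSH_SSH_ARGS_APPEND"] = f"{user_args} {launcher_args}".strip()
```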

loadams commented 12 months ago

Hi @asolano - good catch, I believe the original intention was +=, and it was lost when the code was moved to the other file. It seems like you've already tested this, but I've made a PR to correct the behavior here: https://github.com/microsoft/DeepSpeed/pull/4373

asolano commented 12 months ago

@loadams Good to hear 👍 (and thanks for the quick response).