microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] PDSH_SSH_ARGS_APPEND environment variable is replaced (instead of appended to) #4370

Closed asolano closed 12 months ago

asolano commented 12 months ago

Describe the bug When using the pdsh launcher, the runner replaces any existing content of the PDSH_SSH_ARGS_APPEND environment variable instead of appending to it. As a result, any value the user has exported in the shell is ignored.

Source:

https://github.com/microsoft/DeepSpeed/blob/78c3b148a8a8b6e60ab77a5c75849961f52b143d/deepspeed/launcher/multinode_runner.py#L69

To Reproduce

  1. Set some value for PDSH_SSH_ARGS_APPEND before launching DeepSpeed, e.g. export PDSH_SSH_ARGS_APPEND="-vv" for extra-verbose ssh logs.
  2. Launch DeepSpeed normally.
  3. Inspect the logs to confirm the verbosity option was ignored:
PDSH_SSH_ARGS_APPEND=-vv
[2023-09-19 18:56:10,419] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-19 18:56:13,044] [INFO] [multinode_runner.py:72:get_cmd] Running on the following workers: g0153,g0168
[2023-09-19 18:56:13,044] [INFO] [runner.py:570:main] cmd = pdsh -S -f 1024 -w g0153,g0168 export NCCL_ROOT_DIR_modshare=/apps/nccl/2.12.12-1/cuda11.7:1; export NCCL_HOME_modshare=/apps/nccl/2.12.12-1/cuda11.7:1; export NCCL_HOME=/apps/nccl/2.12.12-1/cuda11.7; export NCCL_ROOT_DIR=/apps/nccl/2.12.12-1/cuda11.7; export NCCL_DEBUG=WARNING; export NCCL_SOCKET_IFNAME=eno; export PYTHONPATH=/home/acb11899xv/stanford_alpaca_gptneox;  cd /home/acb11899xv/stanford_alpaca_gptneox; /home/acb11899xv/miniconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJnMDE1MyI6IFswLCAxLCAyLCAzXSwgImcwMTY4IjogWzAsIDEsIDIsIDNdfQ== --node_rank=%n --master_addr=g0153 --master_port=29500 train_v10.py --model_name_or_path 'matsuo-lab/weblab-10b' --data_path 'alpaca_data.json' --bf16 'False' --output_dir '/home/acb11899xv/shared/temp_output' --num_train_epochs '1' --per_device_train_batch_size '1' --per_device_eval_batch_size '1' --gradient_accumulation_steps '1' --gradient_checkpointing --evaluation_strategy 'no' --save_strategy 'steps' --save_steps '5000' --save_total_limit '4' --learning_rate '4e-6' --weight_decay '0.' --warmup_ratio '0.03' --logging_steps '1' --deepspeed './configs/default_offload_opt_param_v7.json' --cache_dir '/home/acb11899xv/shared/hf_cache/' --tf32 'False' --model_max_length '1024'
g0153: [2023-09-19 18:56:15,252] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

Expected behavior The existing value of PDSH_SSH_ARGS_APPEND is preserved (and appended to if necessary). With the example setup above, the ssh commands should execute with extra verbosity.
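The difference between the two behaviors can be illustrated with a minimal sketch (the function names and argument are hypothetical, not DeepSpeed's actual code; the real assignment lives in multinode_runner.py as linked above):

```python
def build_env_buggy(extra_ssh_args: str, env: dict) -> dict:
    # Current behavior: plain assignment discards any value the user exported,
    # e.g. a pre-set "-vv" is lost.
    env["PDSH_SSH_ARGS_APPEND"] = extra_ssh_args
    return env


def build_env_fixed(extra_ssh_args: str, env: dict) -> dict:
    # Expected behavior: append to the existing value, preserving user settings.
    existing = env.get("PDSH_SSH_ARGS_APPEND", "")
    env["PDSH_SSH_ARGS_APPEND"] = f"{existing} {extra_ssh_args}".strip()
    return env
```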

After manually changing the = to a += in the source code:

PDSH_SSH_ARGS_APPEND=-vv
[2023-09-20 16:16:41,048] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-20 16:16:43,417] [INFO] [multinode_runner.py:72:get_cmd] Running on the following workers: g0379,g0380
[2023-09-20 16:16:43,417] [INFO] [runner.py:570:main] cmd = pdsh -S -f 1024 -w g0379,g0380 export NCCL_ROOT_DIR_modshare=/apps/nccl/2.12.12-1/cuda11.7:1; export NCCL_HOME_modshare=/apps/nccl/2.12.12-1/cuda11.7:1; export NCCL_HOME=/apps/nccl/2.12.12-1/cuda11.7; export NCCL_ROOT_DIR=/apps/nccl/2.12.12-1/cuda11.7; export NCCL_DEBUG=WARNING; export NCCL_SOCKET_IFNAME=eno; export PYTHONPATH=/home/acb11899xv/stanford_alpaca_gptneox;  cd /home/acb11899xv/stanford_alpaca_gptneox; /home/acb11899xv/miniconda3/envs/deepspeed/bin/python -u -m deepspeed.launcher.launch --world_info=eyJnMDM3OSI6IFswLCAxLCAyLCAzXSwgImcwMzgwIjogWzAsIDEsIDIsIDNdfQ== --node_rank=%n --master_addr=g0379 --master_port=29500 train_v10.py --model_name_or_path 'matsuo-lab/weblab-10b' --data_path 'alpaca_data.json' --bf16 'False' --output_dir '/home/acb11899xv/shared/temp_output' --num_train_epochs '1' --per_device_train_batch_size '1' --per_device_eval_batch_size '1' --gradient_accumulation_steps '1' --gradient_checkpointing --evaluation_strategy 'no' --save_strategy 'steps' --save_steps '5000' --save_total_limit '4' --learning_rate '4e-6' --weight_decay '0.' --warmup_ratio '0.03' --logging_steps '1' --deepspeed './configs/default_offload_opt_param_v7.json' --cache_dir '/home/acb11899xv/shared/hf_cache/' --tf32 'False' --model_max_length '1024'
g0379: OpenSSH_8.0p1, OpenSSL 1.1.1k  FIPS 25 Mar 2021
g0380: OpenSSH_8.0p1, OpenSSL 1.1.1k  FIPS 25 Mar 2021
g0379: debug1: Reading configuration data /etc/ssh/ssh_config
g0380: debug1: Reading configuration data /etc/ssh/ssh_config
g0379: debug1: Reading configuration data /etc/ssh/ssh_config.d/01-enable_keysign.conf
g0380: debug1: Reading configuration data /etc/ssh/ssh_config.d/01-enable_keysign.conf
g0379: debug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf
g0379: debug2: checking match for 'final all' host g0379 originally g0379
g0379: debug2: match not found
...

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/acb11899xv/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1
deepspeed install path ........... ['/home/acb11899xv/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 188.13 GB

Screenshots N/A

System info (please complete the following information):

Launcher context Launching with the DeepSpeed launcher:

deepspeed \
    --hostfile ./hostfile \
    --master_addr=$MASTER_ADDR \
    --launcher=pdsh \
    --ssh_port=2299 \
train_v10.py ...

Docker context N/A

Additional context

In the original pull request (https://github.com/microsoft/DeepSpeed/pull/4117), the code used a "+=" instead of a "=", so appending was likely the original intent. Alternatively, if environment variables set in the shell that launches DeepSpeed are not supposed to be preserved, that should be stated in the documentation.
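For reference, the appending pattern would look roughly like this sketch, which merges launcher-derived ssh arguments with whatever the user already exported (the launcher_args value here is only illustrative, e.g. derived from --ssh_port):

```python
import os

# Merge launcher-supplied ssh args with any user-exported value instead
# of overwriting it, so settings like "-vv" survive.
user_args = os.environ.get("PDSH_SSH_ARGS_APPEND", "")
launcher_args = "-p 2299"  # hypothetical, e.g. from --ssh_port=2299
os.environ["PDSH_SSH_ARGS_APPEND"] = f"{user_args} {launcher_args}".strip()
```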

loadams commented 12 months ago

Hi @asolano - good catch, I believe the original intention was +=, and it was lost when the code was moved to the other file. It seems like you've already tested this, but I've made a PR to correct the behavior here: https://github.com/microsoft/DeepSpeed/pull/4373

asolano commented 12 months ago

@loadams Good to hear 👍 (and thanks for the quick response).