microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
35.16k stars 4.07k forks

[BUG] no node_rank is passed in PDSHRunner #1523

Closed by chunyang-wen 2 years ago

chunyang-wen commented 2 years ago

Describe the bug: In PDSHRunner, why do we pass node_rank=%n instead of an actual node rank?


jeffra commented 2 years ago

Hi @chunyang-wen, thanks for your question. This is a bit subtle and should definitely be better documented here since it's not obvious when reading the code.

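When the launch command is run through pdsh, pdsh itself replaces %n in the remote command with the node rank (each host's position in the target list, starting at 0), so every node receives its own --node_rank value. See: https://linux.die.net/man/1/pdsh.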
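
To make the substitution concrete, here is a minimal Python sketch that imitates what pdsh does to the remote command for each host. It is not DeepSpeed or pdsh code: the function name expand_for_host, the host names, and the shortened command template are all hypothetical, and the handling of %h and %% follows my reading of the pdsh man page linked above.

```python
def expand_for_host(command: str, rank: int, hostname: str) -> str:
    """Imitate pdsh's per-host conversion specifiers (illustrative only).

    %n -> rank of the host in the target list (0-based)
    %h -> target hostname
    %% -> a literal percent sign
    """
    out = []
    i = 0
    while i < len(command):
        ch = command[i]
        if ch == "%" and i + 1 < len(command):
            nxt = command[i + 1]
            if nxt == "n":
                out.append(str(rank))
                i += 2
                continue
            if nxt == "h":
                out.append(hostname)
                i += 2
                continue
            if nxt == "%":
                out.append("%")
                i += 2
                continue
        # Any other character (or an unknown specifier) is copied unchanged.
        out.append(ch)
        i += 1
    return "".join(out)


# Hypothetical host list and a shortened launch command for illustration.
hosts = ["worker-0", "worker-1"]
template = "python -u -m deepspeed.launcher.launch --node_rank=%n train.py"

expanded = [expand_for_host(template, rank, host) for rank, host in enumerate(hosts)]
for host, cmd in zip(hosts, expanded):
    print(f"{host}: {cmd}")
```

Each host ends up with a different --node_rank value even though the launcher builds only one command string, which is why the literal %n appears in the PDSHRunner code.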

chunyang-wen commented 2 years ago

Thanks for your explanation.

hanjiemicro commented 2 years ago

Hi, the "%n" was not replaced when the -w option lists only one node. How should I deal with this issue?

cmd = pdsh -f 1024 -w 192.168.100.101 cd /home/hy/hanjie/CPM-2-Pretrain/src; /home/hy/hanjie/anaconda3/bin/python -u -m deepspeed.launcher.launch --world_info=eyIxOTIuMTY4LjEwMC4xMDEiOiBbMCwgMV19 --node_rank=%n --master_addr=172.18.222.210 --master_port=29500 /home/hy/hanjie/CPM-2-Pretrain/src/pretrain_enc_dec.py --model-config '/home/hy/hanjie/CPM-2-Pretrain/src/configs/model/enc_dec_small_config.json' --model-parallel-size '1' --batch-size '4' --enc-seq-length '32' --dec-seq-length '32' --train-iters '1' --save '/home/hy/hanjie/CPM-2-Pretrain/results/' --no-save-optim --log-file '/home/hy/hanjie/CPM-2-Pretrain/results//log.txt' --gradient-accumulation-steps '64' --data-path '/home/hy/hanjie/CPM-2-Pretrain/src/pretrain_data/wudao_corpus_document' --split '100,1,1' --lr '0.0001' --no-load-optim --weight-decay '1e-2' --clip-grad '1.0' --warmup '0.01' --tokenizer-path '/home/hy/hanjie/CPM-2-Pretrain/bpe_cn_en' --save-interval '100' --eval-interval '1' --eval-iters '1' --log-interval '1' --deepspeed --deepspeed_config '/home/hy/hanjie/CPM-2-Pretrain/src/configs/model/ds_batch_config.json' --cpu-optimizer --cpu_torch_adam
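
For anyone debugging a command like the one above, the --world_info value is base64-encoded JSON and can be decoded to confirm how many hosts the launcher sees. The sketch below does that, then shows one possible workaround: filling in rank 0 by hand when there is exactly one host. The helper names are hypothetical, the workaround is an assumption rather than a fix confirmed in this thread, and it does not explain why pdsh left %n unexpanded.

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode the base64 JSON blob passed as --world_info."""
    return json.loads(base64.urlsafe_b64decode(encoded))


def resolve_node_rank(remote_cmd: str, world_info: dict) -> str:
    """Hypothetical workaround: with a single host, its rank can only be 0,
    so substitute it directly instead of relying on pdsh to expand %n."""
    if len(world_info) == 1:
        return remote_cmd.replace("--node_rank=%n", "--node_rank=0")
    # With several hosts, leave %n in place for pdsh to expand per host.
    return remote_cmd


# The --world_info value copied from the command above.
world_info = decode_world_info("eyIxOTIuMTY4LjEwMC4xMDEiOiBbMCwgMV19")
print(world_info)  # {'192.168.100.101': [0, 1]}

# Shortened stand-in for the remote part of the command.
remote_cmd = "python -u -m deepspeed.launcher.launch --node_rank=%n --master_port=29500"
print(resolve_node_rank(remote_cmd, world_info))
```

The decoded value shows a single host, 192.168.100.101, with two entries (0 and 1), so node rank 0 is the only value %n could have taken here.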

fabiogeraci commented 4 days ago

Should we use the %% notation?