bing0037 closed this issue 10 months ago
@bing0037, did you verify that those paths are valid in both containers?
Thanks for your reply.
docker run -it --gpus all -v /home/libn/DeepSpeedExamples:/workspace deepspeed:latest
cd /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples
bash ds_pretrain_gpt2-zero3.sh
root@15418839153d:/workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples# bash ds_pretrain_gpt2-zero3.sh
deepspeed --include=localhost:1,2,6,7 ../pretrain_gpt2.py --model-parallel-size 2 --num-layers 5 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 320000 --lr-decay-iters 320000 --save ../gpt2_checkpoints/gpt2_ds_zero3 --data-path ../my-gpt2_text_document --vocab-file ../pretrain_models/gpt2/gpt2-vocab.json --merge-file ../pretrain_models/gpt2/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-activations --log-interval 1 --save-interval 10000 --eval-interval 2000 --eval-iters 10 --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json --zero-stage 3 --zero-reduce-bucket-size 50000000 --zero-allgather-bucket-size 5000000000 --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers 1 --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
[2023-03-03 02:20:20,137] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-03 02:20:20,970] [INFO] [runner.py:360:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMSwgMiwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 ../pretrain_gpt2.py --model-parallel-size 2 --num-layers 5 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 320000 --lr-decay-iters 320000 --save ../gpt2_checkpoints/gpt2_ds_zero3 --data-path ../my-gpt2_text_document --vocab-file ../pretrain_models/gpt2/gpt2-vocab.json --merge-file ../pretrain_models/gpt2/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-activations --log-interval 1 --save-interval 10000 --eval-interval 2000 --eval-iters 10 --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json --zero-stage 3 --zero-reduce-bucket-size 50000000 --zero-allgather-bucket-size 5000000000 --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers 1 --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
[2023-03-03 02:20:21,988] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.7.8
[2023-03-03 02:20:21,988] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [1, 2, 6, 7]}
[2023-03-03 02:20:21,988] [INFO] [launch.py:89:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-03-03 02:20:21,988] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-03-03 02:20:21,988] [INFO] [launch.py:102:main] dist_world_size=4
[2023-03-03 02:20:21,988] [INFO] [launch.py:105:main] Setting CUDA_VISIBLE_DEVICES=1,2,6,7
using world size: 4 and model-parallel size: 2
using torch.float16 for parameters ...
-------------------- arguments --------------------
adam_beta1 ...................... 0.9
adam_beta2 ...................... 0.999
adam_eps ........................ 1e-08
adlr_autoresume ................. False
adlr_autoresume_interval ........ 1000
apply_query_key_layer_scaling ... False
apply_residual_connection_post_layernorm False
attention_dropout ............... 0.1
attention_softmax_in_fp32 ....... False
batch_size ...................... 4
bert_load ....................... None
bias_dropout_fusion ............. False
bias_gelu_fusion ................ False
block_data_path ................. None
checkpoint_activations .......... True
checkpoint_in_cpu ............... True
checkpoint_num_layers ........... 1
clip_grad ....................... 1.0
contigious_checkpointing ........ True
cpu_optimizer ................... False
cpu_torch_adam .................. False
data_impl ....................... mmap
data_path ....................... ../my-gpt2_text_document
DDP_impl ........................ local
deepscale ....................... False
deepscale_config ................ None
deepspeed ....................... True
deepspeed_activation_checkpointing True
deepspeed_config ................ /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json
deepspeed_mpi ................... False
distribute_checkpointed_activations False
distributed_backend ............. nccl
dynamic_loss_scale .............. True
eod_mask_loss ................... False
eval_interval ................... 2000
eval_iters ...................... 10
exit_interval ................... None
faiss_use_gpu ................... False
finetune ........................ False
fp16 ............................ True
fp16_lm_cross_entropy ........... False
fp32_allreduce .................. False
hidden_dropout .................. 0.1
hidden_size ..................... 1024
hysteresis ...................... 2
ict_head_size ................... None
ict_load ........................ None
indexer_batch_size .............. 128
indexer_log_interval ............ 1000
init_method_std ................. 0.02
layernorm_epsilon ............... 1e-05
lazy_mpu_init ................... None
load ............................ None
local_rank ...................... 0
log_interval .................... 1
loss_scale ...................... None
loss_scale_window ............... 1000
lr .............................. 0.00015
lr_decay_iters .................. 320000
lr_decay_style .................. cosine
lr_decay_tokens ................. None
make_vocab_size_divisible_by .... 128
mask_prob ....................... 0.15
max_position_embeddings ......... 1024
memory_centric_tiled_linear ..... False
merge_file ...................... ../pretrain_models/gpt2/gpt2-merges.txt
min_lr .......................... 1e-05
min_scale ....................... 1
mmap_warmup ..................... False
model_parallel_size ............. 2
no_load_optim ................... False
no_load_rng ..................... False
no_save_optim ................... False
no_save_rng ..................... False
num_attention_heads ............. 16
num_layers ...................... 5
num_unique_layers ............... None
num_workers ..................... 2
onnx_safe ....................... None
openai_gelu ..................... False
override_lr_scheduler ........... False
param_sharing_style ............. grouped
params_dtype .................... torch.float16
partition_activations ........... True
profile_backward ................ False
query_in_block_prob ............. 0.1
rank ............................ 0
remote_device ................... none
report_topk_accuracies .......... []
reset_attention_mask ............ False
reset_position_ids .............. False
save ............................ ../gpt2_checkpoints/gpt2_ds_zero3
save_interval ................... 10000
scaled_masked_softmax_fusion .... False
scaled_upper_triang_masked_softmax_fusion False
scattered_embeddings ............ True
seed ............................ 1234
seq_length ...................... 1024
short_seq_prob .................. 0.1
split ........................... 949,50,1
split_transformers .............. True
synchronize_each_layer .......... True
tensorboard_dir ................. None
tile_factor ..................... 1
titles_data_path ................ None
tokenizer_type .................. GPT2BPETokenizer
tokens .......................... 0
train_iters ..................... 320000
train_tokens .................... None
use_checkpoint_lr_scheduler ..... False
use_cpu_initialization .......... False
use_one_sent_docs ............... False
use_pin_memory .................. False
vocab_file ...................... ../pretrain_models/gpt2/gpt2-vocab.json
warmup .......................... 0.01
warmup_iters .................... None
weight_decay .................... 0.01
world_size ...................... 4
zero_allgather_bucket_size ...... 5000000000
zero_contigious_gradients ....... True
zero_reduce_bucket_size ......... 50000000
zero_reduce_scatter ............. True
zero_stage ...................... 3
---------------- end of arguments ----------------
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)
> initializing torch distributed ...
> initializing model parallel with size 2
> setting random seeds to 1234 ...
[2023-03-03 02:20:24,686] [INFO] [checkpointing.py:231:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2023-03-03 02:20:28,050] [INFO] [utils.py:588:see_memory_usage] Before Building Model
/opt/conda/lib/python3.7/site-packages/torch/cuda/memory.py:386: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
FutureWarning)
/opt/conda/lib/python3.7/site-packages/torch/cuda/memory.py:394: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
FutureWarning)
[2023-03-03 02:20:28,051] [INFO] [utils.py:593:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2023-03-03 02:20:28,051] [INFO] [utils.py:598:see_memory_usage] CPU Virtual Memory: used = 20.86 GB, percent = 2.8%
> number of parameters on model parallel rank 1 0.117 Billion
[2023-03-03 02:20:28,734] [INFO] [utils.py:588:see_memory_usage] After Building Model
[2023-03-03 02:20:28,735] [INFO] [utils.py:593:see_memory_usage] MA 0.06 GB Max_MA 0.07 GB CA 0.08 GB Max_CA 0 GB
[2023-03-03 02:20:28,735] [INFO] [utils.py:598:see_memory_usage] CPU Virtual Memory: used = 21.01 GB, percent = 2.8%
> number of parameters on model parallel rank 0 0.117 Billion
> learning rate decay style: cosine
DeepSpeed is enabled.
[2023-03-03 02:20:28,736] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.3, git-hash=unknown, git-branch=unknown
[2023-03-03 02:20:28,740] [INFO] [engine.py:180:__init__] DeepSpeed Flops Profiler Enabled: True
[2023-03-03 02:20:28,740] [INFO] [engine.py:700:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2023-03-03 02:20:28,740] [INFO] [engine.py:704:_configure_optimizer] Using client Optimizer as basic optimizer
[2023-03-03 02:20:28,740] [INFO] [engine.py:714:_configure_optimizer] DeepSpeed Basic Optimizer = AdamW
[2023-03-03 02:20:28,740] [INFO] [utils.py:44:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-03-03 02:20:28,740] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2023-03-03 02:20:28,740] [INFO] [engine.py:938:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2023-03-03 02:20:28,743] [INFO] [stage3.py:633:__init__] Reduce bucket size 10000000.0
[2023-03-03 02:20:28,743] [INFO] [stage3.py:634:__init__] Allgather bucket size 10000000.0
[2023-03-03 02:20:30,016] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | init_optimizer_state: 7.66
[2023-03-03 02:20:30,016] [INFO] [stage3.py:825:__init__] optimizer state initialized
[2023-03-03 02:20:30,032] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-03-03 02:20:30,032] [INFO] [engine.py:516:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2023-03-03 02:20:30,033] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x7fdea6b68d50>
[2023-03-03 02:20:30,033] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-03-03 02:20:30,033] [INFO] [config.py:900:print] DeepSpeedEngine configuration:
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] allreduce_always_fp32 ........ False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] amp_enabled .................. False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] amp_params ................... False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] checkpoint_tag_validation_enabled True
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] checkpoint_tag_validation_fail False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] disable_allgather ............ False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] dump_state ................... False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_enabled ........... False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_gas_boundary_resolution 1
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_layer_num ......... 0
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_max_iter .......... 100
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_stability ......... 1e-06
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_tol ............... 0.01
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_verbose ........... False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] elasticity_enabled ........... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] flops_profiler_config ........ {
"enabled": true,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] fp16_enabled ................. True
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] fp16_mixed_quantize .......... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] global_rank .................. 0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] gradient_accumulation_steps .. 1
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] gradient_clipping ............ 1.0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] gradient_predivide_factor .... 1.0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] initial_dynamic_scale ........ 4294967296
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] loss_scale ................... 0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] memory_breakdown ............. False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] optimizer_legacy_fusion ...... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] optimizer_name ............... None
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] optimizer_params ............. None
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] pld_enabled .................. False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] pld_params ................... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] prescale_gradients ........... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_change_rate ......... 0.001
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_groups .............. 1
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_offset .............. 1000
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_period .............. 1000
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_rounding ............ 0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_start_bits .......... 16
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_target_bits ......... 8
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_training_enabled .... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_type ................ 0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_verbose ............. False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] scheduler_name ............... None
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] scheduler_params ............. None
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] sparse_attention ............. None
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] sparse_gradients_enabled ..... False
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] steps_per_print .............. 1
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] tensorboard_enabled .......... False
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] tensorboard_job_name ......... DeepSpeedJobName
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] tensorboard_output_path ......
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] train_batch_size ............. 64
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] train_micro_batch_size_per_gpu 32
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] use_quantizer_kernel ......... False
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] wall_clock_breakdown ......... True
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] world_size ................... 2
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] zero_allow_untested_optimizer False
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] zero_config .................. {
"stage": 3,
"contiguous_gradients": true,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+07,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+09,
"prefetch_bucket_size": 1.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_fp16_weights_on_model_save": false,
"ignore_unused_parameters": true,
"legacy_stage1": false
}
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] zero_enabled ................. True
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] zero_optimization_stage ...... 3
[2023-03-03 02:20:30,035] [INFO] [config.py:911:print] json = {
"train_batch_size": 64,
"gradient_accumulation_steps": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 1.000000e+09,
"stage3_max_reuse_distance": 1.000000e+09,
"stage3_prefetch_bucket_size": 1.000000e+07,
"stage3_param_persistence_threshold": 1.000000e+05,
"reduce_bucket_size": 1.000000e+07,
"contiguous_gradients": true
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"comms_logger": {
"enabled": true,
"verbose": true,
"prof_all": false,
"debug": false,
"prof_ops": ["all_reduce", "all_gather"]
},
"zero_allow_untested_optimizer": false,
"flops_profiler": {
"enabled": true,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
}
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 2560000
validation: 12880
test: 80
> building train, validation, and test datasets for GPT2 ...
> building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
> finished creating indexed dataset in 0.000291 seconds
number of documents: 2761
> dataset split:
train:
document indices in [0, 2620) total of 2620 documents
validation:
document indices in [2620, 2758) total of 138 documents
test:
document indices in [2758, 2761) total of 3 documents
> loading doc-idx mapping from ../my-gpt2_text_document_train_indexmap_2560000ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_train_indexmap_2560000ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_train_indexmap_2560000ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 2560022
total number of epochs: 16395
> loading doc-idx mapping from ../my-gpt2_text_document_valid_indexmap_12880ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_valid_indexmap_12880ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_valid_indexmap_12880ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 12884
total number of epochs: 1784
> loading doc-idx mapping from ../my-gpt2_text_document_test_indexmap_80ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_test_indexmap_80ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_test_indexmap_80ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 81
total number of epochs: 97
> finished creating GPT2 datasets ...
setting training data start iteration to 0
setting validation data start iteration to 0
done with setups ...
time (ms) | model and optimizer: 2032.63 | train/valid/test data iterators: 897.19
training ...
docker run -it --gpus all -v /home/libn/code/Distributed/DeepSpeedExamples:/workspace deepspeed:latest
cd /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples
bash ds_pretrain_gpt2-zero3.sh
root@77b68329c514:/workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples# bash ds_pretrain_gpt2-zero3.sh
deepspeed --num_nodes 1 --num_gpus 1 ../pretrain_gpt2.py --model-parallel-size 1 --num-layers 5 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 320000 --lr-decay-iters 320000 --save ../gpt2_checkpoints/gpt2_ds_zero3 --data-path ../my-gpt2_text_document --vocab-file ../pretrain_models/gpt2/gpt2-vocab.json --merge-file ../pretrain_models/gpt2/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-activations --log-interval 1 --save-interval 10000 --eval-interval 2000 --eval-iters 10 --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json --zero-stage 3 --zero-reduce-bucket-size 50000000 --zero-allgather-bucket-size 5000000000 --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers 1 --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
[2023-03-03 02:21:54,970] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-03 02:21:56,868] [INFO] [runner.py:360:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 ../pretrain_gpt2.py --model-parallel-size 1 --num-layers 5 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 320000 --lr-decay-iters 320000 --save ../gpt2_checkpoints/gpt2_ds_zero3 --data-path ../my-gpt2_text_document --vocab-file ../pretrain_models/gpt2/gpt2-vocab.json --merge-file ../pretrain_models/gpt2/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-activations --log-interval 1 --save-interval 10000 --eval-interval 2000 --eval-iters 10 --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json --zero-stage 3 --zero-reduce-bucket-size 50000000 --zero-allgather-bucket-size 5000000000 --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers 1 --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
[2023-03-03 02:21:58,249] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.7.8
[2023-03-03 02:21:58,249] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0]}
[2023-03-03 02:21:58,250] [INFO] [launch.py:89:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-03-03 02:21:58,250] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-03-03 02:21:58,250] [INFO] [launch.py:102:main] dist_world_size=1
[2023-03-03 02:21:58,250] [INFO] [launch.py:105:main] Setting CUDA_VISIBLE_DEVICES=0
using world size: 1 and model-parallel size: 1
using torch.float16 for parameters ...
-------------------- arguments --------------------
adam_beta1 ...................... 0.9
adam_beta2 ...................... 0.999
adam_eps ........................ 1e-08
adlr_autoresume ................. False
adlr_autoresume_interval ........ 1000
apply_query_key_layer_scaling ... False
apply_residual_connection_post_layernorm False
attention_dropout ............... 0.1
attention_softmax_in_fp32 ....... False
batch_size ...................... 4
bert_load ....................... None
bias_dropout_fusion ............. False
bias_gelu_fusion ................ False
block_data_path ................. None
checkpoint_activations .......... True
checkpoint_in_cpu ............... True
checkpoint_num_layers ........... 1
clip_grad ....................... 1.0
contigious_checkpointing ........ True
cpu_optimizer ................... False
cpu_torch_adam .................. False
data_impl ....................... mmap
data_path ....................... ../my-gpt2_text_document
DDP_impl ........................ local
deepscale ....................... False
deepscale_config ................ None
deepspeed ....................... True
deepspeed_activation_checkpointing True
deepspeed_config ................ /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json
deepspeed_mpi ................... False
distribute_checkpointed_activations False
distributed_backend ............. nccl
dynamic_loss_scale .............. True
eod_mask_loss ................... False
eval_interval ................... 2000
eval_iters ...................... 10
exit_interval ................... None
faiss_use_gpu ................... False
finetune ........................ False
fp16 ............................ True
fp16_lm_cross_entropy ........... False
fp32_allreduce .................. False
hidden_dropout .................. 0.1
hidden_size ..................... 1024
hysteresis ...................... 2
ict_head_size ................... None
ict_load ........................ None
indexer_batch_size .............. 128
indexer_log_interval ............ 1000
init_method_std ................. 0.02
layernorm_epsilon ............... 1e-05
lazy_mpu_init ................... None
load ............................ None
local_rank ...................... 0
log_interval .................... 1
loss_scale ...................... None
loss_scale_window ............... 1000
lr .............................. 0.00015
lr_decay_iters .................. 320000
lr_decay_style .................. cosine
lr_decay_tokens ................. None
make_vocab_size_divisible_by .... 128
mask_prob ....................... 0.15
max_position_embeddings ......... 1024
memory_centric_tiled_linear ..... False
merge_file ...................... ../pretrain_models/gpt2/gpt2-merges.txt
min_lr .......................... 1e-05
min_scale ....................... 1
mmap_warmup ..................... False
model_parallel_size ............. 1
no_load_optim ................... False
no_load_rng ..................... False
no_save_optim ................... False
no_save_rng ..................... False
num_attention_heads ............. 16
num_layers ...................... 5
num_unique_layers ............... None
num_workers ..................... 2
onnx_safe ....................... None
openai_gelu ..................... False
override_lr_scheduler ........... False
param_sharing_style ............. grouped
params_dtype .................... torch.float16
partition_activations ........... True
profile_backward ................ False
query_in_block_prob ............. 0.1
rank ............................ 0
remote_device ................... none
report_topk_accuracies .......... []
reset_attention_mask ............ False
reset_position_ids .............. False
save ............................ ../gpt2_checkpoints/gpt2_ds_zero3
save_interval ................... 10000
scaled_masked_softmax_fusion .... False
scaled_upper_triang_masked_softmax_fusion False
scattered_embeddings ............ True
seed ............................ 1234
seq_length ...................... 1024
short_seq_prob .................. 0.1
split ........................... 949,50,1
split_transformers .............. True
synchronize_each_layer .......... True
tensorboard_dir ................. None
tile_factor ..................... 1
titles_data_path ................ None
tokenizer_type .................. GPT2BPETokenizer
tokens .......................... 0
train_iters ..................... 320000
train_tokens .................... None
use_checkpoint_lr_scheduler ..... False
use_cpu_initialization .......... False
use_one_sent_docs ............... False
use_pin_memory .................. False
vocab_file ...................... ../pretrain_models/gpt2/gpt2-vocab.json
warmup .......................... 0.01
warmup_iters .................... None
weight_decay .................... 0.01
world_size ...................... 1
zero_allgather_bucket_size ...... 5000000000
zero_contigious_gradients ....... True
zero_reduce_bucket_size ......... 50000000
zero_reduce_scatter ............. True
zero_stage ...................... 3
---------------- end of arguments ----------------
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
[2023-03-03 02:22:00,223] [INFO] [checkpointing.py:231:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2023-03-03 02:22:03,259] [INFO] [utils.py:588:see_memory_usage] Before Building Model
/opt/conda/lib/python3.7/site-packages/torch/cuda/memory.py:386: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
FutureWarning)
/opt/conda/lib/python3.7/site-packages/torch/cuda/memory.py:394: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
FutureWarning)
[2023-03-03 02:22:03,260] [INFO] [utils.py:593:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2023-03-03 02:22:03,260] [INFO] [utils.py:598:see_memory_usage] CPU Virtual Memory: used = 11.13 GB, percent = 1.5%
[2023-03-03 02:22:03,485] [INFO] [utils.py:588:see_memory_usage] After Building Model
[2023-03-03 02:22:03,486] [INFO] [utils.py:593:see_memory_usage] MA 0.22 GB Max_MA 0.22 GB CA 0.26 GB Max_CA 0 GB
[2023-03-03 02:22:03,486] [INFO] [utils.py:598:see_memory_usage] CPU Virtual Memory: used = 11.17 GB, percent = 1.5%
> number of parameters on model parallel rank 0 0.116 Billion
> learning rate decay style: cosine
DeepSpeed is enabled.
[2023-03-03 02:22:03,487] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.3, git-hash=unknown, git-branch=unknown
[2023-03-03 02:22:03,491] [INFO] [engine.py:180:__init__] DeepSpeed Flops Profiler Enabled: False
[2023-03-03 02:22:03,491] [INFO] [engine.py:700:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2023-03-03 02:22:03,491] [INFO] [engine.py:704:_configure_optimizer] Using client Optimizer as basic optimizer
[2023-03-03 02:22:03,491] [INFO] [engine.py:714:_configure_optimizer] DeepSpeed Basic Optimizer = AdamW
[2023-03-03 02:22:03,491] [INFO] [utils.py:44:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-03-03 02:22:03,491] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2023-03-03 02:22:03,491] [INFO] [engine.py:938:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2023-03-03 02:22:03,494] [INFO] [stage3.py:633:__init__] Reduce bucket size 10000000.0
[2023-03-03 02:22:03,494] [INFO] [stage3.py:634:__init__] Allgather bucket size 10000000.0
[2023-03-03 02:22:03,731] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | init_optimizer_state: 15.07
[2023-03-03 02:22:03,731] [INFO] [stage3.py:825:__init__] optimizer state initialized
[2023-03-03 02:22:03,748] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-03-03 02:22:03,748] [INFO] [engine.py:516:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2023-03-03 02:22:03,748] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x7f44ae4a6dd0>
[2023-03-03 02:22:03,748] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-03-03 02:22:03,748] [INFO] [config.py:900:print] DeepSpeedEngine configuration:
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] allreduce_always_fp32 ........ False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] amp_enabled .................. False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] amp_params ................... False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] checkpoint_tag_validation_enabled True
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] checkpoint_tag_validation_fail False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] disable_allgather ............ False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] dump_state ................... False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_enabled ........... False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_gas_boundary_resolution 1
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_layer_num ......... 0
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_max_iter .......... 100
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_stability ......... 1e-06
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_tol ............... 0.01
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_verbose ........... False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] elasticity_enabled ........... False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] fp16_enabled ................. True
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] fp16_mixed_quantize .......... False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] global_rank .................. 0
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] gradient_accumulation_steps .. 1
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] gradient_clipping ............ 1.0
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] gradient_predivide_factor .... 1.0
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] initial_dynamic_scale ........ 4294967296
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] loss_scale ................... 0
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] memory_breakdown ............. False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] optimizer_legacy_fusion ...... False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] optimizer_name ............... None
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] optimizer_params ............. None
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] pld_enabled .................. False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] pld_params ................... False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] prescale_gradients ........... False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] quantize_change_rate ......... 0.001
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] quantize_groups .............. 1
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] quantize_offset .............. 1000
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] quantize_period .............. 1000
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] quantize_rounding ............ 0
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] quantize_start_bits .......... 16
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] quantize_target_bits ......... 8
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] quantize_training_enabled .... False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] quantize_type ................ 0
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] quantize_verbose ............. False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] scheduler_name ............... None
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] scheduler_params ............. None
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] sparse_attention ............. None
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] sparse_gradients_enabled ..... False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] steps_per_print .............. 1
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] tensorboard_enabled .......... False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] tensorboard_job_name ......... DeepSpeedJobName
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] tensorboard_output_path ......
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] train_batch_size ............. 64
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] train_micro_batch_size_per_gpu 64
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] use_quantizer_kernel ......... False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] wall_clock_breakdown ......... True
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] world_size ................... 1
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] zero_allow_untested_optimizer False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] zero_config .................. {
"stage": 3,
"contiguous_gradients": true,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+07,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+09,
"prefetch_bucket_size": 1.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_fp16_weights_on_model_save": false,
"ignore_unused_parameters": true,
"legacy_stage1": false
}
[2023-03-03 02:22:03,752] [INFO] [config.py:904:print] zero_enabled ................. True
[2023-03-03 02:22:03,752] [INFO] [config.py:904:print] zero_optimization_stage ...... 3
[2023-03-03 02:22:03,752] [INFO] [config.py:911:print] json = {
"train_batch_size": 64,
"gradient_accumulation_steps": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 1.000000e+09,
"stage3_max_reuse_distance": 1.000000e+09,
"stage3_prefetch_bucket_size": 1.000000e+07,
"stage3_param_persistence_threshold": 1.000000e+05,
"reduce_bucket_size": 1.000000e+07,
"contiguous_gradients": true
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"zero_allow_untested_optimizer": false
}
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 1280000
validation: 6440
test: 40
> building train, validation, and test datasets for GPT2 ...
> building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
> finished creating indexed dataset in 0.000443 seconds
number of documents: 2761
> dataset split:
train:
document indices in [0, 2620) total of 2620 documents
validation:
document indices in [2620, 2758) total of 138 documents
test:
document indices in [2758, 2761) total of 3 documents
> loading doc-idx mapping from ../my-gpt2_text_document_train_indexmap_1280000ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_train_indexmap_1280000ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_train_indexmap_1280000ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 1280089
total number of epochs: 8198
> loading doc-idx mapping from ../my-gpt2_text_document_valid_indexmap_6440ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_valid_indexmap_6440ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_valid_indexmap_6440ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 6442
total number of epochs: 892
> loading doc-idx mapping from ../my-gpt2_text_document_test_indexmap_40ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_test_indexmap_40ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_test_indexmap_40ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 41
total number of epochs: 49
> finished creating GPT2 datasets ...
setting training data start iteration to 0
setting validation data start iteration to 0
done with setups ...
time (ms) | model and optimizer: 539.74 | train/valid/test data iterators: 285.37
training ...
On node1 (100.3.8.100), run ds_pretrain_gpt2-zero3.sh with the following hostfile (myhostfile):
100.3.8.100 slots=2
100.3.8.68 slots=2
root@15418839153d:/workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples# bash ds_pretrain_gpt2-zero3.sh
deepspeed --hostfile=myhostfile ../pretrain_gpt2.py --model-parallel-size 2 --num-layers 5 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 320000 --lr-decay-iters 320000 --save ../gpt2_checkpoints/gpt2_ds_zero3 --data-path ../my-gpt2_text_document --vocab-file ../pretrain_models/gpt2/gpt2-vocab.json --merge-file ../pretrain_models/gpt2/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-activations --log-interval 1 --save-interval 10000 --eval-interval 2000 --eval-iters 10 --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json --zero-stage 3 --zero-reduce-bucket-size 50000000 --zero-allgather-bucket-size 5000000000 --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers 1 --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
[2023-03-03 02:15:23,176] [INFO] [runner.py:293:main] Using IP address of 100.3.8.100 for node 100.3.8.100
[2023-03-03 02:15:23,178] [INFO] [multinode_runner.py:51:get_cmd] Running on the following workers: 100.3.8.100,100.3.8.68
[2023-03-03 02:15:23,178] [INFO] [runner.py:360:main] cmd = pdsh -f 1024 -w 100.3.8.100,100.3.8.68 export NCCL_VERSION=2.7.8; export PYTHONPATH=/workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples; cd /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples; /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyIxMDAuMy44LjEwMCI6IFswLCAxXSwgIjEwMC4zLjguNjgiOiBbMCwgMV19 --node_rank=%n --master_addr=100.3.8.100 --master_port=29500 ../pretrain_gpt2.py --model-parallel-size '2' --num-layers '5' --hidden-size '1024' --num-attention-heads '16' --seq-length '1024' --max-position-embeddings '1024' --batch-size '4' --train-iters '320000' --lr-decay-iters '320000' --save '../gpt2_checkpoints/gpt2_ds_zero3' --data-path '../my-gpt2_text_document' --vocab-file '../pretrain_models/gpt2/gpt2-vocab.json' --merge-file '../pretrain_models/gpt2/gpt2-merges.txt' --data-impl 'mmap' --split '949,50,1' --distributed-backend 'nccl' --lr '1.5e-4' --lr-decay-style 'cosine' --min-lr '1.0e-5' --weight-decay '1e-2' --clip-grad '1.0' --warmup '0.01' --checkpoint-activations --log-interval '1' --save-interval '10000' --eval-interval '2000' --eval-iters '10' --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config '/workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json' --zero-stage '3' --zero-reduce-bucket-size '50000000' --zero-allgather-bucket-size '5000000000' --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers '1' --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
100.3.8.100: bash: line 0: cd: /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples: No such file or directory
100.3.8.100: bash: /opt/conda/bin/python: No such file or directory
pdsh@15418839153d: 100.3.8.100: ssh exited with exit code 127
100.3.8.68: /etc/profile.d/lang.sh: line 19: warning: setlocale: LC_CTYPE: cannot change locale (C.UTF-8)
100.3.8.68: bash: line 0: cd: /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples: No such file or directory
100.3.8.68: bash: /opt/conda/bin/python: No such file or directory
pdsh@15418839153d: 100.3.8.68: ssh exited with exit code 127
Hi @tjruwase, do you have any suggestions?
Hi @bing0037, I did something similar before (testing ZeRO in a multi-node Docker environment). Here are some thoughts and suggestions:
- The server (outside the container, hereinafter "Host") and the Docker environment (inside the container, hereinafter "Container") can be treated as two completely independent environments, each with its own operating system; you can even think of the Container as a virtual machine. That is imprecise, but it makes things easier to understand.
- I guess the IPs you mentioned (100.3.8.XXX) are Host IPs. When you start a Docker container without specifying the network parameter (--net), it defaults to the Bridge network. At that point there are two IPs: the Host IP (100.3.8.XXX) and the Container IP (typically 172.XXX.XXX.XXX), corresponding to the two independent environments above. So when you write the DeepSpeed hostfile, you should actually use the Container IPs instead of the Host IPs. With a Host IP, DeepSpeed establishes an ssh connection to the Host and tries to read files and execute scripts there, but the Host has no /workspace and no /opt/conda/bin/python (those paths exist only in the Container), which is exactly the error message you got. However, just putting the current Container IPs in the hostfile still won't work; keep reading.
- As you know, DeepSpeed's distributed training uses ssh, which requires the nodes to be able to reach each other (in short, to ping each other). At present you start the containers on the Bridge network, so the two containers cannot ping each other (both sit behind 172.XXX.XXX.XXX addresses). There are roughly two solutions:
  - Continue to use the Bridge network. Then you need to manually configure routing tables; you can try it, but I don't recommend it.
  - Use an Overlay network. The official summary: "Overlay networks are best when you need containers running on different Docker hosts to communicate, or when multiple applications work together using swarm services." Obviously the Overlay network fits this scenario better; for detailed configuration instructions, please refer to the official documentation. With it, a typical Container IP is 10.0.XXX.XXX, and containers on different nodes can communicate (see the hostfile sketch just after this list), but ssh also requires the ssh service to be installed and running in each container.
- A typical GPU cluster uses InfiniBand to accelerate communication between nodes; I'm not sure whether your cluster has it. If it does, add --privileged when starting the container so that InfiniBand is available inside it. This worked for me; of course, it requires InfiniBand to be working on the Host in the first place.
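To make the hostfile fix concrete, here is a minimal sketch, assuming both containers are attached to a shared Overlay network and received the hypothetical addresses 10.0.0.2 and 10.0.0.3:
10.0.0.2 slots=2
10.0.0.3 slots=2
Each entry is a Container IP that the other container can actually ping, and slots matches the number of GPUs to use on that node.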
FYI, this is how I use docker while testing.
Start the container:
docker run --gpus all --name AAAAA --net=BBBBB -itd --privileged -v CCCCC DDDDD bash
AAAAA: name of the docker container; BBBBB: name of the created Overlay network; CCCCC: volume mount path; DDDDD: name of the docker image
Enter the container:
docker exec -it AAAAA bash
Exit the container (it keeps running after you exit, unless you execute docker container stop AAAAA):
exit
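For example, filling in the placeholders with the mount path and image from your own docker run command above (the container name ds-node0 and network name ds-overlay are made up):
docker run --gpus all --name ds-node0 --net=ds-overlay -itd --privileged -v /home/libn/DeepSpeedExamples:/workspace deepspeed:latest bash
docker exec -it ds-node0 bash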
The above is based on the information you provided and my own experience; there may be inaccuracies or mistakes. 😂
Additional materials: see the overlay-network documentation linked further below.
Thanks for your answer; it seems like a reasonable way to solve the problem. But I wonder: is there any way to run the commands (differing in rank/role/port) in each Docker environment and then have them start communicating, so that ssh is not required, like TensorFlow's distributed training? That way I could run the distributed training on Kubernetes.
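For reference, one possible direction (a sketch only, not verified against this exact DeepSpeed/Megatron version): DeepSpeed also works when its processes are started with the standard torch.distributed launcher, which rendezvouses over TCP on the master address instead of using ssh, so you can run one launch command per node yourself (e.g. from a Kubernetes pod spec). Assuming two containers with hypothetical IPs 10.0.0.2 (master) and 10.0.0.3, each with 2 GPUs:
# inside the container on node 0 (10.0.0.2):
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=10.0.0.2 --master_port=29500 ../pretrain_gpt2.py <same training and --deepspeed options as above>
# inside the container on node 1 (10.0.0.3):
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=10.0.0.2 --master_port=29500 ../pretrain_gpt2.py <same training and --deepspeed options as above>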
This seems like the best answer so far, and the issue is fairly stale, so I'm closing it for now. If folks have other suggestions, please post here; if you have other questions, please open a new issue so we can see it and reply. Thanks!
Official documentation for creating an overlay network: use-an-overlay-network-for-standalone-containers
With this network the two containers can communicate across machines; what remains is passwordless ssh login, environment configuration, and so on.
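A minimal sketch of that setup, using standard Docker swarm commands (the network name my-overlay is made up):
# on node 1 (100.3.8.100), initialize a swarm manager:
docker swarm init --advertise-addr 100.3.8.100
# on node 2 (100.3.8.68), join with the token printed by the previous command:
docker swarm join --token <token> 100.3.8.100:2377
# back on node 1, create an attachable overlay network:
docker network create --driver overlay --attachable my-overlay
Then start the container on each node with --net=my-overlay, as in the docker run template earlier in this thread.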
Problem: How to run ds_pretrain_gpt2-zero3.sh in a Docker environment using 2 nodes?
Current platform: 1) I have 2 nodes (100.3.8.100 & 100.3.8.68). Each node has 8 V100 GPUs, its Docker environment is fully set up, and I can run ds_pretrain_gpt2-zero3.sh locally and successfully using the following command:
2) I installed pdsh on each node: for 100.3.8.100, I installed pdsh inside the Docker environment (deepspeed:latest); for 100.3.8.68, I installed pdsh inside the Docker environment (deepspeed:latest) as well.
My modification: I want to use 2 nodes, so I modified ds_pretrain_gpt2-zero3.sh: 1) changed run_cmd to: run_cmd="deepspeed --hostfile=myhostfile pretrain_gpt2.py ${@:2} ${full_options}"; 2) added myhostfile.
My test and error: Test: on the 100.3.8.100 server, in the Docker env (deepspeed:latest).
Error:
Any suggestions? Thanks!