bing0037 closed this issue 10 months ago
@bing0037, did you verify that those paths are valid in both containers?
Thanks for your reply.
docker run -it --gpus all -v /home/libn/DeepSpeedExamples:/workspace deepspeed:latest
cd /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples
bash ds_pretrain_gpt2-zero3.sh
root@15418839153d:/workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples# bash ds_pretrain_gpt2-zero3.sh
deepspeed --include=localhost:1,2,6,7 ../pretrain_gpt2.py --model-parallel-size 2 --num-layers 5 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 320000 --lr-decay-iters 320000 --save ../gpt2_checkpoints/gpt2_ds_zero3 --data-path ../my-gpt2_text_document --vocab-file ../pretrain_models/gpt2/gpt2-vocab.json --merge-file ../pretrain_models/gpt2/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-activations --log-interval 1 --save-interval 10000 --eval-interval 2000 --eval-iters 10 --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json --zero-stage 3 --zero-reduce-bucket-size 50000000 --zero-allgather-bucket-size 5000000000 --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers 1 --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
[2023-03-03 02:20:20,137] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-03 02:20:20,970] [INFO] [runner.py:360:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMSwgMiwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 ../pretrain_gpt2.py --model-parallel-size 2 --num-layers 5 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 320000 --lr-decay-iters 320000 --save ../gpt2_checkpoints/gpt2_ds_zero3 --data-path ../my-gpt2_text_document --vocab-file ../pretrain_models/gpt2/gpt2-vocab.json --merge-file ../pretrain_models/gpt2/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-activations --log-interval 1 --save-interval 10000 --eval-interval 2000 --eval-iters 10 --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json --zero-stage 3 --zero-reduce-bucket-size 50000000 --zero-allgather-bucket-size 5000000000 --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers 1 --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
[2023-03-03 02:20:21,988] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.7.8
[2023-03-03 02:20:21,988] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [1, 2, 6, 7]}
[2023-03-03 02:20:21,988] [INFO] [launch.py:89:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-03-03 02:20:21,988] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-03-03 02:20:21,988] [INFO] [launch.py:102:main] dist_world_size=4
[2023-03-03 02:20:21,988] [INFO] [launch.py:105:main] Setting CUDA_VISIBLE_DEVICES=1,2,6,7
using world size: 4 and model-parallel size: 2
using torch.float16 for parameters ...
-------------------- arguments --------------------
adam_beta1 ...................... 0.9
adam_beta2 ...................... 0.999
adam_eps ........................ 1e-08
adlr_autoresume ................. False
adlr_autoresume_interval ........ 1000
apply_query_key_layer_scaling ... False
apply_residual_connection_post_layernorm False
attention_dropout ............... 0.1
attention_softmax_in_fp32 ....... False
batch_size ...................... 4
bert_load ....................... None
bias_dropout_fusion ............. False
bias_gelu_fusion ................ False
block_data_path ................. None
checkpoint_activations .......... True
checkpoint_in_cpu ............... True
checkpoint_num_layers ........... 1
clip_grad ....................... 1.0
contigious_checkpointing ........ True
cpu_optimizer ................... False
cpu_torch_adam .................. False
data_impl ....................... mmap
data_path ....................... ../my-gpt2_text_document
DDP_impl ........................ local
deepscale ....................... False
deepscale_config ................ None
deepspeed ....................... True
deepspeed_activation_checkpointing True
deepspeed_config ................ /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json
deepspeed_mpi ................... False
distribute_checkpointed_activations False
distributed_backend ............. nccl
dynamic_loss_scale .............. True
eod_mask_loss ................... False
eval_interval ................... 2000
eval_iters ...................... 10
exit_interval ................... None
faiss_use_gpu ................... False
finetune ........................ False
fp16 ............................ True
fp16_lm_cross_entropy ........... False
fp32_allreduce .................. False
hidden_dropout .................. 0.1
hidden_size ..................... 1024
hysteresis ...................... 2
ict_head_size ................... None
ict_load ........................ None
indexer_batch_size .............. 128
indexer_log_interval ............ 1000
init_method_std ................. 0.02
layernorm_epsilon ............... 1e-05
lazy_mpu_init ................... None
load ............................ None
local_rank ...................... 0
log_interval .................... 1
loss_scale ...................... None
loss_scale_window ............... 1000
lr .............................. 0.00015
lr_decay_iters .................. 320000
lr_decay_style .................. cosine
lr_decay_tokens ................. None
make_vocab_size_divisible_by .... 128
mask_prob ....................... 0.15
max_position_embeddings ......... 1024
memory_centric_tiled_linear ..... False
merge_file ...................... ../pretrain_models/gpt2/gpt2-merges.txt
min_lr .......................... 1e-05
min_scale ....................... 1
mmap_warmup ..................... False
model_parallel_size ............. 2
no_load_optim ................... False
no_load_rng ..................... False
no_save_optim ................... False
no_save_rng ..................... False
num_attention_heads ............. 16
num_layers ...................... 5
num_unique_layers ............... None
num_workers ..................... 2
onnx_safe ....................... None
openai_gelu ..................... False
override_lr_scheduler ........... False
param_sharing_style ............. grouped
params_dtype .................... torch.float16
partition_activations ........... True
profile_backward ................ False
query_in_block_prob ............. 0.1
rank ............................ 0
remote_device ................... none
report_topk_accuracies .......... []
reset_attention_mask ............ False
reset_position_ids .............. False
save ............................ ../gpt2_checkpoints/gpt2_ds_zero3
save_interval ................... 10000
scaled_masked_softmax_fusion .... False
scaled_upper_triang_masked_softmax_fusion False
scattered_embeddings ............ True
seed ............................ 1234
seq_length ...................... 1024
short_seq_prob .................. 0.1
split ........................... 949,50,1
split_transformers .............. True
synchronize_each_layer .......... True
tensorboard_dir ................. None
tile_factor ..................... 1
titles_data_path ................ None
tokenizer_type .................. GPT2BPETokenizer
tokens .......................... 0
train_iters ..................... 320000
train_tokens .................... None
use_checkpoint_lr_scheduler ..... False
use_cpu_initialization .......... False
use_one_sent_docs ............... False
use_pin_memory .................. False
vocab_file ...................... ../pretrain_models/gpt2/gpt2-vocab.json
warmup .......................... 0.01
warmup_iters .................... None
weight_decay .................... 0.01
world_size ...................... 4
zero_allgather_bucket_size ...... 5000000000
zero_contigious_gradients ....... True
zero_reduce_bucket_size ......... 50000000
zero_reduce_scatter ............. True
zero_stage ...................... 3
---------------- end of arguments ----------------
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 50257) with 175 dummy tokens (new size: 50432)
> initializing torch distributed ...
> initializing model parallel with size 2
> setting random seeds to 1234 ...
[2023-03-03 02:20:24,686] [INFO] [checkpointing.py:231:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2023-03-03 02:20:28,050] [INFO] [utils.py:588:see_memory_usage] Before Building Model
/opt/conda/lib/python3.7/site-packages/torch/cuda/memory.py:386: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
FutureWarning)
/opt/conda/lib/python3.7/site-packages/torch/cuda/memory.py:394: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
FutureWarning)
[2023-03-03 02:20:28,051] [INFO] [utils.py:593:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2023-03-03 02:20:28,051] [INFO] [utils.py:598:see_memory_usage] CPU Virtual Memory: used = 20.86 GB, percent = 2.8%
> number of parameters on model parallel rank 1 0.117 Billion
[2023-03-03 02:20:28,734] [INFO] [utils.py:588:see_memory_usage] After Building Model
[2023-03-03 02:20:28,735] [INFO] [utils.py:593:see_memory_usage] MA 0.06 GB Max_MA 0.07 GB CA 0.08 GB Max_CA 0 GB
[2023-03-03 02:20:28,735] [INFO] [utils.py:598:see_memory_usage] CPU Virtual Memory: used = 21.01 GB, percent = 2.8%
> number of parameters on model parallel rank 0 0.117 Billion
> learning rate decay style: cosine
DeepSpeed is enabled.
[2023-03-03 02:20:28,736] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.3, git-hash=unknown, git-branch=unknown
[2023-03-03 02:20:28,740] [INFO] [engine.py:180:__init__] DeepSpeed Flops Profiler Enabled: True
[2023-03-03 02:20:28,740] [INFO] [engine.py:700:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2023-03-03 02:20:28,740] [INFO] [engine.py:704:_configure_optimizer] Using client Optimizer as basic optimizer
[2023-03-03 02:20:28,740] [INFO] [engine.py:714:_configure_optimizer] DeepSpeed Basic Optimizer = AdamW
[2023-03-03 02:20:28,740] [INFO] [utils.py:44:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-03-03 02:20:28,740] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2023-03-03 02:20:28,740] [INFO] [engine.py:938:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2023-03-03 02:20:28,743] [INFO] [stage3.py:633:__init__] Reduce bucket size 10000000.0
[2023-03-03 02:20:28,743] [INFO] [stage3.py:634:__init__] Allgather bucket size 10000000.0
[2023-03-03 02:20:30,016] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | init_optimizer_state: 7.66
[2023-03-03 02:20:30,016] [INFO] [stage3.py:825:__init__] optimizer state initialized
[2023-03-03 02:20:30,032] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-03-03 02:20:30,032] [INFO] [engine.py:516:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2023-03-03 02:20:30,033] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x7fdea6b68d50>
[2023-03-03 02:20:30,033] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-03-03 02:20:30,033] [INFO] [config.py:900:print] DeepSpeedEngine configuration:
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] allreduce_always_fp32 ........ False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] amp_enabled .................. False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] amp_params ................... False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] checkpoint_tag_validation_enabled True
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] checkpoint_tag_validation_fail False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] disable_allgather ............ False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] dump_state ................... False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_enabled ........... False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_gas_boundary_resolution 1
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_layer_num ......... 0
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_max_iter .......... 100
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_stability ......... 1e-06
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_tol ............... 0.01
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] eigenvalue_verbose ........... False
[2023-03-03 02:20:30,033] [INFO] [config.py:904:print] elasticity_enabled ........... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] flops_profiler_config ........ {
"enabled": true,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] fp16_enabled ................. True
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] fp16_mixed_quantize .......... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] global_rank .................. 0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] gradient_accumulation_steps .. 1
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] gradient_clipping ............ 1.0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] gradient_predivide_factor .... 1.0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] initial_dynamic_scale ........ 4294967296
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] loss_scale ................... 0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] memory_breakdown ............. False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] optimizer_legacy_fusion ...... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] optimizer_name ............... None
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] optimizer_params ............. None
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] pld_enabled .................. False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] pld_params ................... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] prescale_gradients ........... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_change_rate ......... 0.001
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_groups .............. 1
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_offset .............. 1000
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_period .............. 1000
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_rounding ............ 0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_start_bits .......... 16
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_target_bits ......... 8
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_training_enabled .... False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_type ................ 0
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] quantize_verbose ............. False
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] scheduler_name ............... None
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] scheduler_params ............. None
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] sparse_attention ............. None
[2023-03-03 02:20:30,034] [INFO] [config.py:904:print] sparse_gradients_enabled ..... False
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] steps_per_print .............. 1
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] tensorboard_enabled .......... False
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] tensorboard_job_name ......... DeepSpeedJobName
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] tensorboard_output_path ......
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] train_batch_size ............. 64
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] train_micro_batch_size_per_gpu 32
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] use_quantizer_kernel ......... False
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] wall_clock_breakdown ......... True
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] world_size ................... 2
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] zero_allow_untested_optimizer False
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] zero_config .................. {
"stage": 3,
"contiguous_gradients": true,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+07,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+09,
"prefetch_bucket_size": 1.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_fp16_weights_on_model_save": false,
"ignore_unused_parameters": true,
"legacy_stage1": false
}
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] zero_enabled ................. True
[2023-03-03 02:20:30,035] [INFO] [config.py:904:print] zero_optimization_stage ...... 3
[2023-03-03 02:20:30,035] [INFO] [config.py:911:print] json = {
"train_batch_size": 64,
"gradient_accumulation_steps": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 1.000000e+09,
"stage3_max_reuse_distance": 1.000000e+09,
"stage3_prefetch_bucket_size": 1.000000e+07,
"stage3_param_persistence_threshold": 1.000000e+05,
"reduce_bucket_size": 1.000000e+07,
"contiguous_gradients": true
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"comms_logger": {
"enabled": true,
"verbose": true,
"prof_all": false,
"debug": false,
"prof_ops": ["all_reduce", "all_gather"]
},
"zero_allow_untested_optimizer": false,
"flops_profiler": {
"enabled": true,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
}
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 2560000
validation: 12880
test: 80
> building train, validation, and test datasets for GPT2 ...
> building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
> finished creating indexed dataset in 0.000291 seconds
number of documents: 2761
> dataset split:
train:
document indices in [0, 2620) total of 2620 documents
validation:
document indices in [2620, 2758) total of 138 documents
test:
document indices in [2758, 2761) total of 3 documents
> loading doc-idx mapping from ../my-gpt2_text_document_train_indexmap_2560000ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_train_indexmap_2560000ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_train_indexmap_2560000ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 2560022
total number of epochs: 16395
> loading doc-idx mapping from ../my-gpt2_text_document_valid_indexmap_12880ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_valid_indexmap_12880ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_valid_indexmap_12880ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 12884
total number of epochs: 1784
> loading doc-idx mapping from ../my-gpt2_text_document_test_indexmap_80ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_test_indexmap_80ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_test_indexmap_80ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 81
total number of epochs: 97
> finished creating GPT2 datasets ...
setting training data start iteration to 0
setting validation data start iteration to 0
done with setups ...
time (ms) | model and optimizer: 2032.63 | train/valid/test data iterators: 897.19
training ...
docker run -it --gpus all -v /home/libn/code/Distributed/DeepSpeedExamples:/workspace deepspeed:latest
cd /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples
bash ds_pretrain_gpt2-zero3.sh
root@77b68329c514:/workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples# bash ds_pretrain_gpt2-zero3.sh
deepspeed --num_nodes 1 --num_gpus 1 ../pretrain_gpt2.py --model-parallel-size 1 --num-layers 5 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 320000 --lr-decay-iters 320000 --save ../gpt2_checkpoints/gpt2_ds_zero3 --data-path ../my-gpt2_text_document --vocab-file ../pretrain_models/gpt2/gpt2-vocab.json --merge-file ../pretrain_models/gpt2/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-activations --log-interval 1 --save-interval 10000 --eval-interval 2000 --eval-iters 10 --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json --zero-stage 3 --zero-reduce-bucket-size 50000000 --zero-allgather-bucket-size 5000000000 --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers 1 --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
[2023-03-03 02:21:54,970] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-03-03 02:21:56,868] [INFO] [runner.py:360:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 ../pretrain_gpt2.py --model-parallel-size 1 --num-layers 5 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 320000 --lr-decay-iters 320000 --save ../gpt2_checkpoints/gpt2_ds_zero3 --data-path ../my-gpt2_text_document --vocab-file ../pretrain_models/gpt2/gpt2-vocab.json --merge-file ../pretrain_models/gpt2/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-activations --log-interval 1 --save-interval 10000 --eval-interval 2000 --eval-iters 10 --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json --zero-stage 3 --zero-reduce-bucket-size 50000000 --zero-allgather-bucket-size 5000000000 --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers 1 --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
[2023-03-03 02:21:58,249] [INFO] [launch.py:73:main] 0 NCCL_VERSION 2.7.8
[2023-03-03 02:21:58,249] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0]}
[2023-03-03 02:21:58,250] [INFO] [launch.py:89:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-03-03 02:21:58,250] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-03-03 02:21:58,250] [INFO] [launch.py:102:main] dist_world_size=1
[2023-03-03 02:21:58,250] [INFO] [launch.py:105:main] Setting CUDA_VISIBLE_DEVICES=0
using world size: 1 and model-parallel size: 1
using torch.float16 for parameters ...
-------------------- arguments --------------------
adam_beta1 ...................... 0.9
adam_beta2 ...................... 0.999
adam_eps ........................ 1e-08
adlr_autoresume ................. False
adlr_autoresume_interval ........ 1000
apply_query_key_layer_scaling ... False
apply_residual_connection_post_layernorm False
attention_dropout ............... 0.1
attention_softmax_in_fp32 ....... False
batch_size ...................... 4
bert_load ....................... None
bias_dropout_fusion ............. False
bias_gelu_fusion ................ False
block_data_path ................. None
checkpoint_activations .......... True
checkpoint_in_cpu ............... True
checkpoint_num_layers ........... 1
clip_grad ....................... 1.0
contigious_checkpointing ........ True
cpu_optimizer ................... False
cpu_torch_adam .................. False
data_impl ....................... mmap
data_path ....................... ../my-gpt2_text_document
DDP_impl ........................ local
deepscale ....................... False
deepscale_config ................ None
deepspeed ....................... True
deepspeed_activation_checkpointing True
deepspeed_config ................ /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json
deepspeed_mpi ................... False
distribute_checkpointed_activations False
distributed_backend ............. nccl
dynamic_loss_scale .............. True
eod_mask_loss ................... False
eval_interval ................... 2000
eval_iters ...................... 10
exit_interval ................... None
faiss_use_gpu ................... False
finetune ........................ False
fp16 ............................ True
fp16_lm_cross_entropy ........... False
fp32_allreduce .................. False
hidden_dropout .................. 0.1
hidden_size ..................... 1024
hysteresis ...................... 2
ict_head_size ................... None
ict_load ........................ None
indexer_batch_size .............. 128
indexer_log_interval ............ 1000
init_method_std ................. 0.02
layernorm_epsilon ............... 1e-05
lazy_mpu_init ................... None
load ............................ None
local_rank ...................... 0
log_interval .................... 1
loss_scale ...................... None
loss_scale_window ............... 1000
lr .............................. 0.00015
lr_decay_iters .................. 320000
lr_decay_style .................. cosine
lr_decay_tokens ................. None
make_vocab_size_divisible_by .... 128
mask_prob ....................... 0.15
max_position_embeddings ......... 1024
memory_centric_tiled_linear ..... False
merge_file ...................... ../pretrain_models/gpt2/gpt2-merges.txt
min_lr .......................... 1e-05
min_scale ....................... 1
mmap_warmup ..................... False
model_parallel_size ............. 1
no_load_optim ................... False
no_load_rng ..................... False
no_save_optim ................... False
no_save_rng ..................... False
num_attention_heads ............. 16
num_layers ...................... 5
num_unique_layers ............... None
num_workers ..................... 2
onnx_safe ....................... None
openai_gelu ..................... False
override_lr_scheduler ........... False
param_sharing_style ............. grouped
params_dtype .................... torch.float16
partition_activations ........... True
profile_backward ................ False
query_in_block_prob ............. 0.1
rank ............................ 0
remote_device ................... none
report_topk_accuracies .......... []
reset_attention_mask ............ False
reset_position_ids .............. False
save ............................ ../gpt2_checkpoints/gpt2_ds_zero3
save_interval ................... 10000
scaled_masked_softmax_fusion .... False
scaled_upper_triang_masked_softmax_fusion False
scattered_embeddings ............ True
seed ............................ 1234
seq_length ...................... 1024
short_seq_prob .................. 0.1
split ........................... 949,50,1
split_transformers .............. True
synchronize_each_layer .......... True
tensorboard_dir ................. None
tile_factor ..................... 1
titles_data_path ................ None
tokenizer_type .................. GPT2BPETokenizer
tokens .......................... 0
train_iters ..................... 320000
train_tokens .................... None
use_checkpoint_lr_scheduler ..... False
use_cpu_initialization .......... False
use_one_sent_docs ............... False
use_pin_memory .................. False
vocab_file ...................... ../pretrain_models/gpt2/gpt2-vocab.json
warmup .......................... 0.01
warmup_iters .................... None
weight_decay .................... 0.01
world_size ...................... 1
zero_allgather_bucket_size ...... 5000000000
zero_contigious_gradients ....... True
zero_reduce_bucket_size ......... 50000000
zero_reduce_scatter ............. True
zero_stage ...................... 3
---------------- end of arguments ----------------
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
[2023-03-03 02:22:00,223] [INFO] [checkpointing.py:231:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2023-03-03 02:22:03,259] [INFO] [utils.py:588:see_memory_usage] Before Building Model
/opt/conda/lib/python3.7/site-packages/torch/cuda/memory.py:386: FutureWarning: torch.cuda.memory_cached has been renamed to torch.cuda.memory_reserved
FutureWarning)
/opt/conda/lib/python3.7/site-packages/torch/cuda/memory.py:394: FutureWarning: torch.cuda.max_memory_cached has been renamed to torch.cuda.max_memory_reserved
FutureWarning)
[2023-03-03 02:22:03,260] [INFO] [utils.py:593:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB
[2023-03-03 02:22:03,260] [INFO] [utils.py:598:see_memory_usage] CPU Virtual Memory: used = 11.13 GB, percent = 1.5%
[2023-03-03 02:22:03,485] [INFO] [utils.py:588:see_memory_usage] After Building Model
[2023-03-03 02:22:03,486] [INFO] [utils.py:593:see_memory_usage] MA 0.22 GB Max_MA 0.22 GB CA 0.26 GB Max_CA 0 GB
[2023-03-03 02:22:03,486] [INFO] [utils.py:598:see_memory_usage] CPU Virtual Memory: used = 11.17 GB, percent = 1.5%
> number of parameters on model parallel rank 0 0.116 Billion
> learning rate decay style: cosine
DeepSpeed is enabled.
[2023-03-03 02:22:03,487] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.4.3, git-hash=unknown, git-branch=unknown
[2023-03-03 02:22:03,491] [INFO] [engine.py:180:__init__] DeepSpeed Flops Profiler Enabled: False
[2023-03-03 02:22:03,491] [INFO] [engine.py:700:_configure_optimizer] Removing param_group that has no 'params' in the client Optimizer
[2023-03-03 02:22:03,491] [INFO] [engine.py:704:_configure_optimizer] Using client Optimizer as basic optimizer
[2023-03-03 02:22:03,491] [INFO] [engine.py:714:_configure_optimizer] DeepSpeed Basic Optimizer = AdamW
[2023-03-03 02:22:03,491] [INFO] [utils.py:44:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2023-03-03 02:22:03,491] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2023-03-03 02:22:03,491] [INFO] [engine.py:938:_configure_zero_optimizer] Initializing ZeRO Stage 3
[2023-03-03 02:22:03,494] [INFO] [stage3.py:633:__init__] Reduce bucket size 10000000.0
[2023-03-03 02:22:03,494] [INFO] [stage3.py:634:__init__] Allgather bucket size 10000000.0
[2023-03-03 02:22:03,731] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | init_optimizer_state: 15.07
[2023-03-03 02:22:03,731] [INFO] [stage3.py:825:__init__] optimizer state initialized
[2023-03-03 02:22:03,748] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2023-03-03 02:22:03,748] [INFO] [engine.py:516:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2023-03-03 02:22:03,748] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x7f44ae4a6dd0>
[2023-03-03 02:22:03,748] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
[2023-03-03 02:22:03,748] [INFO] [config.py:900:print] DeepSpeedEngine configuration:
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] allreduce_always_fp32 ........ False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] amp_enabled .................. False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] amp_params ................... False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] checkpoint_tag_validation_enabled True
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] checkpoint_tag_validation_fail False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] disable_allgather ............ False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] dump_state ................... False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_enabled ........... False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_gas_boundary_resolution 1
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_layer_name ........ bert.encoder.layer
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_layer_num ......... 0
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_max_iter .......... 100
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_stability ......... 1e-06
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_tol ............... 0.01
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] eigenvalue_verbose ........... False
[2023-03-03 02:22:03,749] [INFO] [config.py:904:print] elasticity_enabled ........... False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] flops_profiler_config ........ {
"enabled": false,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] fp16_enabled ................. True
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] fp16_mixed_quantize .......... False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] global_rank .................. 0
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] gradient_accumulation_steps .. 1
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] gradient_clipping ............ 1.0
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] gradient_predivide_factor .... 1.0
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] initial_dynamic_scale ........ 4294967296
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] loss_scale ................... 0
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] memory_breakdown ............. False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] optimizer_legacy_fusion ...... False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] optimizer_name ............... None
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] optimizer_params ............. None
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] pld_enabled .................. False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] pld_params ................... False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] prescale_gradients ........... False
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] quantize_change_rate ......... 0.001
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] quantize_groups .............. 1
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] quantize_offset .............. 1000
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] quantize_period .............. 1000
[2023-03-03 02:22:03,750] [INFO] [config.py:904:print] quantize_rounding ............ 0
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] quantize_start_bits .......... 16
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] quantize_target_bits ......... 8
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] quantize_training_enabled .... False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] quantize_type ................ 0
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] quantize_verbose ............. False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] scheduler_name ............... None
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] scheduler_params ............. None
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] sparse_attention ............. None
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] sparse_gradients_enabled ..... False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] steps_per_print .............. 1
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] tensorboard_enabled .......... False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] tensorboard_job_name ......... DeepSpeedJobName
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] tensorboard_output_path ......
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] train_batch_size ............. 64
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] train_micro_batch_size_per_gpu 64
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] use_quantizer_kernel ......... False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] wall_clock_breakdown ......... True
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] world_size ................... 1
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] zero_allow_untested_optimizer False
[2023-03-03 02:22:03,751] [INFO] [config.py:904:print] zero_config .................. {
"stage": 3,
"contiguous_gradients": true,
"reduce_scatter": true,
"reduce_bucket_size": 1.000000e+07,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": true,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+09,
"prefetch_bucket_size": 1.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_fp16_weights_on_model_save": false,
"ignore_unused_parameters": true,
"legacy_stage1": false
}
[2023-03-03 02:22:03,752] [INFO] [config.py:904:print] zero_enabled ................. True
[2023-03-03 02:22:03,752] [INFO] [config.py:904:print] zero_optimization_stage ...... 3
[2023-03-03 02:22:03,752] [INFO] [config.py:911:print] json = {
"train_batch_size": 64,
"gradient_accumulation_steps": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 1.000000e+09,
"stage3_max_reuse_distance": 1.000000e+09,
"stage3_prefetch_bucket_size": 1.000000e+07,
"stage3_param_persistence_threshold": 1.000000e+05,
"reduce_bucket_size": 1.000000e+07,
"contiguous_gradients": true
},
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true,
"zero_allow_untested_optimizer": false
}
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 1280000
validation: 6440
test: 40
> building train, validation, and test datasets for GPT2 ...
> building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
> finished creating indexed dataset in 0.000443 seconds
number of documents: 2761
> dataset split:
train:
document indices in [0, 2620) total of 2620 documents
validation:
document indices in [2620, 2758) total of 138 documents
test:
document indices in [2758, 2761) total of 3 documents
> loading doc-idx mapping from ../my-gpt2_text_document_train_indexmap_1280000ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_train_indexmap_1280000ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_train_indexmap_1280000ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 1280089
total number of epochs: 8198
> loading doc-idx mapping from ../my-gpt2_text_document_valid_indexmap_6440ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_valid_indexmap_6440ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_valid_indexmap_6440ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 6442
total number of epochs: 892
> loading doc-idx mapping from ../my-gpt2_text_document_test_indexmap_40ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from ../my-gpt2_text_document_test_indexmap_40ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from ../my-gpt2_text_document_test_indexmap_40ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 41
total number of epochs: 49
> finished creating GPT2 datasets ...
setting training data start iteration to 0
setting validation data start iteration to 0
done with setups ...
time (ms) | model and optimizer: 539.74 | train/valid/test data iterators: 285.37
training ...
On node1 (100.3.8.100), run ds_pretrain_gpt2-zero3.sh with the following hostfile (myhostfile):
100.3.8.100 slots=2
100.3.8.68 slots=2
root@15418839153d:/workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples# bash ds_pretrain_gpt2-zero3.sh
deepspeed --hostfile=myhostfile ../pretrain_gpt2.py --model-parallel-size 2 --num-layers 5 --hidden-size 1024 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 320000 --lr-decay-iters 320000 --save ../gpt2_checkpoints/gpt2_ds_zero3 --data-path ../my-gpt2_text_document --vocab-file ../pretrain_models/gpt2/gpt2-vocab.json --merge-file ../pretrain_models/gpt2/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-activations --log-interval 1 --save-interval 10000 --eval-interval 2000 --eval-iters 10 --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json --zero-stage 3 --zero-reduce-bucket-size 50000000 --zero-allgather-bucket-size 5000000000 --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers 1 --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
[2023-03-03 02:15:23,176] [INFO] [runner.py:293:main] Using IP address of 100.3.8.100 for node 100.3.8.100
[2023-03-03 02:15:23,178] [INFO] [multinode_runner.py:51:get_cmd] Running on the following workers: 100.3.8.100,100.3.8.68
[2023-03-03 02:15:23,178] [INFO] [runner.py:360:main] cmd = pdsh -f 1024 -w 100.3.8.100,100.3.8.68 export NCCL_VERSION=2.7.8; export PYTHONPATH=/workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples; cd /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples; /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyIxMDAuMy44LjEwMCI6IFswLCAxXSwgIjEwMC4zLjguNjgiOiBbMCwgMV19 --node_rank=%n --master_addr=100.3.8.100 --master_port=29500 ../pretrain_gpt2.py --model-parallel-size '2' --num-layers '5' --hidden-size '1024' --num-attention-heads '16' --seq-length '1024' --max-position-embeddings '1024' --batch-size '4' --train-iters '320000' --lr-decay-iters '320000' --save '../gpt2_checkpoints/gpt2_ds_zero3' --data-path '../my-gpt2_text_document' --vocab-file '../pretrain_models/gpt2/gpt2-vocab.json' --merge-file '../pretrain_models/gpt2/gpt2-merges.txt' --data-impl 'mmap' --split '949,50,1' --distributed-backend 'nccl' --lr '1.5e-4' --lr-decay-style 'cosine' --min-lr '1.0e-5' --weight-decay '1e-2' --clip-grad '1.0' --warmup '0.01' --checkpoint-activations --log-interval '1' --save-interval '10000' --eval-interval '2000' --eval-iters '10' --fp16 --scattered-embeddings --split-transformers --deepspeed --deepspeed_config '/workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples/ds_zero_stage_3_config.json' --zero-stage '3' --zero-reduce-bucket-size '50000000' --zero-allgather-bucket-size '5000000000' --zero-contigious-gradients --zero-reduce-scatter --deepspeed-activation-checkpointing --checkpoint-num-layers '1' --partition-activations --checkpoint-in-cpu --synchronize-each-layer --contigious-checkpointing
100.3.8.100: bash: line 0: cd: /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples: No such file or directory
100.3.8.100: bash: /opt/conda/bin/python: No such file or directory
pdsh@15418839153d: 100.3.8.100: ssh exited with exit code 127
100.3.8.68: /etc/profile.d/lang.sh: line 19: warning: setlocale: LC_CTYPE: cannot change locale (C.UTF-8)
100.3.8.68: bash: line 0: cd: /workspace/megatron/Megatron-LM-v1.1.5-ZeRO3/examples: No such file or directory
100.3.8.68: bash: /opt/conda/bin/python: No such file or directory
pdsh@15418839153d: 100.3.8.68: ssh exited with exit code 127
Hi @tjruwase, do you have any suggestions?
Hi @bing0037, I did something similar before (testing ZeRO in a multi-node Docker environment). Here are some thoughts and suggestions:
- The server (outside the container, hereinafter "Host") and the Docker environment (inside the container, hereinafter "Container") can be treated as two completely independent environments, each with its own operating system; you can even think of the Container as a virtual machine. That is imprecise, but it makes things easier to understand.
- I guess the IPs you mentioned (100.3.8.XXX) are Host IPs. When you start a Docker container without specifying the network parameter (--net), it defaults to the Bridge network. At that point there are two IPs: the Host IP (100.3.8.XXX) and the Container IP (typically 172.XXX.XXX.XXX), corresponding to the two independent environments above. So when you write the DeepSpeed hostfile, you should actually use the Container IPs instead of the Host IPs. With a Host IP, DeepSpeed establishes an ssh connection to the Host and tries to read files and execute scripts there, but the Host has no /workspace and no /opt/conda/bin/python (those paths exist only in the Container), which is exactly the error message you got. However, just putting the current Container IPs in the hostfile still won't work; keep reading.
- As you know, DeepSpeed's distributed training uses ssh, which requires the nodes to be able to reach each other (in short, to ping each other). At present you start the containers on the Bridge network, so the two containers cannot ping each other (both sit behind 172.XXX.XXX.XXX addresses). There are roughly two solutions:
  - Continue to use the Bridge network. Then you need to manually configure routing tables; you can try it, but I don't recommend it.
  - Use an Overlay network. The official summary: "Overlay networks are best when you need containers running on different Docker hosts to communicate, or when multiple applications work together using swarm services." Obviously the Overlay network fits this scenario better; for detailed configuration instructions, please refer to the official documentation. With it, a typical Container IP is 10.0.XXX.XXX, and containers on different nodes can communicate (see the hostfile sketch just after this list), but ssh also requires the ssh service to be installed and running in each container.
- A typical GPU cluster uses InfiniBand to accelerate communication between nodes; I'm not sure whether your cluster has it. If it does, add --privileged when starting the container so that InfiniBand is available inside it. This worked for me; of course, it requires InfiniBand to be working on the Host in the first place.
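To make the hostfile fix concrete, here is a minimal sketch, assuming both containers are attached to a shared Overlay network and received the hypothetical addresses 10.0.0.2 and 10.0.0.3:
10.0.0.2 slots=2
10.0.0.3 slots=2
Each entry is a Container IP that the other container can actually ping, and slots matches the number of GPUs to use on that node.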
FYI, this is how I use docker while testing.
Start the container:
docker run --gpus all --name AAAAA --net=BBBBB -itd --privileged -v CCCCC DDDDD bash
AAAAA: name of the docker container; BBBBB: name of the created Overlay network; CCCCC: volume mount path; DDDDD: name of the docker image
Enter the container:
docker exec -it AAAAA bash
Exit the container (it keeps running after you exit, unless you execute docker container stop AAAAA):
exit
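For example, filling in the placeholders with the mount path and image from your own docker run command above (the container name ds-node0 and network name ds-overlay are made up):
docker run --gpus all --name ds-node0 --net=ds-overlay -itd --privileged -v /home/libn/DeepSpeedExamples:/workspace deepspeed:latest bash
docker exec -it ds-node0 bash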
The above is based on the information you provided and my own experience; there may be inaccuracies or mistakes. 😂
Additional materials: see the overlay-network documentation linked further below.
Thanks for your answer; it seems like a reasonable way to solve the problem. But I wonder: is there any way to run the commands (differing in rank/role/port) in each Docker environment and then have them start communicating, so that ssh is not required, like TensorFlow's distributed training? That way I could run the distributed training on Kubernetes.
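For reference, one possible direction (a sketch only, not verified against this exact DeepSpeed/Megatron version): DeepSpeed also works when its processes are started with the standard torch.distributed launcher, which rendezvouses over TCP on the master address instead of using ssh, so you can run one launch command per node yourself (e.g. from a Kubernetes pod spec). Assuming two containers with hypothetical IPs 10.0.0.2 (master) and 10.0.0.3, each with 2 GPUs:
# inside the container on node 0 (10.0.0.2):
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=10.0.0.2 --master_port=29500 ../pretrain_gpt2.py <same training and --deepspeed options as above>
# inside the container on node 1 (10.0.0.3):
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=10.0.0.2 --master_port=29500 ../pretrain_gpt2.py <same training and --deepspeed options as above>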
This seems like the best answer so far, and the issue is fairly stale, so I'm closing it for now. If folks have other suggestions, please post here; if you have other questions, please open a new issue so we can see it and reply. Thanks!
Official documentation for creating an overlay network: use-an-overlay-network-for-standalone-containers
With this network the two containers can communicate across machines; what remains is passwordless ssh login, environment configuration, and so on.
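A minimal sketch of that setup, using standard Docker swarm commands (the network name my-overlay is made up):
# on node 1 (100.3.8.100), initialize a swarm manager:
docker swarm init --advertise-addr 100.3.8.100
# on node 2 (100.3.8.68), join with the token printed by the previous command:
docker swarm join --token <token> 100.3.8.100:2377
# back on node 1, create an attachable overlay network:
docker network create --driver overlay --attachable my-overlay
Then start the container on each node with --net=my-overlay, as in the docker run template earlier in this thread.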
Problem: How to run ds_pretrain_gpt2-zero3.sh in a Docker environment using 2 nodes?
Current platform: 1) I have 2 nodes (100.3.8.100 & 100.3.8.68). Each node has 8 V100 GPUs, its Docker environment is fully set up, and I can run ds_pretrain_gpt2-zero3.sh locally and successfully using the following command:
2) I installed pdsh on each node: for 100.3.8.100, I installed pdsh inside the Docker environment (deepspeed:latest); for 100.3.8.68, I installed pdsh inside the Docker environment (deepspeed:latest) as well.
My modification: I want to use 2 nodes, so I modified ds_pretrain_gpt2-zero3.sh: 1) changed run_cmd to: run_cmd="deepspeed --hostfile=myhostfile pretrain_gpt2.py ${@:2} ${full_options}"; 2) added myhostfile.
My test and error: Test: on the 100.3.8.100 server, in the Docker env (deepspeed:latest).
Error:
Any suggestions? Thanks!