microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

When I use NVMe to offload param and optimizer I meet a bug [BUG] #3376

Closed etoilestar closed 1 year ago

etoilestar commented 1 year ago

python: /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:228: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
/nvme/zero_stage_3/optimizer/rank6/139649992513552.tensor.swp: buffer nbytes != file bytes 4001366016 != 3426746368
python: /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:228: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
/nvme/zero_stage_3/optimizer/rank2/139929715382368.tensor.swp: buffer nbytes != file bytes 4001366016 != 3599761408
python: /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/aio/py_lib/deepspeed_py_aio_handle.cpp:228: int deepspeed_aio_handle_t::pread(const at::Tensor&, const char*, bool, bool): Assertion `static_cast<long long int>(buffer.nbytes()) == num_file_bytes' failed.
/nvme/zero_stage_3/optimizer/rank1/140296723433568.tensor.swp: buffer nbytes != file bytes 4001366016 != 3539992576

tjruwase commented 1 year ago

@etoilestar, can you share more details to repro this issue? In the meantime, can you confirm that /nvme/zero_stage_3/ is empty before running?
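
For reference, a minimal pre-run check along these lines can confirm that no stale swap files are left over from a previous run. This is only a sketch; it assumes the /nvme/zero_stage_3 path from the error messages above and that it is safe to delete anything found there.

```bash
# Sketch: verify the ZeRO-3 NVMe offload folder is empty before launching.
# Path taken from the error messages above; adjust to your nvme_path.
SWAP_DIR=/nvme/zero_stage_3
if [ -d "$SWAP_DIR" ] && [ -n "$(ls -A "$SWAP_DIR" 2>/dev/null)" ]; then
    echo "Stale swap files found under $SWAP_DIR:"
    du -sh "$SWAP_DIR"
    rm -rf "${SWAP_DIR:?}"/*   # clear leftovers from a previous run
else
    echo "$SWAP_DIR is empty (or does not exist yet)"
fi
```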

etoilestar commented 1 year ago

Yes, I emptied this folder before I ran the code. What kind of information should I provide? Can you give me a hint?

etoilestar commented 1 year ago

I use 8 RTX 3090 GPUs, and the code I execute is deepspeed_megatron to train GPT-3. When I increase buffer_count, this error disappears, but the run freezes during preprocessing.
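
For context, buffer_count lives under the ZeRO-3 NVMe offload sections of ds_config.json (the same offload_param / offload_optimizer keys echoed in the log later in this thread). A hedged sketch of bumping both values, assuming jq is available; the value 8 is purely illustrative, not a recommendation:

```bash
# Sketch: raise the NVMe offload buffer counts in ds_config.json.
# Keys match the offload config printed later in this thread.
jq '.zero_optimization.offload_param.buffer_count = 8
    | .zero_optimization.offload_optimizer.buffer_count = 8' \
   ds_config.json > ds_config.tuned.json
```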

ReyRen commented 1 year ago

Describe the bug

Hi @tjruwase, thanks a lot for joining us. I have exactly the same problem as @etoilestar. Let me give some more details. I noticed the "buffer nbytes != file xxx" error was already patched in #2002, and the DeepSpeed version I am using is the latest one, but the problem still occurs.

To Reproduce

git clone https://github.com/microsoft/Megatron-DeepSpeed/
cd Megatron-DeepSpeed/examples
# I already attached the modified run_deepspeed_example.sh
/bin/bash run_deepspeed_example.sh

Then, "buffer nbytes != file xxx" occurred.

System info (please complete the following information):

Thanks! script.zip
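
For the system info requested above, the DeepSpeed op-compatibility and environment summary (as pasted later in this thread) can be regenerated on the failing machine with the reporting utility that ships with DeepSpeed, and attached to the issue:

```bash
# Regenerate the DeepSpeed op/compatibility and environment report.
ds_report
```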

tjruwase commented 1 year ago

@ReyRen, could you please share your log as well?

tjruwase commented 1 year ago

@etoilestar and @ReyRen, I am trying to repro this issue. I am using a 4xV100-16GB, which is probably different from your setups. Can you please share your stack trace as well? Thanks!

etoilestar commented 1 year ago

Hello, thanks for your reply. He got the same log as me; here is my log:

**[2023-05-04 01:51:44,732] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2023-05-04 01:51:45,824] [INFO] [runner.py:540:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None pretrain_gpt.py --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1 --num-layers 96 --hidden-size 3072 --num-attention-heads 96 --seq-length 2048 --loss-scale 12 --max-position-embeddings 2048 --micro-batch-size 1 --global-batch-size 8 --train-iters 1000 --lr 6.0e-5 --min-lr 6.0e-6 --lr-decay-style cosine --log-interval 1 --eval-iters 40 --eval-interval 1000 --data-path ../dataset/my-gpt2_text_document --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt --save-interval 1000 --split 98,2,0 --clip-grad 1.0 --weight-decay 0.1 --adam-beta1 0.9 --adam-beta2 0.95 --init-method-std 0.006 --fp16 --checkpoint-activations --tensorboard-dir ds_z3_nl96_hs3072_gb8_mb1 --cpu-optimizer --deepspeed-activation-checkpointing --zero-stage=3 --deepspeed_config=ds_config.json --no-pipeline-parallel --deepspeed --exit-interval 5000 [2023-05-04 01:51:48,924] [INFO] [launch.py:222:main] 0 NCCL_VERSION=2.15.5 [2023-05-04 01:51:48,924] [INFO] [launch.py:222:main] 0 NCCL_DEBUG=warn [2023-05-04 01:51:48,924] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} [2023-05-04 01:51:48,924] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=8, node_rank=0 [2023-05-04 01:51:48,924] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}) [2023-05-04 01:51:48,924] [INFO] [launch.py:247:main] dist_world_size=8 [2023-05-04 01:51:48,924] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY] cpu_adagrad ............ [NO] ....... [OKAY] cpu_adam ............... [NO] ....... [OKAY] fused_adam ............. [NO] ....... [OKAY] fused_lamb ............. [NO] ....... [OKAY] quantizer .............. [NO] ....... [OKAY] random_ltd ............. [NO] ....... [OKAY]  [WARNING]  please install triton==1.0.0 if you want to use sparse attention sparse_attn ............ [NO] ....... [NO]

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja ninja .................. [OKAY]

op name ................ installed .. compatible


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja ninja .................. [OKAY]

op name ................ installed .. compatible


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY] cpu_adagrad ............ [NO] ....... [OKAY] cpu_adam ............... [NO] ....... [OKAY] fused_adam ............. [NO] ....... [OKAY] fused_lamb ............. [NO] ....... [OKAY] quantizer .............. [NO] ....... [OKAY] random_ltd ............. [NO] ....... [OKAY]  [WARNING]  please install triton==1.0.0 if you want to use sparse attention sparse_attn ............ [NO] ....... [NO] async_io ............... [NO] ....... [OKAY] cpu_adagrad ............ [NO] ....... [OKAY] cpu_adam ............... [NO] ....... [OKAY] fused_adam ............. [NO] ....... [OKAY] fused_lamb ............. [NO] ....... [OKAY] quantizer .............. [NO] ....... [OKAY] random_ltd ............. [NO] ....... [OKAY]  [WARNING]  please install triton==1.0.0 if you want to use sparse attention sparse_attn ............ [NO] ....... [NO]

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja async_io ............... [NO] ....... [OKAY] cpu_adagrad ............ [NO] ....... ninja[OKAY] .................. [OKAY]

op name ................ installed .. compatiblecpu_adam


............... [NO] ....... [OKAY] fused_adam ............. [NO] ....... [OKAY] fused_lamb ............. [NO] ....... [OKAY] quantizer .............. [NO] ....... [OKAY] random_ltd ............. [NO] ....... [OKAY]  [WARNING]  please install triton==1.0.0 if you want to use sparse attention sparse_attn ............ [NO] ....... [NO]

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja ninja .................. [OKAY]

op name ................ installed .. compatible


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja ninja .................. [OKAY]

op name ................ installed .. compatible


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja ninja .................. [OKAY]

op name ................ installed .. compatible

spatial_inference ...... [NO] ....... [OKAY] transformer ............ [NO] ....... [OKAY] stochastic_transformer . [NO] ....... [OKAY] transformer_inference .. [NO] ....... [OKAY] utils .................. [NO] ....... [OKAY]

spatial_inference ...... [NO] ....... [OKAY] transformer ............ [NO] ....... [OKAY] stochastic_transformer . [NO] ....... [OKAY] DeepSpeed general environment info: torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch'] torch version .................... 1.14.0a0+410ce96 deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed'] deepspeed info ................... 0.9.1, unknown, unknown torch cuda version ............... 11.8 torch hip version ................ None nvcc version ..................... 11.8 deepspeed wheel compiled w. ...... torch 1.14, cuda 11.8 spatial_inference ...... [NO] ....... [OKAY] transformer ............ [NO] ....... [OKAY] stochastic_transformer . [NO] ....... [OKAY] transformer_inference .. [NO] ....... [OKAY] utils .................. [NO] ....... [OKAY]

spatial_inference ...... [NO] ....... [OKAY] transformer ............ [NO] ....... [OKAY] stochastic_transformer . [NO] ....... [OKAY] DeepSpeed general environment info: torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch'] torch version .................... 1.14.0a0+410ce96 deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed'] deepspeed info ................... 0.9.1, unknown, unknown torch cuda version ............... 11.8 torch hip version ................ None nvcc version ..................... 11.8 deepspeed wheel compiled w. ...... torch 1.14, cuda 11.8 Git info for Megatron: git_hash=unknown git_branch=unknown async_io ............... [NO] ....... [OKAY] cpu_adagrad ............ [NO] ....... [OKAY] cpu_adam ............... [NO] ....... [OKAY] fused_adam ............. [NO] ....... [OKAY] fused_lamb ............. [NO] ....... [OKAY] quantizer .............. [NO] ....... [OKAY] random_ltd ............. [NO] ....... [OKAY]  [WARNING]  please install triton==1.0.0 if you want to use sparse attention sparse_attn ............ [NO] ....... [NO] transformer_inference .. [NO] ....... [OKAY] utils .................. [NO] ....... [OKAY]

transformer_inference .. [NO] ....... [OKAY] utils .................. [NO] ....... [OKAY]

async_io ............... [NO] ....... [OKAY] cpu_adagrad ............ [NO] ....... [OKAY] cpu_adam ............... [NO] ....... [OKAY] fused_adam ............. [NO] ....... [OKAY] fused_lamb ............. [NO] ....... [OKAY] quantizer .............. [NO] ....... [OKAY] random_ltd ............. [NO] ....... [OKAY]  [WARNING]  please install triton==1.0.0 if you want to use sparse attention sparse_attn ............ [NO] ....... [NO] async_io ............... [NO] ....... [OKAY] cpu_adagrad ............ [NO] ....... [OKAY] cpu_adam ............... [NO] ....... [OKAY] fused_adam ............. [NO] ....... [OKAY] fused_lamb ............. [NO] ....... [OKAY] quantizer .............. [NO] ....... [OKAY] random_ltd ............. [NO] ....... [OKAY]  [WARNING]  please install triton==1.0.0 if you want to use sparse attention sparse_attn ............ [NO] ....... [NO] async_io ............... [NO] ....... [OKAY] cpu_adagrad ............ [NO] ....... [OKAY] cpu_adam ............... [NO] ....... [OKAY] fused_adam ............. [NO] ....... [OKAY] fused_lamb ............. [NO] ....... [OKAY] quantizer .............. [NO] ....... [OKAY] random_ltd ............. [NO] ....... [OKAY]  [WARNING]  please install triton==1.0.0 if you want to use sparse attention sparse_attn ............ [NO] ....... [NO] DeepSpeed general environment info: torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch'] torch version .................... 1.14.0a0+410ce96 deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed'] deepspeed info ................... 0.9.1, unknown, unknown torch cuda version ............... 11.8 torch hip version ................ None nvcc version ..................... 11.8 deepspeed wheel compiled w. ...... torch 1.14, cuda 11.8 Git info for Megatron: git_hash=unknown git_branch=unknown DeepSpeed general environment info: torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch'] torch version .................... 1.14.0a0+410ce96 deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed'] deepspeed info ................... 0.9.1, unknown, unknown torch cuda version ............... 11.8 torch hip version ................ None nvcc version ..................... 11.8 deepspeed wheel compiled w. ...... torch 1.14, cuda 11.8 Git info for Megatron: git_hash=unknown git_branch=unknown Git info for Megatron: git_hash=unknown git_branch=unknown using world size: 8, data-parallel-size: 8, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 using torch.float16 for parameters ... ------------------------ arguments ------------------------ accumulate_allreduce_grads_in_fp32 .............. False adam_beta1 ...................................... 0.9 adam_beta2 ...................................... 0.95 adam_eps ........................................ 1e-08 adlr_autoresume ................................. False adlr_autoresume_interval ........................ 1000 aml_data_download_path .......................... None apply_query_key_layer_scaling ................... True apply_residual_connection_post_layernorm ........ False attention_dropout ............................... 0.1 attention_softmax_in_fp32 ....................... False bert_binary_head ................................ True bert_load ....................................... None bf16 ............................................ False bias_dropout_fusion ............................. 
True bias_gelu_fusion ................................ True biencoder_projection_dim ........................ 0 biencoder_shared_query_context_model ............ False block_data_path ................................. None checkpoint_activations .......................... True checkpoint_in_cpu ............................... False checkpoint_num_layers ........................... 1 clip_grad ....................................... 1.0 compression_training ............................ False consumed_train_samples .......................... 0 consumed_train_tokens ........................... 0 consumed_valid_samples .......................... 0 contigious_checkpointing ........................ False cpu_optimizer ................................... True cpu_torch_adam .................................. False create_moe_param_group .......................... False curriculum_learning_legacy ...................... False custom_token_counting ........................... False data_efficiency_curriculum_learning ............. False data_impl ....................................... infer data_parallel_size .............................. 8 data_path ....................................... ['../dataset/my-gpt2_text_document'] dataloader_type ................................. single DDP_impl ........................................ local decoder_seq_length .............................. None deepscale ....................................... False deepscale_config ................................ None deepspeed ....................................... True deepspeed_activation_checkpointing .............. True deepspeed_config ................................ ds_config.json deepspeed_mpi ................................... False distribute_checkpointed_activations ............. False distributed_backend ............................. nccl ds_inference .................................... False ds_pipeline_enabled ............................. False embedding_path .................................. None enable_expert_tensor_parallelism ................ False encoder_seq_length .............................. 2048 eod_mask_loss ................................... False eval_interval ................................... 1000 eval_iters ...................................... 40 evidence_data_path .............................. None exit_duration_in_mins ........................... None exit_interval ................................... 5000 expert_interval ................................. 2 ffn_hidden_size ................................. 12288 finetune ........................................ False fp16 ............................................ True fp16_lm_cross_entropy ........................... False fp32_residual_connection ........................ False global_batch_size ............................... 8 hidden_dropout .................................. 0.1 hidden_size ..................................... 3072 hidden_size_teacher ............................. None hysteresis ...................................... 2 ict_head_size ................................... None ict_load ........................................ None img_dim ......................................... 224 indexer_batch_size .............................. 128 indexer_log_interval ............................ 1000 inference ....................................... False init_method_std ................................. 0.006 init_method_xavier_uniform ...................... 
False initial_loss_scale .............................. 4294967296 kd .............................................. False kd_alpha_ce ..................................... 1 kd_beta_ce ...................................... 1 kd_temp ......................................... 1.0 kv_channels ..................................... 32 layernorm_epsilon ............................... 1e-05 lazy_mpu_init ................................... None load ............................................ None load_teacher .................................... None local_rank ...................................... 0 log_batch_size_to_tensorboard ................... False log_interval .................................... 1 log_learning_rate_to_tensorboard ................ True log_loss_scale_to_tensorboard ................... True log_num_zeros_in_grad ........................... False log_optimizer_states_to_tensorboard ............. False log_params_norm ................................. False log_timers_to_tensorboard ....................... False log_validation_ppl_to_tensorboard ............... False loss_scale ...................................... 12.0 loss_scale_window ............................... 1000 lr .............................................. 6e-05 lr_decay_iters .................................. None lr_decay_samples ................................ None lr_decay_style .................................. cosine lr_decay_tokens ................................. None lr_warmup_fraction .............................. None lr_warmup_iters ................................. 0 lr_warmup_samples ............................... 0 lr_warmup_tokens ................................ None make_vocab_size_divisible_by .................... 128 mask_prob ....................................... 0.15 masked_softmax_fusion ........................... True max_position_embeddings ......................... 2048 memory_centric_tiled_linear ..................... False merge_file ...................................... gpt2-merges.txt micro_batch_size ................................ 1 min_loss_scale .................................. 1.0 min_lr .......................................... 6e-06 mlp_type ........................................ standard mmap_warmup ..................................... False moe_eval_capacity_factor ........................ 1.0 moe_expert_parallel_size ........................ 1 moe_loss_coeff .................................. 0.1 moe_min_capacity ................................ 4 moe_token_dropping .............................. True moe_train_capacity_factor ....................... 1.0 mos ............................................. False no_load_lr_state ................................ False no_load_optim ................................... None no_load_rng ..................................... None no_pipeline_parallel ............................ True no_save_optim ................................... None no_save_rng ..................................... None num_attention_heads ............................. 96 num_attention_heads_teacher ..................... None num_channels .................................... 3 num_classes ..................................... 1000 num_experts ..................................... [1] num_experts_teacher ............................. [1] num_layers ...................................... 96 num_layers_per_virtual_pipeline_stage ........... None num_layers_teacher .............................. 
None num_workers ..................................... 2 onnx_safe ....................................... None openai_gelu ..................................... False optimizer ....................................... adam override_lr_scheduler ........................... False params_dtype .................................... torch.float16 partition_activations ........................... False patch_dim ....................................... 16 pipeline_model_parallel_size .................... 1 profile_backward ................................ False query_in_block_prob ............................. 0.1 rampup_batch_size ............................... None random_ltd ...................................... False rank ............................................ 0 remote_device ................................... none reset_attention_mask ............................ False reset_iteration ................................. False reset_position_ids .............................. False retriever_report_topk_accuracies ................ [] retriever_score_scaling ......................... False retriever_seq_length ............................ 256 return_data_index ............................... False sample_rate ..................................... 1.0 save ............................................ None save_interval ................................... 1000 scatter_gather_tensors_in_pipeline .............. True scattered_embeddings ............................ False seed ............................................ 1234 seq_length ...................................... 2048 sgd_momentum .................................... 0.9 short_seq_prob .................................. 0.1 split ........................................... 98,2,0 split_transformers .............................. False synchronize_each_layer .......................... False tensor_model_parallel_size ...................... 1 tensorboard_dir ................................. ds_z3_nl96_hs3072_gb8_mb1 tensorboard_log_interval ........................ 1 tensorboard_queue_size .......................... 1000 tile_factor ..................................... 1 titles_data_path ................................ None tokenizer_type .................................. GPT2BPETokenizer topk ............................................ 1 train_data_exact_num_epochs ..................... None train_doc_idx_path .............................. None train_idx_path .................................. None train_iters ..................................... 1000 train_sample_idx_path ........................... None train_samples ................................... None train_shuffle_idx_path .......................... None train_tokens .................................... None use_checkpoint_lr_scheduler ..................... False use_contiguous_buffers_in_ddp ................... False use_cpu_initialization .......................... None use_one_sent_docs ............................... False use_pin_memory .................................. False use_tutel ....................................... False virtual_pipeline_model_parallel_size ............ None vocab_extra_ids ................................. 0 vocab_file ...................................... gpt2-vocab.json weight_decay .................................... 0.1 world_size ...................................... 8 zero_allgather_bucket_size ...................... 0.0 zero_contigious_gradients ....................... 
False zero_reduce_bucket_size ......................... 0.0 zero_reduce_scatter ............................. False zero_stage ...................................... 3 -------------------- end of arguments --------------------- setting number of micro-batches to constant 1

building GPT2BPETokenizer tokenizer ... spatial_inference ...... [NO] ....... [OKAY] transformer ............ [NO] ....... [OKAY] stochastic_transformer . [NO] ....... [OKAY] padded vocab (size: 50257) with 47 dummy tokens (new size: 50304) initializing torch distributed ... [2023-05-04 01:51:54,265] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl spatial_inference ...... [NO] ....... [OKAY] transformer ............ [NO] ....... [OKAY] stochastic_transformer . [NO] ....... [OKAY] spatial_inference ...... [NO] ....... [OKAY] transformer ............ [NO] ....... [OKAY] stochastic_transformer . [NO] ....... [OKAY] spatial_inference ...... [NO] ....... [OKAY] transformer ............ [NO] ....... [OKAY] stochastic_transformer . [NO] ....... [OKAY] transformer_inference .. [NO] ....... [OKAY] utils .................. [NO] ....... [OKAY]

transformer_inference .. [NO] ....... [OKAY] utils .................. [NO] ....... [OKAY]

transformer_inference .. [NO] ....... [OKAY] utils .................. [NO] ....... [OKAY]

transformer_inference .. [NO] ....... [OKAY] utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info: torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch'] torch version .................... 1.14.0a0+410ce96 deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed'] deepspeed info ................... 0.9.1, unknown, unknown torch cuda version ............... 11.8 torch hip version ................ None nvcc version ..................... 11.8 deepspeed wheel compiled w. ...... torch 1.14, cuda 11.8 DeepSpeed general environment info: torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch'] torch version .................... 1.14.0a0+410ce96 deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed'] deepspeed info ................... 0.9.1, unknown, unknown torch cuda version ............... 11.8 torch hip version ................ None nvcc version ..................... 11.8 deepspeed wheel compiled w. ...... torch 1.14, cuda 11.8 DeepSpeed general environment info: torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch'] torch version .................... 1.14.0a0+410ce96 deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed'] deepspeed info ................... 0.9.1, unknown, unknown torch cuda version ............... 11.8 torch hip version ................ None nvcc version ..................... 11.8 deepspeed wheel compiled w. ...... torch 1.14, cuda 11.8 DeepSpeed general environment info: torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch'] torch version .................... 1.14.0a0+410ce96 deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed'] deepspeed info ................... 0.9.1, unknown, unknown torch cuda version ............... 11.8 torch hip version ................ None nvcc version ..................... 11.8 deepspeed wheel compiled w. ...... torch 1.14, cuda 11.8 Git info for Megatron: git_hash=unknown git_branch=unknown Git info for Megatron: git_hash=unknown git_branch=unknown Git info for Megatron: git_hash=unknown git_branch=unknown Git info for Megatron: git_hash=unknown git_branch=unknown setting tensorboard ... initializing tensor model parallel with size 1 initializing pipeline model parallel with size 1 setting random seeds to 1234 ... [2023-05-04 01:51:55,565] [INFO] [checkpointing.py:226:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234 compiling dataset index builder ... make: Entering directory '/workspace/megatron/megatron/data' make: Nothing to be done for 'default'. make: Leaving directory '/workspace/megatron/megatron/data'

done with dataset index builder. Compilation time: 0.110 seconds compiling and loading fused kernels ... ninja: no work to do. ninja: no work to do. ninja: no work to do. NCCL version 2.15.5+cuda11.8 done with compiling and loading fused kernels. Compilation time: 6.573 seconds time to initialize megatron (seconds): 66.251 [after megatron is initialized] datetime: 2023-05-04 01:52:02 building GPT model ... [2023-05-04 01:52:02,435] [INFO] [utils.py:785:see_memory_usage] Before Building Model [2023-05-04 01:52:02,436] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.0 GB Max_CA 0 GB [2023-05-04 01:52:02,437] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 23.49 GB, percent = 2.3% [2023-05-04 01:52:06,274] [INFO] [utils.py:30:print_object] AsyncPartitionedParameterSwapper: [2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] aio_handle ................... <class 'async_io.aio_handle'> [2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] aligned_bytes ................ 1024 [2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] aligned_elements_per_buffer .. 100000256 [2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] available_buffer_ids ......... [0, 1, 2, 3, 4] [2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] available_numel .............. 0 [2023-05-04 01:52:06,274] [INFO] [utils.py:34:print_object] available_params ............. set() [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] dtype ........................ torch.float16 [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] elements_per_buffer .......... 100,000,000 [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] id_to_path ................... {} [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] inflight_numel ............... 0 [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] inflight_params .............. [] [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] inflight_swap_in_buffers ..... [] [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] invalid_buffer ............... 1.0 [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] min_aio_bytes ................ 1048576 [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] numel_alignment .............. 512 [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] param_buffer_count ........... 5 [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] param_id_to_buffer_id ........ {} [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] param_id_to_numel ............ {} [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] param_id_to_swap_buffer ...... {} [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] partitioned_swap_buffer ...... None [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] partitioned_swap_pool ........ None [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] pending_reads ................ 0 [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] pending_writes ............... 0 [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] reserved_buffer_ids .......... [] [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] swap_config .................. 
device='nvme' nvme_path=PosixPath('/nvme') buffer_count=5 buffer_size=100,000,000 max_in_cpu=1,000,000,000 pin_memory=True [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] swap_element_size ............ 2 [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] swap_folder .................. /nvme/zero_stage_3/float16params/rank0 [2023-05-04 01:52:06,275] [INFO] [utils.py:34:print_object] swap_out_params .............. [] [2023-05-04 01:52:23,724] [INFO] [partition_parameters.py:454:exit] finished initializing model with 11.04B parameters [2023-05-04 01:52:23,877] [INFO] [utils.py:785:see_memory_usage] After Building Model [2023-05-04 01:52:23,878] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.29 GB CA 0.29 GB Max_CA 0 GB [2023-05-04 01:52:23,878] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 32.4 GB, percent = 3.2% number of parameters on (tensor, pipeline) model parallel rank (0, 0): 11036301312 ninja: no work to do. Time to load cpu_adam op: 2.895385980606079 seconds Time to load cpu_adam op: 2.773625612258911 seconds Time to load cpu_adam op: 2.7883458137512207 seconds ninja: no work to do. Time to load cpu_adam op: 3.0697195529937744 seconds Time to load cpu_adam op: 3.0779221057891846 seconds Time to load cpu_adam op: 3.117755889892578 seconds Time to load cpu_adam op: 3.1241681575775146 seconds Time to load cpu_adam op: 3.1538240909576416 seconds Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1 Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1 Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1 Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1 learning rate decay style: cosine DeepSpeed is enabled. [2023-05-04 01:52:30,247] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.1, git-hash=unknown, git-branch=unknown [2023-05-04 01:52:30,292] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False [2023-05-04 01:52:30,295] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer [2023-05-04 01:52:30,295] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1 Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1 [2023-05-04 01:52:30,454] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam [2023-05-04 01:52:30,454] [INFO] [utils.py:51:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'> [2023-05-04 01:52:30,454] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer Adam Optimizer #0 is created with AVX512 arithmetic capability. Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1 Adam Optimizer #0 is created with AVX512 arithmetic capability. 
Config: alpha=0.000060, betas=(0.900000, 0.999000), weight_decay=0.100000, adam_w=1 [2023-05-04 01:52:30,568] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning [2023-05-04 01:52:30,569] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB [2023-05-04 01:52:30,569] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 47.41 GB, percent = 4.7% [2023-05-04 01:52:30,573] [INFO] [stage3.py:113:init] Reduce bucket size 90000000 [2023-05-04 01:52:30,573] [INFO] [stage3.py:114:init] Prefetch bucket size 50000000 ninja: no work to do. Time to load utils op: 0.2917904853820801 seconds Time to load utils op: 0.1040353775024414 seconds [2023-05-04 01:52:30,774] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [begin] [2023-05-04 01:52:30,775] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB [2023-05-04 01:52:30,775] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 47.41 GB, percent = 4.7% Parameter Offload: Total persistent parameters: 3840000 in 770 params [2023-05-04 01:52:30,912] [INFO] [utils.py:785:see_memory_usage] DeepSpeedZeRoOffload initialize [end] [2023-05-04 01:52:30,913] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB [2023-05-04 01:52:30,913] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 47.42 GB, percent = 4.7% ninja: no work to do. Time to load utils op: 0.2980797290802002 seconds Time to load utils op: 0.6072814464569092 seconds Time to load utils op: 0.3056302070617676 seconds [2023-05-04 01:52:31,009] [INFO] [utils.py:785:see_memory_usage] Before creating fp16 partitions [2023-05-04 01:52:31,010] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB [2023-05-04 01:52:31,010] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 47.41 GB, percent = 4.7% Time to load utils op: 0.6070020198822021 seconds Time to load utils op: 0.4060196876525879 seconds Time to load utils op: 0.40641093254089355 seconds [2023-05-04 01:52:40,180] [INFO] [utils.py:785:see_memory_usage] After creating fp16 partitions: 15 [2023-05-04 01:52:40,181] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB [2023-05-04 01:52:40,181] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 86.18 GB, percent = 8.6% [2023-05-04 01:52:40,181] [INFO] [stage3.py:467:_configure_tensor_swapping] Tensor Swapping: Adding optimizer tensors [2023-05-04 01:52:42,207] [INFO] [utils.py:30:print_object] SwapBufferManager: [2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] count ........................ 4 [2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] dtype ........................ torch.float32 [2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] free_buffer_index ............ [0, 1, 2, 3] [2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] gigabytes .................... 1.546875 [2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] num_elems .................... 103809024 [2023-05-04 01:52:42,208] [INFO] [utils.py:34:print_object] used_buffer_index ............ 
{} Time to load async_io op: 2.539290189743042 seconds Time to load async_io op: 2.6841211318969727 seconds Time to load async_io op: 2.691549777984619 seconds Time to load async_io op: 2.706519365310669 seconds Time to load async_io op: 2.7506959438323975 seconds Time to load async_io op: 2.815930128097534 seconds Time to load async_io op: 2.7707629203796387 seconds Time to load async_io op: 2.797130584716797 seconds [2023-05-04 01:52:45,262] [INFO] [utils.py:30:print_object] PartitionedOptimizerSwapper: [2023-05-04 01:52:45,262] [INFO] [utils.py:34:print_object] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] aligned_bytes ................ 1024 [2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] dtype ........................ torch.float32 [2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] largest_numel ................ 103809024 [2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] min_aio_bytes ................ 1048576 [2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] numel_alignment .............. 256 [2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] swap_config .................. device='nvme' nvme_path=PosixPath('/nvme') buffer_count=4 pin_memory=True pipeline=False pipeline_read=False pipeline_write=False fast_init=False [2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] swap_element_size ............ 4 [2023-05-04 01:52:45,263] [INFO] [utils.py:34:print_object] swap_folder .................. /nvme/zero_stage_3/optimizer/rank0 [2023-05-04 01:52:45,528] [INFO] [utils.py:785:see_memory_usage] Before creating fp32 partitions [2023-05-04 01:52:45,529] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB [2023-05-04 01:52:45,529] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 97.61 GB, percent = 9.7% [2023-05-04 01:53:02,744] [INFO] [utils.py:785:see_memory_usage] After creating fp32 partitions [2023-05-04 01:53:02,745] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB [2023-05-04 01:53:02,745] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 96.41 GB, percent = 9.6% [2023-05-04 01:53:02,933] [INFO] [utils.py:785:see_memory_usage] Before initializing optimizer states [2023-05-04 01:53:02,934] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB [2023-05-04 01:53:02,934] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 99.09 GB, percent = 9.8% [2023-05-04 01:54:23,873] [INFO] [utils.py:785:see_memory_usage] After initializing optimizer states [2023-05-04 01:54:23,874] [INFO] [utils.py:786:see_memory_usage] MA 0.0 GB Max_MA 0.0 GB CA 0.29 GB Max_CA 0 GB [2023-05-04 01:54:23,874] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 99.36 GB, percent = 9.9% [2023-05-04 01:54:24,930] [INFO] [stage3.py:366:_setup_for_real_optimizer] optimizer state initialized Time to load utils op: 0.0007295608520507812 seconds Time to load utils op: 0.0007138252258300781 seconds Time to load utils op: 0.0008652210235595703 seconds Time to load utils op: 0.0006787776947021484 seconds Time to load utils op: 0.0007004737854003906 seconds Time to load utils op: 0.0007026195526123047 seconds Time to load utils op: 0.0013163089752197266 seconds [2023-05-04 01:54:39,709] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer 
[2023-05-04 01:54:39,710] [INFO] [utils.py:786:see_memory_usage] MA 0.17 GB Max_MA 0.74 GB CA 1.16 GB Max_CA 1 GB [2023-05-04 01:54:39,710] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory: used = 129.04 GB, percent = 12.8% [2023-05-04 01:54:39,711] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedCPUAdam [2023-05-04 01:54:39,711] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler [2023-05-04 01:54:39,711] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = <megatron.learning_rates.AnnealingLR object at 0x7f6b2a433e20> [2023-05-04 01:54:39,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[5.9999999999999995e-05, 5.9999999999999995e-05], mom=[(0.9, 0.999), (0.9, 0.999)] [2023-05-04 01:54:39,713] [INFO] [config.py:953:print] DeepSpeedEngine configuration: [2023-05-04 01:54:39,713] [INFO] [config.py:957:print] activation_checkpointing_config { "partition_activations": false, "contiguous_memory_optimization": false, "cpu_checkpointing": false, "number_checkpoints": null, "synchronize_checkpoint_boundary": false, "profile": false } [2023-05-04 01:54:39,713] [INFO] [config.py:957:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True} [2023-05-04 01:54:39,713] [INFO] [config.py:957:print] amp_enabled .................. False [2023-05-04 01:54:39,713] [INFO] [config.py:957:print] amp_params ................... False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] autotuning_config ............ { "enabled": false, "start_step": null, "end_step": null, "metric_path": null, "arg_mappings": null, "metric": "throughput", "model_info": null, "results_dir": "autotuning_results", "exps_dir": "autotuning_exps", "overwrite": true, "fast": true, "start_profile_step": 3, "end_profile_step": 5, "tuner_type": "gridsearch", "tuner_early_stopping": 5, "tuner_num_trials": 50, "model_info_path": null, "mp_size": 1, "max_train_batch_size": null, "min_train_batch_size": 1, "max_train_micro_batch_size_per_gpu": 1.024000e+03, "min_train_micro_batch_size_per_gpu": 1, "num_tuning_micro_batch_sizes": 3 } [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] bfloat16_enabled ............. False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] checkpoint_parallel_write_pipeline False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] checkpoint_tag_validation_enabled True [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] checkpoint_tag_validation_fail False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f6b1af9df70> [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] communication_data_type ...... None [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] compression_config ........... 
{'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}} [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] curriculum_enabled_legacy .... False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] curriculum_params_legacy ..... False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}} [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] data_efficiency_enabled ...... False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] dataloader_drop_last ......... False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] disable_allgather ............ False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] dump_state ................... False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] dynamic_loss_scale_args ...... {'init_scale': 4096, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1} [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_enabled ........... False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_gas_boundary_resolution 1 [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_layer_name ........ bert.encoder.layer [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_layer_num ......... 0 [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_max_iter .......... 100 [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_stability ......... 1e-06 [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_tol ............... 0.01 [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] eigenvalue_verbose ........... False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] elasticity_enabled ........... False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] flops_profiler_config ........ { "enabled": false, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null } [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] fp16_auto_cast ............... False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] fp16_enabled ................. True [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] fp16_master_weights_and_gradients False [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] global_rank .................. 
0 [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] grad_accum_dtype ............. None [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] gradient_accumulation_steps .. 1 [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] gradient_clipping ............ 0.0 [2023-05-04 01:54:39,714] [INFO] [config.py:957:print] gradient_predivide_factor .... 1.0 [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8 [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] initial_dynamic_scale ........ 4096 [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] load_universal_checkpoint .... False [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] loss_scale ................... 0 [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] memory_breakdown ............. False [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] nebula_config ................ { "enabled": false, "persistent_storage_path": null, "persistent_time_interval": 100, "num_of_version_in_retention": 2, "enable_nebula_load": true, "load_path": null } [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] optimizer_legacy_fusion ...... False [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] optimizer_name ............... None [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] optimizer_params ............. None [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0} [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] pld_enabled .................. False [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] pld_params ................... False [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] prescale_gradients ........... False [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] scheduler_name ............... None [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] scheduler_params ............. None [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] sparse_attention ............. None [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] sparse_gradients_enabled ..... False [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] steps_per_print .............. 1 [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] train_batch_size ............. 8 [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] train_micro_batch_size_per_gpu 1 [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] use_node_local_storage ....... False [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] wall_clock_breakdown ......... False [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] world_size ................... 8 [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] zero_allow_untested_optimizer False [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] zero_config .................. 
stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=90000000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='nvme', nvme_path=PosixPath('/nvme'), buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=True) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='nvme', nvme_path=PosixPath('/nvme'), buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=100000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=sys.maxsize max_live_parameters=3000000000 max_reuse_distance=3000000000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=True [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] zero_enabled ................. True [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] zero_force_ds_cpu_optimizer .. True [2023-05-04 01:54:39,715] [INFO] [config.py:957:print] zero_optimization_stage ...... 3 [2023-05-04 01:54:39,716] [INFO] [config.py:943:print_user_config] json = { "train_batch_size": 8, "train_micro_batch_size_per_gpu": 1, "steps_per_print": 1, "zero_optimization": { "stage": 3, "stage3_max_live_parameters": 3.000000e+09, "stage3_max_reuse_distance": 3.000000e+09, "stage3_param_persistence_threshold": 1.000000e+05, "stage3_prefetch_bucket_size": 5.000000e+07, "contiguous_gradients": true, "overlap_comm": true, "reduce_bucket_size": 9.000000e+07, "sub_group_size": 1.000000e+08, "offload_param": { "device": "nvme", "nvme_path": "/nvme", "pin_memory": true }, "offload_optimizer": { "device": "nvme", "pipeline_read": false, "pipeline_write": false, "nvme_path": "/nvme", "pin_memory": true } }, "fp16": { "enabled": true, "initial_scale_power": 12 } } Time to load utils op: 0.00044536590576171875 seconds [after model, optimizer, and learning rate scheduler are built] datetime: 2023-05-04 01:54:39 building train, validation, and test datasets ... datasets target sizes (minimum size): train: 8000 validation: 640 test: 320 building train, validation, and test datasets for GPT ... building dataset index ... reading sizes... reading pointers... reading document index... creating numpy buffer of mmap... creating memory view of numpy buffer... 
finished creating indexed dataset in 0.000360 seconds number of documents: 250000 dataset split: train: document indices in [0, 245000) total of 245000 documents validation: document indices in [245000, 250000) total of 5000 documents test: document indices in [250000, 250000) total of 0 documents NCCL version 2.15.5+cuda11.8 NCCL version 2.15.5+cuda11.8 NCCL version 2.15.5+cuda11.8 NCCL version 2.15.5+cuda11.8 NCCL version 2.15.5+cuda11.8 NCCL version 2.15.5+cuda11.8 NCCL version 2.15.5+cuda11.8 loading doc-idx mapping from ../dataset/my-gpt2_text_document_train_indexmap_8000ns_2048sl_1234s_doc_idx.npy loading sample-idx mapping from ../dataset/my-gpt2_text_document_train_indexmap_8000ns_2048sl_1234s_sample_idx.npy loading shuffle-idx mapping from ../dataset/my-gpt2_text_document_train_indexmap_8000ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.002 seconds total number of samples: 70128 total number of epochs: 1 loading doc-idx mapping from ../dataset/my-gpt2_text_document_valid_indexmap_640ns_2048sl_1234s_doc_idx.npy loading sample-idx mapping from ../dataset/my-gpt2_text_document_valid_indexmap_640ns_2048sl_1234s_sample_idx.npy loading shuffle-idx mapping from ../dataset/my-gpt2_text_document_valid_indexmap_640ns_2048sl_1234s_shuffle_idx.npy loaded indexed file in 0.001 seconds total number of samples: 1443 total number of epochs: 1 finished creating GPT datasets ... [after dataloaders are built] datetime: 2023-05-04 01:54:40 time (ms) | model-and-optimizer-setup: 157334.23 | train/valid/test-data-iterators-setup: 682.76 done with setup ... training ... [before the start of training step] datetime: 2023-05-04 01:54:40 [2023-05-04 01:54:40,562] [INFO] [checkpointing.py:529:forward] Activation Checkpointing Information [2023-05-04 01:54:40,562] [INFO] [checkpointing.py:530:forward] ----Partition Activations False, CPU CHECKPOINTING False [2023-05-04 01:54:40,562] [INFO] [checkpointing.py:531:forward] ----contiguous Memory Checkpointing False with 96 total layers [2023-05-04 01:54:40,562] [INFO] [checkpointing.py:533:forward] ----Synchronization False [2023-05-04 01:54:40,562] [INFO] [checkpointing.py:534:forward] ----Profiling time in checkpointing False /nvme/zero_stage_3/optimizer/rank4/140370293727776_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0 /nvme/zero_stage_3/optimizer/rank6/139716763042496_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0 /nvme/zero_stage_3/optimizer/rank5/140154453562048_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0 /nvme/zero_stage_3/optimizer/rank3/139926213575360_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0 /nvme/zero_stage_3/optimizer/rank0/140093515555200_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0 /nvme/zero_stage_3/optimizer/rank1/139730926629568_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0 /nvme/zero_stage_3/optimizer/rank7/139738534742752_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0 /nvme/zero_stage_3/optimizer/rank2/140168675173056_gradient_94420992_1179648.tensor.swp: buffer nbytes != file bytes 4718592 != 0 **

etoilestar commented 1 year ago

Also, sometimes the process freezes when running the same script.

tjruwase commented 1 year ago

@etoilestar, thanks for sharing your log. Can you please do the following:

  1. Share the stack trace of the failure if possible.
  2. Share the size of /nvme/zero_stage_3/optimizer/rank4/140370293727776_gradient_94420992_1179648.tensor.swp.
  3. Try a smaller model by reducing the number of layers from 96 to 8.
etoilestar commented 1 year ago

Hello, could you tell me how to get the stack trace? The size of this file is 0, and I just want to use the disk to train a larger model. Thanks.

tjruwase commented 1 year ago

The stack trace should be printed alongside the error message and shows the code path leading to the failure.

A file size of 0 means the previous file write (creation) failed. Can you try running with a smaller model as suggested?
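
One way to capture that (a sketch, assuming the single-node launch and the run_deepspeed_example.sh / /nvme paths mentioned earlier in this thread): enable Python's faulthandler so the aborting C++ assertion also dumps the Python call stack, keep the full output for sharing, and list the swap file sizes the assertion complains about.

```bash
# faulthandler prints Python stacks on SIGABRT, which the failing
# C++ assertion raises via abort(); capture everything to a log file.
export PYTHONFAULTHANDLER=1
bash run_deepspeed_example.sh 2>&1 | tee repro.log

# Inspect the optimizer swap files named in the error messages
# (a size of 0 means the earlier write never completed).
ls -l /nvme/zero_stage_3/optimizer/rank*/*gradient*.tensor.swp
```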

etoilestar commented 1 year ago

Yes, when I reduce the number of layers to 8, the program runs normally.

tjruwase commented 1 year ago

In that case, I am curious whether the failure is a filesystem problem, such as running out of disk space. How large is the offload folder?
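
A quick way to check both, using the paths from this thread (a sketch):

```bash
df -h /nvme                 # free space on the NVMe mount used for offload
du -sh /nvme/zero_stage_3   # current size of the ZeRO-3 swap folder
```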

etoilestar commented 1 year ago

It is around 10 TB. I guess this bug is caused by the NVMe not being as fast as expected.

tjruwase commented 1 year ago

Ideally, NVMe speed should affect throughput but not cause failures.

If you would like to continue this investigation can you please do the following?

  1. You can use the following to measure the nvme performance: https://github.com/microsoft/DeepSpeed/issues/998#issuecomment-836944772 (a generic fio-based sketch is also shown after this list).
  2. Can you use binary search to increase the number of layers from 8 until you hit the failure? Then we can debug from there.
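
For item 1, besides the DeepSpeed async I/O benchmark linked above, raw NVMe read throughput can be sanity-checked with fio. This is only a sketch: the 1 MiB block size and queue depth of 8 mirror the aio_config shown in the log, and the test file path is illustrative.

```bash
# Generic NVMe read-throughput check with fio; adjust the path to your mount.
fio --name=nvme-read --filename=/nvme/fio_testfile --size=4G --bs=1M \
    --rw=read --ioengine=libaio --iodepth=8 --direct=1 --numjobs=1
rm -f /nvme/fio_testfile   # clean up the test file afterwards
```
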
etoilestar commented 1 year ago

Okay, I will try again later.

etoilestar commented 1 year ago

There is another situation: when I increase buffer_count from 4 to 96, the size of the .swp file is not zero, yet the process freezes.

etoilestar commented 1 year ago

Maybe you can take it into consideration.

etoilestar commented 1 year ago

Hello, it seems that you have not finished the ViT model with PP/TP in https://github.com/microsoft/Megatron-DeepSpeed. I recently tried to write this code; can you give me some advice?

tjruwase commented 1 year ago

@etoilestar, apologies for the silence. Are you still interested in this issue? Thanks!

etoilestar commented 1 year ago

Thank you. I am now focusing on another part of your project, so I will close this issue.