[BUG] ValueError: optimizer got an empty parameter list

Mr-lonely0 commented 1 month ago

Describe the bug: I am using Llama-2 7B, and the error below occurs when I start stage 2 of EE-Tuning.

To Reproduce: here is my modified llama2_7B_1_exit_mlp_pt.sh:

#!/bin/bash

PROJECT_NAME=EE-TUNE
GROUP_NAME=llama-2-17B-chat-1-EXIT-pt

CURRENT_TIME=`date "+%m%d-%H%M"`

MASTER_NAME=${CURRENT_TIME}

export CUDA_DEVICE_MAX_CONNECTIONS=1
export OMP_NUM_THREADS=4

# Checkpoint configuration
LOAD_PATH=/data3/lk/EE-LLM/model/ee_llm_format/llama-2-7b-chat # your checkpoint path
TOKENIZER_PATH=/data3/lk/llm/model/Llama-2-7b-chat-hf/tokenizer.model # your tokenizer path
CHECKPOINT_PATH=/data3/lk/EE-LLM/model/checkpoints # checkpoint save path

# Data configuration
DATA_HOME=
DATASET_ARXIV=${DATA_HOME}/redpajama-arxiv/all
DATASET_BOOKS=${DATA_HOME}/redpajama-book/all
DATASET_C4=${DATA_HOME}/redpajama-c4/all
DATASET_CC=${DATA_HOME}/redpajama-cc/all
DATASET_STACKEXCHANGE=${DATA_HOME}/redpajama-pile-stackexchange/all
DATASET_CODE=${DATA_HOME}/redpajama-stack-code/all
DATASET_WIKIPEDIA=${DATA_HOME}/redpajama-wiki/all
DATASET_PILE_EUROPARL=${DATA_HOME}/the-pile-europarl/all
DATASET_PILE_FREELAW=${DATA_HOME}/the-pile-freelaw/all
DATASET_PILE_HACKERNEWS=${DATA_HOME}/the-pile-hackernews/all
DATASET_PILE_NIH=${DATA_HOME}/the-pile-nih/all
DATASET_PILE_PHILPAPER=${DATA_HOME}/the-pile-philpaper/all
DATASET_PILE_PMA=${DATA_HOME}/the-pile-pubmed-abstract/all
DATASET_PILE_PMC=${DATA_HOME}/the-pile-pubmed-central/all
DATASET_PILE_USPTO=${DATA_HOME}/the-pile-uspto/all

DATA_PATH="\
    0.0362 ${DATASET_ARXIV} \
    0.0657 ${DATASET_BOOKS} \
    0.2264 ${DATASET_C4} \
    0.4491 ${DATASET_CC} \
    0.0246 ${DATASET_STACKEXCHANGE} \
    0.0810 ${DATASET_CODE} \
    0.0548 ${DATASET_WIKIPEDIA} \
    0.0010 ${DATASET_PILE_EUROPARL} \
    0.0162 ${DATASET_PILE_FREELAW} \
    0.0006 ${DATASET_PILE_HACKERNEWS} \
    0.0005 ${DATASET_PILE_NIH} \
    0.0006 ${DATASET_PILE_PHILPAPER} \
    0.0065 ${DATASET_PILE_PMA} \
    0.0318 ${DATASET_PILE_PMC} \
    0.0050 ${DATASET_PILE_USPTO} \
"

NLAYERS=32
HIDDEN=4096
HEADS=32
SEQ=2048
FFN_SIZE=11008

TP=1
PP=4 # pipeline model parallel size

MICRO_BATCH=4 # micro-batch size
GLOBAL_BATCH=16

MASTER_ADDR=127.0.0.1
MASTER_PORT=5901
WORLD_SIZE=1
RANK=0
NPROC_PER_NODE=4 # number of processes per node

TRAIN_ITER=40000
EVAL_INTERVAL=50000
SAVE_INTERVAL=20000

DIST_ARGS="
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes $WORLD_SIZE \
    --node_rank $RANK \
    "

GPT_ARGS="
    --tensor-model-parallel-size $TP \
    --pipeline-model-parallel-size $PP \
    --query-key-layer-scaling \
    --num-layers $NLAYERS \
    --hidden-size $HIDDEN \
    --num-attention-heads $HEADS \
    --seq-length $SEQ \
    --max-position-embeddings $SEQ \
    --micro-batch-size $MICRO_BATCH \
    --global-batch-size $GLOBAL_BATCH \
    --lr 0.0001 \
    --train-iters $TRAIN_ITER \
    --min-lr 1.0e-5 \
    --lr-warmup-fraction .01 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --adam-eps 1e-5 \
    --clip-grad 1.0 \
    --bf16 \
    --disable-bias-linear \
    --use-flash-attn \
    --normalization RMSNorm \
    --position-embedding-type rope \
    --swiglu \
    --untie-embeddings-and-output-weights \
    --padded-vocab-size 32000 \
    --ffn-hidden-size $FFN_SIZE \
    --finetune \
    --tune-exit \
    --untie-exit-output-weights \
    --use-exit-norm \
    --use-exit-mlp \
    --tune-exit-pipeline-parallel-size 4 \
    --exit-layer-nums 10 \
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --tokenizer-type Llama2Tokenizer \
    --tokenizer-model $TOKENIZER_PATH \
    --split 990,9,1 \
"

# OUTPUT_ARGS_BAK="
#     --log-interval 10 \
#     --log-timers-to-tracker \
#     --save-interval $SAVE_INTERVAL \
#     --eval-interval $EVAL_INTERVAL \
#     --eval-iters 1 \
#     --wandb-project $PROJECT_NAME \
#     --wandb-group $GROUP_NAME \
#     --wandb-exp-name $MASTER_NAME \
# "

OUTPUT_ARGS="
    --log-interval 10 \
    --log-timers-to-tracker \
    --save-interval $SAVE_INTERVAL \
    --eval-interval $EVAL_INTERVAL \
    --eval-iters 1 \
"

CUR_DIR=$(cd $(dirname "$0") && pwd)
MEGATRON_ROOT_PATH=$(cd "$CUR_DIR/../../.." && pwd)
cd $MEGATRON_ROOT_PATH

torchrun $DIST_ARGS \
    pretrain_early_exit_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --load $LOAD_PATH \
    --save $CHECKPOINT_PATH


Stack trace/logs

Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
using world size: 4, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 4 
WARNING: overriding default arguments for tokenizer_type:SentencePieceTokenizer                        with tokenizer_type:Llama2Tokenizer
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. True
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.95
  adam_eps ........................................ 1e-05
  add_bias_linear ................................. False
  add_position_embedding .......................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... True
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  backward_forward_ratio .......................... 2.0
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ True
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ False
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. True
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 1
  data_path ....................................... ['0.0362', '/redpajama-arxiv/all', '0.0657', '/redpajama-book/all', '0.2264', '/redpajama-c4/all', '0.4491', '/redpajama-cc/all', '0.0246', '/redpajama-pile-stackexchange/all', '0.0810', '/redpajama-stack-code/all', '0.0548', '/redpajama-wiki/all', '0.0010', '/the-pile-europarl/all', '0.0162', '/the-pile-freelaw/all', '0.0006', '/the-pile-hackernews/all', '0.0005', '/the-pile-nih/all', '0.0006', '/the-pile-philpaper/all', '0.0065', '/the-pile-pubmed-abstract/all', '0.0318', '/the-pile-pubmed-central/all', '0.0050', '/the-pile-uspto/all']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  delay_grad_reduce ............................... True
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 32
  encoder_seq_length .............................. 2048
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  eval_interval ................................... 50000
  eval_iters ...................................... 1
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_layer_nums ................................. [10]
  exit_layer_temperature .......................... [1.0]
  exit_layer_weight ............................... [1.0]
  exit_layer_weight_init .......................... [0.0]
  exit_layer_weight_warmup_iters .................. 0
  exit_layer_weight_warmup_style .................. linear
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 1
  expert_parallel ................................. False
  ffn_hidden_size ................................. 11008
  fill_explicit_bubbles ........................... False
  finetune ........................................ True
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 16
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 4096
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 128
  lazy_mpu_init ................................... None
  load ............................................ /data3/lk/EE-LLM/model/ee_llm_format/llama-2-7b-chat
  load_iteration .................................. 0
  local_rank ...................................... None
  log_batch_size_to_tracker ....................... False
  log_interval .................................... 10
  log_learning_rate_to_tracker .................... True
  log_loss_scale_to_tracker ....................... True
  log_memory_to_tracker ........................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_timers_to_tracker ........................... True
  log_validation_ppl_to_tracker ................... False
  log_world_size_to_tracker ....................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0001
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. linear
  lr_warmup_fraction .............................. 0.01
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 2048
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 4
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-05
  mmap_warmup ..................................... False
  model_spec ...................................... None
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... RMSNorm
  num_attention_heads ............................. 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... None
  num_fill_cooldown_microbatches .................. None
  num_fill_warmup_microbatches .................... None
  num_layers ...................................... 32
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  override_opt_param_scheduler .................... False
  padded_vocab_size ............................... 32000
  params_dtype .................................... torch.bfloat16
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 4
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... rope
  pre_exit ........................................ False
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_return_doc_ids ............................ False
  retro_workdir ................................... None
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ /data3/lk/EE-LLM/model/checkpoints
  save_interval ................................... 20000
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 2048
  sequence_parallel ............................... False
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  split ........................................... 990,9,1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  swiglu .......................................... True
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. /data3/lk/llm/model/Llama-2-7b-chat-hf/tokenizer.model
  tokenizer_type .................................. Llama2Tokenizer
  tracker_log_interval ............................ 1
  train_data_path ................................. None
  train_iters ..................................... 40000
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 4
  tune_exit ....................................... True
  tune_exit_pipeline_parallel_size ................ 4
  untie_embeddings_and_output_weights ............. True
  untie_exit_output_weights ....................... True
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_distributed_optimizer ....................... False
  use_dynamic_exit_layer_weight ................... False
  use_exit_block .................................. False
  use_exit_mlp .................................... True
  use_exit_norm ................................... True
  use_flash_attn .................................. True
  use_mcore_models ................................ False
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... None
  wandb_exp_name .................................. default
  wandb_group ..................................... None
  wandb_project ................................... None
  wandb_save_dir .................................. 
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 4
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 4
> building Llama2Tokenizer tokenizer ...
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 4
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/data3/lk/EE-LLM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/data3/lk/EE-LLM/megatron/data'
>>> done with dataset index builder. Compilation time: 0.070 seconds
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 1.430 seconds
time to initialize megatron (seconds): 3.385
[after megatron is initialized] datetime: 2024-06-07 22:45:59 
building EarlyExitGPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 1750138880
Traceback (most recent call last):
  File "pretrain_early_exit_gpt.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/data3/lk/EE-LLM/megatron/training.py", line 118, in pretrain
    model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
  File "/data3/lk/EE-LLM/megatron/training.py", line 403, in setup_model_and_optimizer
    optimizer = get_megatron_optimizer(model, no_wd_decay_cond,
  File "/data3/lk/EE-LLM/megatron/optimizer/__init__.py", line 75, in get_megatron_optimizer
    optimizer = Adam(param_groups,
  File "/data3/lk/miniconda3/envs/ee_llm/lib/python3.8/site-packages/apex/optimizers/fused_adam.py", line 71, in __init__
    super(FusedAdam, self).__init__(params, defaults)
  File "/data3/lk/miniconda3/envs/ee_llm/lib/python3.8/site-packages/torch/optim/optimizer.py", line 61, in __init__
    raise ValueError("optimizer got an empty parameter list")
ValueError: optimizer got an empty parameter list
 > number of parameters on (tensor, pipeline) model parallel rank (0, 2): 1619066880
 > number of parameters on (tensor, pipeline) model parallel rank (0, 3): 1750142976
Traceback (most recent call last):
  File "pretrain_early_exit_gpt.py", line 119, in <module>
    pretrain(train_valid_test_datasets_provider,
  File "/data3/lk/EE-LLM/megatron/training.py", line 118, in pretrain
    model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
  File "/data3/lk/EE-LLM/megatron/training.py", line 403, in setup_model_and_optimizer
Traceback (most recent call last):
  File "pretrain_early_exit_gpt.py", line 119, in <module>
    optimizer = get_megatron_optimizer(model, no_wd_decay_cond,
  File "/data3/lk/EE-LLM/megatron/optimizer/__init__.py", line 75, in get_megatron_optimizer
    pretrain(train_valid_test_datasets_provider,
  File "/data3/lk/EE-LLM/megatron/training.py", line 118, in pretrain
    optimizer = Adam(param_groups,
  File "/data3/lk/miniconda3/envs/ee_llm/lib/python3.8/site-packages/apex/optimizers/fused_adam.py", line 71, in __init__
    super(FusedAdam, self).__init__(params, defaults)    
model, optimizer, opt_param_scheduler = setup_model_and_optimizer(
  File "/data3/lk/miniconda3/envs/ee_llm/lib/python3.8/site-packages/torch/optim/optimizer.py", line 61, in __init__
  File "/data3/lk/EE-LLM/megatron/training.py", line 403, in setup_model_and_optimizer
    raise ValueError("optimizer got an empty parameter list")
ValueError: optimizer got an empty parameter list
    optimizer = get_megatron_optimizer(model, no_wd_decay_cond,
  File "/data3/lk/EE-LLM/megatron/optimizer/__init__.py", line 75, in get_megatron_optimizer
    optimizer = Adam(param_groups,
  File "/data3/lk/miniconda3/envs/ee_llm/lib/python3.8/site-packages/apex/optimizers/fused_adam.py", line 71, in __init__
    super(FusedAdam, self).__init__(params, defaults)
  File "/data3/lk/miniconda3/envs/ee_llm/lib/python3.8/site-packages/torch/optim/optimizer.py", line 61, in __init__
    raise ValueError("optimizer got an empty parameter list")
ValueError: optimizer got an empty parameter list
 > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 1885409280
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 7708 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7707) of binary: /data3/lk/miniconda3/envs/ee_llm/bin/python
Traceback (most recent call last):
  File "/data3/lk/miniconda3/envs/ee_llm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data3/lk/miniconda3/envs/ee_llm/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data3/lk/miniconda3/envs/ee_llm/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/data3/lk/miniconda3/envs/ee_llm/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data3/lk/miniconda3/envs/ee_llm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data3/lk/miniconda3/envs/ee_llm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretrain_early_exit_gpt.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-07_22:46:01
  host      : adminn
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 7709)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-06-07_22:46:01
  host      : adminn
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 7710)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-07_22:46:01
  host      : adminn
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7707)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Environment (please complete the following information):

Megatron-LM commit ID
PyTorch 1.13.0+cu117
CUDA 11.7
NCCL version

pan-x-c commented 1 month ago

Since the final EE layer is located in Stage 2, subsequent pipeline stages do not contain an EE layer, hence there are no parameters to optimize in these later stages. You only need to set tune_exit_pipeline_parallel_size to 2 to address this issue.

pan-x-c commented 1 month ago

Additionally, bear in mind that after fine-tuning with the aforementioned approach, your final output files will only contain parameters from the first two pipeline stages. You will need to manually merge the parameter folders of the last two pipeline stages from the original checkpoint path with the folders of the first two stages generated by the fine-tuning process to obtain a complete checkpoint.
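
The manual merge described above could be scripted roughly as follows. This is only a sketch: `merge_checkpoints` is a hypothetical helper, not part of EE-LLM, and it assumes each checkpoint directory holds one sub-folder per pipeline stage, so that any folder missing from the fine-tuned output can be taken from the original checkpoint. Please check the actual folder layout of your checkpoints before relying on it.

```shell
#!/bin/bash
# Hypothetical helper (not part of EE-LLM).
# Usage: merge_checkpoints ORIG_DIR TUNED_DIR OUT_DIR
# Assumption: ORIG_DIR and TUNED_DIR each contain one sub-folder per pipeline stage.
merge_checkpoints() {
    local orig_dir=$1 tuned_dir=$2 out_dir=$3
    mkdir -p "$out_dir"

    # Start from the fine-tuned stages, which hold the trained exit layers.
    cp -r "$tuned_dir"/. "$out_dir"/

    # Copy every stage folder that fine-tuning did not write from the original.
    local src name
    for src in "$orig_dir"/*/; do
        [ -d "$src" ] || continue      # glob matched nothing
        src=${src%/}                   # drop the trailing slash
        name=$(basename "$src")
        if [ ! -e "$out_dir/$name" ]; then
            cp -r "$src" "$out_dir/$name"
        fi
    done
}
```

Folders that exist in both places are taken from the fine-tuned output and never overwritten, so the tuned stages keep their new exit-layer parameters.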

Mr-lonely0 commented 1 month ago

I modified the script llama2_7B_1_exit_mlp_pt.sh as you suggested (setting tune_exit_pipeline_parallel_size to 2), but I still encounter the same error. Could you provide more details?

Looking forward to your reply :)

#!/bin/bash

PROJECT_NAME=EE-TUNE
GROUP_NAME=llama-2-17B-chat-1-EXIT-pt

CURRENT_TIME=`date "+%m%d-%H%M"`

MASTER_NAME=${CURRENT_TIME}

export CUDA_DEVICE_MAX_CONNECTIONS=1
export OMP_NUM_THREADS=4

# Checkpoint configuration
LOAD_PATH=/data3/lk/EE-LLM/model/ee_llm_format/llama-2-7b-chat # your checkpoint path
TOKENIZER_PATH=/data3/lk/llm/model/Llama-2-7b-chat-hf/tokenizer.model # your tokenizer path
CHECKPOINT_PATH=/data3/lk/EE-LLM/model/checkpoints # checkpoint save path

# Data configuration
DATA_HOME=
DATASET_ARXIV=${DATA_HOME}/redpajama-arxiv/all
DATASET_BOOKS=${DATA_HOME}/redpajama-book/all
DATASET_C4=${DATA_HOME}/redpajama-c4/all
DATASET_CC=${DATA_HOME}/redpajama-cc/all
DATASET_STACKEXCHANGE=${DATA_HOME}/redpajama-pile-stackexchange/all
DATASET_CODE=${DATA_HOME}/redpajama-stack-code/all
DATASET_WIKIPEDIA=${DATA_HOME}/redpajama-wiki/all
DATASET_PILE_EUROPARL=${DATA_HOME}/the-pile-europarl/all
DATASET_PILE_FREELAW=${DATA_HOME}/the-pile-freelaw/all
DATASET_PILE_HACKERNEWS=${DATA_HOME}/the-pile-hackernews/all
DATASET_PILE_NIH=${DATA_HOME}/the-pile-nih/all
DATASET_PILE_PHILPAPER=${DATA_HOME}/the-pile-philpaper/all
DATASET_PILE_PMA=${DATA_HOME}/the-pile-pubmed-abstract/all
DATASET_PILE_PMC=${DATA_HOME}/the-pile-pubmed-central/all
DATASET_PILE_USPTO=${DATA_HOME}/the-pile-uspto/all

DATA_PATH="\
    0.0362 ${DATASET_ARXIV} \
    0.0657 ${DATASET_BOOKS} \
    0.2264 ${DATASET_C4} \
    0.4491 ${DATASET_CC} \
    0.0246 ${DATASET_STACKEXCHANGE} \
    0.0810 ${DATASET_CODE} \
    0.0548 ${DATASET_WIKIPEDIA} \
    0.0010 ${DATASET_PILE_EUROPARL} \
    0.0162 ${DATASET_PILE_FREELAW} \
    0.0006 ${DATASET_PILE_HACKERNEWS} \
    0.0005 ${DATASET_PILE_NIH} \
    0.0006 ${DATASET_PILE_PHILPAPER} \
    0.0065 ${DATASET_PILE_PMA} \
    0.0318 ${DATASET_PILE_PMC} \
    0.0050 ${DATASET_PILE_USPTO} \
"

NLAYERS=32
HIDDEN=4096
HEADS=32
SEQ=2048
FFN_SIZE=11008

TP=1
PP=4 # pipeline model parallel size

MICRO_BATCH=4 # micro-batch size
GLOBAL_BATCH=16

MASTER_ADDR=127.0.0.1
MASTER_PORT=5901
WORLD_SIZE=1
RANK=0
NPROC_PER_NODE=4 # number of processes per node

TRAIN_ITER=40000
EVAL_INTERVAL=50000
SAVE_INTERVAL=20000

DIST_ARGS="
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes $WORLD_SIZE \
    --node_rank $RANK \
    "

GPT_ARGS="
    --tensor-model-parallel-size $TP \
    --pipeline-model-parallel-size $PP \
    --query-key-layer-scaling \
    --num-layers $NLAYERS \
    --hidden-size $HIDDEN \
    --num-attention-heads $HEADS \
    --seq-length $SEQ \
    --max-position-embeddings $SEQ \
    --micro-batch-size $MICRO_BATCH \
    --global-batch-size $GLOBAL_BATCH \
    --lr 0.0001 \
    --train-iters $TRAIN_ITER \
    --min-lr 1.0e-5 \
    --lr-warmup-fraction .01 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --adam-eps 1e-5 \
    --clip-grad 1.0 \
    --bf16 \
    --disable-bias-linear \
    --use-flash-attn \
    --normalization RMSNorm \
    --position-embedding-type rope \
    --swiglu \
    --untie-embeddings-and-output-weights \
    --padded-vocab-size 32000 \
    --ffn-hidden-size $FFN_SIZE \
    --finetune \
    --tune-exit \
    --untie-exit-output-weights \
    --use-exit-norm \
    --use-exit-mlp \
    --tune-exit-pipeline-parallel-size 2 \
    --exit-layer-nums 10 \
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --tokenizer-type Llama2Tokenizer \
    --tokenizer-model $TOKENIZER_PATH \
    --split 990,9,1 \
"

# OUTPUT_ARGS_BAK="
#     --log-interval 10 \
#     --log-timers-to-tracker \
#     --save-interval $SAVE_INTERVAL \
#     --eval-interval $EVAL_INTERVAL \
#     --eval-iters 1 \
#     --wandb-project $PROJECT_NAME \
#     --wandb-group $GROUP_NAME \
#     --wandb-exp-name $MASTER_NAME \
# "

OUTPUT_ARGS="
    --log-interval 10 \
    --log-timers-to-tracker \
    --save-interval $SAVE_INTERVAL \
    --eval-interval $EVAL_INTERVAL \
    --eval-iters 1 \
"

CUR_DIR=$(cd $(dirname "$0") && pwd)
MEGATRON_ROOT_PATH=$(cd "$CUR_DIR/../../.." && pwd)
cd $MEGATRON_ROOT_PATH

torchrun $DIST_ARGS \
    pretrain_early_exit_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --load $LOAD_PATH \
    --save $CHECKPOINT_PATH

pan-x-c commented 1 month ago

After investigation, this is indeed a bug, and we will address it in a future update. The bug arises with --exit-layer-nums 10 because only Stage 2 contains optimizable EE parameters, while the other pipeline stages contain none. The error still occurs when --tune-exit-pipeline-parallel-size 2 is set, because Stage 1 still has no optimizable EE parameters. There are two possible solutions:

Solution 1: Ensure that every tuned pipeline stage has at least one EE layer, for example by adding EE layers at both layer 8 and layer 16 and setting --tune-exit-pipeline-parallel-size 2.
Solution 2: Keep the existing EE layer but convert the checkpoint's parallelism to PP=2 and set --tune-exit-pipeline-parallel-size 1, so that every tuned pipeline stage has optimizable EE layer parameters.
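
To check a configuration before launching, the stage arithmetic behind the two solutions can be sketched as below. `stages_without_exit` is a hypothetical helper, not part of EE-LLM; it assumes exit layer numbers start from 1 and that layers are split evenly across pipeline stages, which matches the stage numbers quoted in this thread.

```shell
#!/bin/bash
# Hypothetical helper (not part of EE-LLM).
# Usage: stages_without_exit NUM_LAYERS PP_SIZE TUNED_STAGES EXIT_LAYER...
# Prints the 1-based indices of tuned pipeline stages that hold no exit layer.
# Assumptions: layers are numbered from 1 and split evenly across stages.
stages_without_exit() {
    local num_layers=$1 pp_size=$2 tuned_stages=$3
    shift 3
    local layers_per_stage=$(( num_layers / pp_size ))
    local missing=()
    local stage layer found
    for (( stage = 1; stage <= tuned_stages; stage++ )); do
        found=0
        for layer in "$@"; do
            # 1-based index of the stage that owns this layer
            if (( (layer - 1) / layers_per_stage + 1 == stage )); then
                found=1
            fi
        done
        (( found )) || missing+=("$stage")
    done
    echo "${missing[*]}"
}

stages_without_exit 32 4 4 10     # original script: prints "1 3 4"
stages_without_exit 32 4 2 10     # first suggested fix: prints "1"
stages_without_exit 32 4 2 8 16   # Solution 1: prints an empty line
stages_without_exit 32 2 1 10     # Solution 2: prints an empty line
```

Any stage index that is printed would hand the optimizer an empty parameter list, which is the error reported here. An empty line means every tuned stage has at least one EE layer.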

Mr-lonely0 commented 1 month ago

Thanks a lot!! I'll try tuning multiple EE layers instead of just one, and I'm looking forward to your future updates. Your work is truly commendable!

pan-x-c / EE-LLM

[BUG] ValueError: optimizer got an empty parameter list #12