[BUG] AssertionError: causal mask is only for self attention

The error message does not appear to be related to EE-LLM, but rather seems to be caused by the environment. My inference server can generate content normally without any errors.

The startup log of my server is as follows:

Zarr-based strategies will not be registered because of missing packages
load checkpoint args
Setting num_layers to 32 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 10880 from checkpoint
Setting seq_length to 2048 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting num_query_groups to 1 from checkpoint
Setting group_query_attention to False from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 2048 from checkpoint
Setting position_embedding_type to rope from checkpoint
Setting add_position_embedding to False from checkpoint
Setting use_rotary_position_embeddings to False from checkpoint
Setting rotary_percent to 1.0 from checkpoint
Setting add_bias_linear to False from checkpoint
Setting swiglu to True from checkpoint
Setting untie_embeddings_and_output_weights to True from checkpoint
Setting apply_layernorm_1p to False from checkpoint
Setting normalization to RMSNorm from checkpoint
Setting padded_vocab_size to 32128 from checkpoint
Setting make_vocab_size_divisible_by to 128 from checkpoint
Setting exit_layer_nums to [9, 17] from checkpoint
Setting exit_layer_weight to [0.1, 0.2] from checkpoint
Setting use_exit_mlp to False from checkpoint
Setting use_exit_block to False from checkpoint
Setting use_exit_norm to False from checkpoint
Setting untie_exit_output_weights to True from checkpoint
Setting pre_exit to True from checkpoint
Setting tensor_model_parallel_size to 1 from checkpoint
Setting pipeline_model_parallel_size to 1 from checkpoint
Checkpoint did not provide arguments virtual_pipeline_model_parallel_size
Checkpoint did not provide arguments num_layers_per_virtual_pipeline_stage
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:SentencePieceTokenizer
setting global batch size to 1
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. False
  add_position_embedding .......................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... True
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  backward_forward_ratio .......................... 2.0
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ False
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. True
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 1
  data_path ....................................... None
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  delay_grad_reduce ............................... True
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 32
  encoder_seq_length .............................. 2048
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 100
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_layer_nums ................................. [9, 17]
  exit_layer_temperature .......................... [1.0, 1.0]
  exit_layer_weight ............................... [0.1, 0.2]
  exit_layer_weight_init .......................... [0.0, 0.0]
  exit_layer_weight_warmup_iters .................. 0
  exit_layer_weight_warmup_style .................. linear
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 1
  expert_parallel ................................. False
  ffn_hidden_size ................................. 10880
  fill_explicit_bubbles ........................... False
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 1
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 4096
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  iteration ....................................... 36000
  kv_channels ..................................... 128
  lazy_mpu_init ................................... None
  load ............................................ /home/data/shared/checkpoints/EE-LLM-release/EE-LLM-7B-dj-refine-150B/convert-1
  load_iteration .................................. 0
  local_rank ...................................... None
  log_batch_size_to_tracker ....................... False
  log_interval .................................... 100
  log_learning_rate_to_tracker .................... True
  log_loss_scale_to_tracker ....................... True
  log_memory_to_tracker ........................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_timers_to_tracker ........................... False
  log_validation_ppl_to_tracker ................... False
  log_world_size_to_tracker ....................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. None
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. linear
  lr_warmup_fraction .............................. None
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 2048
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 0.0
  mmap_warmup ..................................... False
  model_spec ...................................... None
  no_load_optim ................................... True
  no_load_rng ..................................... True
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... RMSNorm
  num_attention_heads ............................. 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... None
  num_fill_cooldown_microbatches .................. None
  num_fill_warmup_microbatches .................... None
  num_layers ...................................... 32
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  override_opt_param_scheduler .................... False
  padded_vocab_size ............................... 32128
  params_dtype .................................... torch.float32
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  port ............................................ 5000
  position_embedding_type ......................... rope
  pre_exit ........................................ True
  profile ......................................... False
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_return_doc_ids ............................ False
  retro_workdir ................................... None
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... None
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 2048
  sequence_parallel ............................... False
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  split ........................................... 969, 30, 1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  swiglu .......................................... True
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. /home/data/panxuchen.pxc/code/Megatron-LM/tokenizer/tokenizer.model
  tokenizer_type .................................. SentencePieceTokenizer
  tracker_log_interval ............................ 1
  train_data_path ................................. None
  train_iters ..................................... None
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  tune_exit ....................................... False
  tune_exit_pipeline_parallel_size ................ 1
  untie_embeddings_and_output_weights ............. True
  untie_exit_output_weights ....................... True
  use_checkpoint_args ............................. True
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_distributed_optimizer ....................... False
  use_dynamic_exit_layer_weight ................... False
  use_exit_block .................................. False
  use_exit_mlp .................................... False
  use_exit_norm ................................... False
  use_flash_attn .................................. False
  use_mcore_models ................................ False
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... None
  wandb_exp_name .................................. default
  wandb_group ..................................... None
  wandb_project ................................... None
  wandb_save_dir .................................. 
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 1
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
> building SentencePieceTokenizer tokenizer ...
> initializing torch distributed ...
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/mnt/data/panxuchen.pxc/dev/Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/mnt/data/panxuchen.pxc/dev/Megatron-LM/megatron/data'
>>> done with dataset index builder. Compilation time: 0.037 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
NCCL version 2.15.5+cuda11.8
>>> done with compiling and loading fused kernels. Compilation time: 0.266 seconds
WARNING: Forcing exit_on_missing_checkpoint to True for text generation.
building EarlyExitGPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 6952325120
 loading checkpoint from /home/data/shared/checkpoints/EE-LLM-release/EE-LLM-7B-dj-refine-150B/convert-1 at iteration 36000
 checkpoint version 3.0
  successfully loaded checkpoint from /home/data/shared/checkpoints/EE-LLM-release/EE-LLM-7B-dj-refine-150B/convert-1 at iteration 36000
 * Serving Flask app 'megatron.early_exit_text_generation_server'
 * Debug mode: off

The specific prompt request and the corresponding response log are as follows:

request IP: 127.0.0.1
{"prompts": ["translate the sentences to English.\n\nexample: \n\nsource: \n\n\u897f\u7c73\u8bfa\u592b\u8bf4\uff0c2013 \u5e74\u4ed6\u5728\u300a\u521b\u667a\u8d62\u5bb6\u300b\u8282\u76ee\u4e2d\u9732\u9762\u540e\uff0c\u516c\u53f8\u7684\u9500\u552e\u989d\u5927\u589e\uff0c\u5f53\u65f6\u8282\u76ee\u7ec4\u62d2\u7edd\u5411\u8fd9\u5bb6\u521d\u521b\u516c\u53f8\u6295\u8d44\u3002\n\ntarget: \n\nSiminoff said sales boosted after his 2013 appearance in a Shark Tank episode where the show panel declined funding the startup.\n\nsource: \n\n2017 \u5e74\u5e74\u672b\uff0c\u897f\u7c73\u8bfa\u592b\u51fa\u73b0\u5728 QVC \u7535\u89c6\u9500\u552e\u9891\u9053\u3002\n\ntarget: \n\nIn late 2017, Siminoff appeared on shopping television channel QVC.\n\nsource: \n\n\u94c3\u58f0 (Ring) \u516c\u53f8\u8fd8\u4e0e\u7ade\u4e89\u5bf9\u624b ADT \u5b89\u4fdd\u516c\u53f8\u5728\u4e00\u8d77\u5b98\u53f8\u4e2d\u8fbe\u6210\u4e86\u5ead\u5916\u548c\u89e3\u3002\n\ntarget: \n\nRing also settled a lawsuit with competing security company, the ADT Corporation.\n\ntranslate the following sentences:\n\nsource: \n\n\u4ed6\u8865\u5145\u9053\uff1a\u201c\u6211\u4eec\u73b0\u5728\u6709 4 \u4e2a\u6708\u5927\u6ca1\u6709\u7cd6\u5c3f\u75c5\u7684\u8001\u9f20\uff0c\u4f46\u5b83\u4eec\u66fe\u7ecf\u5f97\u8fc7\u8be5\u75c5\u3002\u201d\n\ntarget: \n\n"], "tokens_to_generate": 200, "top_k": 1, "logprobs": true, "random_seed": 9958, "echo_prompts": false, "early_exit_thres": 0.2, "exit_layers": [], "use_early_exit": true, "print_max_prob": false, "top_p": 0, "top_p_decay": 0.0, "top_p_bound": 0.0, "temperature": 0.0, "add_BOS": false, "stop_sequences": null, "prevent_newline_after_colon": false, "length_penalty": 1}
start time:  2024-06-08 10:26:42.342346
Response(use 3.3466479778289795s): ['The company has been dealing with a series of respiratory illnesses in recent years.\n\ntranslate the following sentences:\n\nsource: \n\n他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充道果 他补充']

The error might be related to flash-attention or PyTorch, because the version of Pytorch you are using is quite high, whereas EE-LLM was developed on a relatively older version. I recommend trying out the docker image suggested in the README (nvcr.io/nvidia/pytorch:22.12-py3) to see if it resolves the issue.

pan-x-c / EE-LLM

[BUG] AssertionError: causal mask is only for self attention #13