princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Training starts but nothing continues #53

Closed: logan-zou closed this issue 6 months ago

logan-zou commented 7 months ago

Thanks for your contribution! When I tried to replicate the pruning in this project, I ran into a problem. I ran pruning.sh and the script started successfully, but after it printed 'Starting training...' nothing further happened: my GPUs were not being utilized, yet there was no error message. I waited for several hours with no change. Could you help me figure out this bug? Below is my output:


===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin python/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary python/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin python/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary python/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Start running 
Initializing model...
Tried to build Llama model with cfg.name=mosaic_llama2_7b
********** Initializing L0 Module **********
***** head *****
z.shape torch.Size([32, 32])
size 32
***** intermediate *****
z.shape torch.Size([32, 11008])
size 11008
***** layer *****
z.shape torch.Size([32])
size 32
***** hidden *****
z.shape torch.Size([4096])
size 4096
prunable model size: 6476005376
ComposerMosaicLlama(
  (model): LlamaModel(
    (l0_module): L0Module(
      (masks): ModuleDict(
        (head): Mask()
        (intermediate): Mask()
        (layer): Mask()
        (hidden): Mask()
      )
      (lambdas): ParameterDict(
          (lambda_1): Parameter containing: [torch.meta.FloatTensor of size ]
          (lambda_1_head): Parameter containing: [torch.meta.FloatTensor of size ]
          (lambda_1_hidden): Parameter containing: [torch.meta.FloatTensor of size ]
          (lambda_1_intermediate): Parameter containing: [torch.meta.FloatTensor of size ]
          (lambda_1_layer): Parameter containing: [torch.meta.FloatTensor of size ]
          (lambda_2): Parameter containing: [torch.meta.FloatTensor of size ]
          (lambda_2_head): Parameter containing: [torch.meta.FloatTensor of size ]
          (lambda_2_hidden): Parameter containing: [torch.meta.FloatTensor of size ]
          (lambda_2_intermediate): Parameter containing: [torch.meta.FloatTensor of size ]
          (lambda_2_layer): Parameter containing: [torch.meta.FloatTensor of size ]
      )
    )
    (transformer): ModuleDict(
      (wte): Embedding(32000, 4096)
      (blocks): ModuleList(
        (0-31): 32 x LlamaBlock(
          (ln_1): LlamaRMSNorm()
          (attn): LlamaAttention(
            (wq): Linear(in_features=4096, out_features=4096, bias=False)
            (wk): Linear(in_features=4096, out_features=4096, bias=False)
            (wv): Linear(in_features=4096, out_features=4096, bias=False)
            (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (rotary_emb): LlamaRotaryEmbedding()
          )
          (ln_2): LlamaRMSNorm()
          (mlp): LlamaMLP(
            (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
            (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
            (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          )
        )
      )
      (ln_f): LlamaRMSNorm()
      (output): Linear(in_features=4096, out_features=32000, bias=False)
    )
  )
)
{'start_sparsity': 0.0, 'target_sparsity': 0.5, 'pruning_modules': ['head', 'intermediate', 'layer', 'hidden'], 'lagrangian_warmup_steps': '640ba', 'target_model': {'d_model': 2560, 'n_layers': 24, 'n_heads': 20, 'intermediate_size': 6912, 'vocab_size': 32000}, 'eval_target_model': False}
Loaded model from path: LLM-Shearing/models/Llama-2-7b-composer/state_dict.pt
Model load state dict result:  _IncompatibleKeys(missing_keys=['model.l0_module.masks.head.z_loga', 'model.l0_module.masks.intermediate.z_loga', 'model.l0_module.masks.layer.z_loga', 'model.l0_module.masks.hidden.z_loga', 'model.l0_module.lambdas.lambda_1', 'model.l0_module.lambdas.lambda_1_head', 'model.l0_module.lambdas.lambda_1_hidden', 'model.l0_module.lambdas.lambda_1_intermediate', 'model.l0_module.lambdas.lambda_1_layer', 'model.l0_module.lambdas.lambda_2', 'model.l0_module.lambdas.lambda_2_head', 'model.l0_module.lambdas.lambda_2_hidden', 'model.l0_module.lambdas.lambda_2_intermediate', 'model.l0_module.lambdas.lambda_2_layer', 'model.transformer.blocks.0.attn.rotary_emb.inv_freq', 'model.transformer.blocks.1.attn.rotary_emb.inv_freq', 'model.transformer.blocks.2.attn.rotary_emb.inv_freq', 'model.transformer.blocks.3.attn.rotary_emb.inv_freq', 'model.transformer.blocks.4.attn.rotary_emb.inv_freq', 'model.transformer.blocks.5.attn.rotary_emb.inv_freq', 'model.transformer.blocks.6.attn.rotary_emb.inv_freq', 'model.transformer.blocks.7.attn.rotary_emb.inv_freq', 'model.transformer.blocks.8.attn.rotary_emb.inv_freq', 'model.transformer.blocks.9.attn.rotary_emb.inv_freq', 'model.transformer.blocks.10.attn.rotary_emb.inv_freq', 'model.transformer.blocks.11.attn.rotary_emb.inv_freq', 'model.transformer.blocks.12.attn.rotary_emb.inv_freq', 'model.transformer.blocks.13.attn.rotary_emb.inv_freq', 'model.transformer.blocks.14.attn.rotary_emb.inv_freq', 'model.transformer.blocks.15.attn.rotary_emb.inv_freq', 'model.transformer.blocks.16.attn.rotary_emb.inv_freq', 'model.transformer.blocks.17.attn.rotary_emb.inv_freq', 'model.transformer.blocks.18.attn.rotary_emb.inv_freq', 'model.transformer.blocks.19.attn.rotary_emb.inv_freq', 'model.transformer.blocks.20.attn.rotary_emb.inv_freq', 'model.transformer.blocks.21.attn.rotary_emb.inv_freq', 'model.transformer.blocks.22.attn.rotary_emb.inv_freq', 'model.transformer.blocks.23.attn.rotary_emb.inv_freq', 'model.transformer.blocks.24.attn.rotary_emb.inv_freq', 'model.transformer.blocks.25.attn.rotary_emb.inv_freq', 'model.transformer.blocks.26.attn.rotary_emb.inv_freq', 'model.transformer.blocks.27.attn.rotary_emb.inv_freq', 'model.transformer.blocks.28.attn.rotary_emb.inv_freq', 'model.transformer.blocks.29.attn.rotary_emb.inv_freq', 'model.transformer.blocks.30.attn.rotary_emb.inv_freq', 'model.transformer.blocks.31.attn.rotary_emb.inv_freq'], unexpected_keys=[])
Having missing rotary_emb.inv_freq keys is normal
cfg.n_params=6.74e+09
model.num_fwd_flops=6.40e+13
Building train loader...
NCCL version 2.14.3+cuda11.7
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Building eval loader...
Group 0: 291 tensors 6738415616 params 1.00e-04 lr
Group 1: 4 tensors 357408 params 1.00e+00 lr
Group 2: 10 tensors 10 params -1.00e+00 lr
Target loss: [1.8712, 0.6883, 2.0325, 1.5353, 1.6297, 1.356, 2.0328]
Building trainer...
wandb: Tracking run with wandb version 0.15.12
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
wandb: WARNING URL not available in offline run
python/lib/python3.9/site-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: amp_bf16; MFU cannot be calculated and reported. gpu_flops_available can be manuallyoverridden by setting gpu_flops_available in SpeedMonitor.
  warnings.warn(
python/lib/python3.9/site-packages/composer/callbacks/memory_monitor.py:94: UserWarning: The memory monitor only works on CUDA devices, but the model is on meta.
  warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')
Logging config...
data_local: LLM-Shearing/llmshearing/data/sample_redpajama
data_remote: null
tokenizer_name: model/llama-7b
max_seq_len: 4096
global_seed: 18
run_name: llama2_7b_pruning_scaling_doremi_to2.7b_sl4096
model:
  name: mosaic_llama2_7b
  path: LLM-Shearing/models/Llama-2-7b-composer/state_dict.pt
  init_device: meta
  tokenizer_name: ${tokenizer_name}
  d_model: 4096
  n_heads: 32
  n_layers: 32
  intermediate_size: 11008
  max_seq_len: ${max_seq_len}
  vocab_size: 32000
  init_std: 0.02
  attn_pdrop: 0.0
  resid_pdrop: 0.0
  emb_pdrop: 0.0
  attn_impl: flash
  rms_norm_eps: 1.0e-05
  l0_module:
    start_sparsity: 0.0
    target_sparsity: 0.5
    pruning_modules:
    - head
    - intermediate
    - layer
    - hidden
    lagrangian_warmup_steps: 640ba
    target_model:
      d_model: 2560
      n_layers: 24
      n_heads: 20
      intermediate_size: 6912
      vocab_size: 32000
    eval_target_model: false
  set_names:
  - cc
  - github
  - book
  - stackexchange
  - wiki
  - arxiv
  - c4-rp
tokenizer:
  type: hftokenizer
  args:
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train_small
    shuffle: true
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
    is_uint16: true
  drop_last: true
  num_workers: 20
  prefetch_factor: null
  persistent_workers: false
eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: eval_merge
    shuffle: false
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
    is_uint16: true
  drop_last: false
  num_workers: 8
scheduler:
  t_warmup: 320ba
  alpha_f: 0.1
optimizer:
  lr: 0.0001
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
max_duration: 3200ba
eval_interval: 50ba
eval_subset_num_batches: 1000
global_train_batch_size: 32
seed: ${global_seed}
device_eval_batch_size: 2
device_train_microbatch_size: 2
precision: amp_bf16
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: DEFAULT
  activation_checkpointing: true
  activation_cpu_offload: false
  verbose: false
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  memory_monitor: {}
  lr_monitor: {}
loggers:
  wandb:
    project: pruning
    name: ${run_name}
    entity: pruning
    init_kwargs:
      mode: offline
      dir: LLM-Shearing/pruning_output/llama2_7b_pruning_scaling_doremi_to2.7b_sl4096
      project: pruning
      name: llama2_7b_pruning_scaling_doremi_to2.7b_sl4096
      entity: pruning
save_interval: 320ba
save_folder: LLM-Shearing/pruning_output/llama2_7b_pruning_scaling_doremi_to2.7b_sl4096
eval_first: false
autoresume: false
dist_timeout: 1800.0
n_gpus: 4
device_train_batch_size: 8
device_train_grad_accum: 4
n_params: 6738773034

Starting training...
xiamengzhou commented 7 months ago

Hi! Could you try setting train_loader.num_workers=0 as specified in the pruning script? Dynamic batch loading does not work with multiple dataloader workers yet.
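If it helps, here is a minimal sketch of the same override applied directly to the loaded config rather than on the composer command line; the YAML filename is just a placeholder for the pruning config, and OmegaConf is assumed since the logged config above uses ${...}-style interpolation.

# Rough sketch: force a single dataloader worker by editing the loaded config
# before the train loader is built. This has the same effect as passing
# train_loader.num_workers=0 as a command-line override to the pruning script.
from omegaconf import OmegaConf

cfg = OmegaConf.load("pruning_config.yaml")   # placeholder path
cfg.train_loader.num_workers = 0              # dynamic batch loading needs a single worker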

logan-zou commented 7 months ago

Hi! Could you try setting train_loader.num_workers=0 as specified in the pruning script? Dynamic batch loading does not work with multiple dataloader workers yet.

Thanks for replying. However, I have tried setting train_loader.num_workers=0 and the bug still exists. I have done some debugging and found that the issue may be in the eval_loader, but I can't figure it out. Is there any other method?

xiamengzhou commented 7 months ago

I have done some debugging and found that the issue may be in the eval_loader

Could you elaborate more on this and provide more information on where exactly the program hangs?

fwtan commented 6 months ago

The training gets stuck at "spinning the dataloader" (visible when setting python_log_level to DEBUG in the Trainer). This appears to be a bug in llm-foundry: https://github.com/mosaicml/llm-foundry/issues/436
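A minimal sketch of how to surface those DEBUG messages, roughly approximating what the Trainer's python_log_level option does by raising the level on the relevant loggers (the logger names are assumptions):

# Rough sketch: raise the Python log level so that dataloader/streaming
# messages such as "spinning the dataloader" become visible. This roughly
# approximates passing python_log_level="debug" to the Composer Trainer.
import logging

logging.basicConfig(level=logging.DEBUG)
for name in ("composer", "streaming"):        # assumed logger names
    logging.getLogger(name).setLevel(logging.DEBUG)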

The solution is to clean the stale shared memory, either by calling streaming.base.util.clean_stale_shared_memory() or by following the suggestion in https://github.com/mosaicml/llm-foundry/issues/436#issuecomment-1627712085
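For completeness, a minimal sketch of the cleanup call, assuming a mosaicml-streaming version that exports it; it should run once before the StreamingDataset-based loaders are built.

# Rough sketch: remove stale shared-memory segments left behind by a previous
# (crashed or killed) StreamingDataset run, which can otherwise make a new run
# hang while "spinning the dataloader".
from streaming.base.util import clean_stale_shared_memory

clean_stale_shared_memory()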

xiamengzhou commented 6 months ago

@fwtan Thanks for sharing your solution! I am surprised that I didn't run into this issue at all when doing my experiments.

logan-zou commented 6 months ago

The training gets stuck at "spinning the dataloader" (visible when setting python_log_level to DEBUG in the Trainer). This appears to be a bug in llm-foundry: mosaicml/llm-foundry#436

The solution is to clean the stale shared memory, either by calling streaming.base.util.clean_stale_shared_memory() or by following the suggestion in mosaicml/llm-foundry#436 (comment)

I have fixed the problem using your method. Thanks a lot for the help!