princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Training starts but only outputs config information #61

Open Beatlesso opened 6 months ago

Beatlesso commented 6 months ago

After I run pruning.sh, the command line prints "Starting training..." and the config dump, but I got no further output after leaving it running overnight. I'm running the program on two RTX 4090 GPUs; each shows 16.6 GB of memory allocated but draws only about 25 W, as if the program is not running at all. This makes me think it is not just running slowly, but that something else is going on. Is the warning "The memory monitor only works on CUDA devices, but the model is on cpu." causing the problem? Can you help me figure out the cause? Below is my output:

/opt/conda/lib/python3.10/site-packages/composer/callbacks/speed_monitor.py:120: UserWarning: gpu_flop count not found for None with precision: amp_bf16; MFU cannot be calculated and reported. gpu_flops_available can be manuallyoverridden by setting gpu_flops_available in SpeedMonitor.
  warnings.warn(
/opt/conda/lib/python3.10/site-packages/composer/callbacks/memory_monitor.py:94: UserWarning: The memory monitor only works on CUDA devices, but the model is on cpu.
  warnings.warn(f'The memory monitor only works on CUDA devices, but the model is on {model_device.type}.')
Logging config...
data_local: LLM-Shearing/llmshearing/data/for_prune
data_remote: null
tokenizer_name: LLM-Shearing/llmshearing/meta-llama/Llama-2-7b-hf
max_seq_len: 512
global_seed: 17
run_name: llama2_7b_pruning_scaling_doremi_to2.7b_sl512
model:
  name: mosaic_llama2_7b
  path: LLM-Shearing/llmshearing/meta-llama/Llama-2-7b-hf/mosaic-7B/state_dict.pt
  init_device: cpu
  tokenizer_name: ${tokenizer_name}
  d_model: 4096
  n_heads: 32
  n_layers: 32
  intermediate_size: 11008
  max_seq_len: ${max_seq_len}
  vocab_size: 32000
  init_std: 0.02
  attn_pdrop: 0.0
  resid_pdrop: 0.0
  emb_pdrop: 0.0
  attn_impl: flash
  rms_norm_eps: 1.0e-05
  l0_module:
    start_sparsity: 0.0
    target_sparsity: 0.5
    pruning_modules:
    - head
    - intermediate
    - layer
    - hidden
    lagrangian_warmup_steps: 640ba
    target_model:
      d_model: 2560
      n_layers: 32
      n_heads: 20
      intermediate_size: 6912
      vocab_size: 32000
    eval_target_model: false
  set_names:
  - cc
  - github
  - book
  - stackexchange
  - wiki
  - arxiv
  - c4-rp
tokenizer:
  type: hftokenizer
  args:
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
train_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: train_small
    shuffle: true
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
    is_uint16: true
  drop_last: true
  num_workers: 0
  prefetch_factor: null
  persistent_workers: false
eval_loader:
  name: text
  dataset:
    local: ${data_local}
    remote: ${data_remote}
    split: eval_merge
    shuffle: false
    tokenizer_name: ${tokenizer_name}
    max_seq_len: ${max_seq_len}
    shuffle_seed: ${global_seed}
    is_uint16: true
  drop_last: false
  num_workers: 8
scheduler:
  t_warmup: 320ba
  alpha_f: 0.1
optimizer:
  lr: 0.0001
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
algorithms:
  gradient_clipping:
    clipping_type: norm
    clipping_threshold: 1.0
max_duration: 3200ba
eval_interval: 50ba
eval_subset_num_batches: 1000
global_train_batch_size: 2
seed: ${global_seed}
device_eval_batch_size: 1
device_train_microbatch_size: 1
precision: amp_bf16
fsdp_config:
  sharding_strategy: FULL_SHARD
  mixed_precision: DEFAULT
  activation_checkpointing: true
  activation_cpu_offload: false
  verbose: false
progress_bar: false
log_to_console: true
console_log_interval: 1ba
callbacks:
  speed_monitor:
    window_size: 10
  memory_monitor: {}
  lr_monitor: {}
loggers:
  wandb:
    project: pruning
    name: ${run_name}
    entity: pruning
    init_kwargs:
      mode: offline
      dir: Shear-output/test_release_pruning_full/llama2_7b_pruning_scaling_doremi_to2.7b_sl512
      project: pruning
      name: llama2_7b_pruning_scaling_doremi_to2.7b_sl512
      entity: pruning
save_interval: 3200ba
save_folder: Shear-output/test_release_pruning_full/llama2_7b_pruning_scaling_doremi_to2.7b_sl512
eval_first: false
autoresume: false
dist_timeout: 1800.0
n_gpus: 2
device_train_batch_size: 1
device_train_grad_accum: 1
n_params: 6738773034

Starting training...
******************************
Config:
enabled_algorithms/GradientClipping: true
node_name: unknown because NODENAME environment variable not set
num_gpus_per_node: 2
num_nodes: 1
rank_zero_seed: 17
******************************

GPU information

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:1A:00.0 Off |                  Off |
| 44%   28C    P8              24W / 450W |  16618MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:89:00.0 Off |                  Off |
| 44%   27C    P8              22W / 450W |  16618MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

The bash command I ran:

# Run in bash, it will automatically use resources available in the current environment
composer $TRAIN_SCRIPT \
    $config_file \
    run_name=${run_name} \
    data_local=${data_local} \
    eval_loader.dataset.split=${eval_split_name} \
    global_train_batch_size=${global_train_batch_size} \
    device_train_microbatch_size=${device_train_microbatch_size} \
    device_eval_batch_size=${device_eval_batch_size} \
    max_seq_len=${max_seq_len} \
    max_duration=${max_duration} \
    eval_first=false \
    scheduler.t_warmup=${t_warmup} \
    save_folder=${save_dir} \
    loggers.wandb.init_kwargs.dir=${wandb_dir} \
    eval_interval=${eval_interval} \
    save_interval=${save_interval} \
    optimizer.lr=${lr} \
    optimizer.lag_lr=${lag_lr} \
    model.path=${path} \
    model.l0_module.lagrangian_warmup_steps=${lagr_warmup} \
    model.l0_module.pruning_modules='[head,intermediate,layer,hidden]' \
    model.l0_module.eval_target_model=${eval_target_model} \
    model.l0_module.target_model.d_model=${target_d_model} \
    model.l0_module.target_model.n_heads=${target_n_heads} \
    model.l0_module.target_model.n_layers=${target_n_layers} \
    model.l0_module.target_model.intermediate_size=${target_intermediate_size} \
    callbacks.data_loading.dynamic=${dynamic} \
    callbacks.data_loading.set_names=${set_names} \
    callbacks.data_loading.proportion=${proportion} \
    callbacks.data_loading.update_type=${update_type} \
    callbacks.data_loading.target_loss=${target_loss} \
    train_loader.num_workers=0 \
    train_loader.prefetch_factor=null \
    train_loader.persistent_workers=false \
    autoresume=false
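
As a side note on the overrides above: the ${...} interpolations in the printed config and the dotted keys in the command (e.g. optimizer.lr, model.l0_module.target_model.d_model) follow the OmegaConf pattern. Below is a minimal, hypothetical sketch of how such dotted CLI overrides are typically merged into a YAML config; the file name llama2_7b.yaml and the standalone script are illustrative assumptions, not the repository's actual entry point.

# Hypothetical sketch: merge dotted CLI overrides into a YAML config with OmegaConf.
# "llama2_7b.yaml" is a placeholder file name, not a path from the repository.
import sys
from omegaconf import OmegaConf

yaml_cfg = OmegaConf.load("llama2_7b.yaml")      # base config (placeholder path)
cli_cfg = OmegaConf.from_cli(sys.argv[1:])       # e.g. max_seq_len=512 optimizer.lr=1e-4
cfg = OmegaConf.merge(yaml_cfg, cli_cfg)         # CLI values take precedence over the YAML
print(OmegaConf.to_yaml(cfg, resolve=True))      # resolves ${tokenizer_name}, ${max_seq_len}, ...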
xiamengzhou commented 6 months ago

It seems you are running into a hanging issue. Could you refer to this issue and see if the solution helps?
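
The issue linked in that comment is not quoted here. As a generic sanity check for this kind of multi-GPU hang (RTX 4090s lack peer-to-peer support, and disabling NCCL P2P is a commonly reported workaround on multi-4090 machines; whether that matches the linked issue is an assumption), a minimal sketch could look like this:

# Generic NCCL sanity check for a 2-GPU hang (a sketch, not the repo's code).
# Run with: torchrun --nproc_per_node=2 nccl_check.py   (file name is hypothetical)
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")       # verbose logs show where collectives stall
os.environ.setdefault("NCCL_P2P_DISABLE", "1")    # assumption: common workaround on multi-4090 boxes

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

x = torch.ones(1, device="cuda")
dist.all_reduce(x)                                # hangs here if inter-GPU communication is broken
print(f"rank {dist.get_rank()}: all_reduce ok, value={x.item()}")
dist.destroy_process_group()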

Beatlesso commented 6 months ago

It seems you are running into a hanging issue. Could you refer to this issue and see if the solution helps?

Thanks, that solved my problem. One more question: what is the minimum amount of GPU memory needed to run this pruning algorithm on Llama 2? A 4090 doesn't seem to have enough memory, and adding more GPUs doesn't solve the problem.

xiamengzhou commented 5 months ago

Based on my experience, a minimum of two A100 80GB GPUs is required to prune the Llama-2-7B model with a maximum sequence length of 2048.
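
For a rough sense of scale (back-of-envelope assumptions, not measurements from this run), the parameter and optimizer state alone for a 7B model already exceeds what two 24 GB cards can shard:

# Back-of-envelope memory estimate for pruning Llama-2-7B (assumptions, not measurements).
n_params = 6_738_773_034            # n_params from the config dump above

weights_bf16 = n_params * 2         # bf16 parameters
grads_bf16   = n_params * 2         # bf16 gradients
adam_states  = n_params * 4 * 2     # fp32 exp_avg + exp_avg_sq (typical Adam)

total = weights_bf16 + grads_bf16 + adam_states
per_gpu = total / 2                 # FULL_SHARD across 2 GPUs, ignoring activations and L0 masks

gib = 1024 ** 3
print(f"~{total / gib:.0f} GiB total, ~{per_gpu / gib:.0f} GiB per GPU")
# ~75 GiB total, ~38 GiB per GPU before activations -- well above a 24 GB RTX 4090,
# and consistent with the two A100 80GB recommendation.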