tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

Training bug for 13b, 30b, and 65b #285

Open alexgshaw opened 1 year ago

alexgshaw commented 1 year ago

Has anyone been able to finetune any of the models larger than 7b successfully? I'm training on 8 A100s with 80 GB of memory each, which should be more than enough.

The problem I'm running into is that the first logged loss is massive (~1e5) and every loss after that is 0. I'm not sure what is causing this or how to fix it, since the 7b model trains fine. I'm launching with the deepspeed launcher.

Here's an example of the output when training the 65b model.

  0%|          | 0/82 [00:00<?, ?it/s]
  1%|          | 1/82 [02:38<3:34:18, 158.74s/it]
  2%|▏         | 2/82 [04:39<3:01:49, 136.37s/it]
  4%|▎         | 3/82 [06:40<2:50:13, 129.28s/it]
  5%|▍         | 4/82 [08:39<2:43:07, 125.48s/it]
  6%|▌         | 5/82 [10:39<2:38:05, 123.19s/it]
  7%|▋         | 6/82 [12:39<2:34:54, 122.30s/it]
  9%|▊         | 7/82 [14:39<2:31:49, 121.46s/it]
 10%|▉         | 8/82 [16:38<2:28:55, 120.75s/it]
 11%|█         | 9/82 [18:38<2:26:39, 120.54s/it]
 12%|█▏        | 10/82 [20:38<2:24:23, 120.33s/it]

{'loss': 121486.8, 'learning_rate': 0.0, 'epoch': 0.02}

 12%|█▏        | 10/82 [20:38<2:24:23, 120.33s/it]
 13%|█▎        | 11/82 [22:38<2:22:06, 120.10s/it]
 15%|█▍        | 12/82 [24:38<2:20:08, 120.12s/it]
 16%|█▌        | 13/82 [26:38<2:18:11, 120.17s/it]
 17%|█▋        | 14/82 [28:38<2:16:09, 120.15s/it]
 18%|█▊        | 15/82 [30:39<2:14:13, 120.21s/it]
 20%|█▉        | 16/82 [32:39<2:12:09, 120.15s/it]
 21%|██        | 17/82 [34:38<2:09:52, 119.89s/it]
 22%|██▏       | 18/82 [36:37<2:07:41, 119.71s/it]
 23%|██▎       | 19/82 [38:36<2:05:26, 119.47s/it]
 24%|██▍       | 20/82 [40:36<2:03:32, 119.55s/it]

{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.05}

 24%|██▍       | 20/82 [40:36<2:03:32, 119.55s/it]
 26%|██▌       | 21/82 [42:42<2:03:38, 121.61s/it]
 27%|██▋       | 22/82 [44:43<2:01:21, 121.36s/it]
 28%|██▊       | 23/82 [46:42<1:58:48, 120.81s/it]
 29%|██▉       | 24/82 [48:42<1:56:19, 120.34s/it]
 30%|███       | 25/82 [50:41<1:54:01, 120.03s/it]
 32%|███▏      | 26/82 [52:39<1:51:33, 119.53s/it]
 33%|███▎      | 27/82 [54:39<1:49:29, 119.44s/it]
 34%|███▍      | 28/82 [56:38<1:47:29, 119.43s/it]
 35%|███▌      | 29/82 [58:37<1:45:17, 119.20s/it]
 37%|███▋      | 30/82 [1:00:37<1:43:29, 119.42s/it]

{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.07}
alexgshaw commented 1 year ago

My finetuning arguments:

    --model_name_or_path /home/ashaw8/compute/$MODEL_DIR/$MODEL_NAME \
    --data_path ./alpaca_data.json \
    --run_name $RUN_NAME \
    --bf16 True \
    --output_dir $OUTPUT_DIR \
    --logging_dir $LOGGING_DIR \
    --num_train_epochs 0.2 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy no \
    --save_strategy no \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --warmup_steps 50 \
    --lr_scheduler_type linear \
    --weight_decay 0.1 \
    --deepspeed ./configs/default_offload_opt_param.json \
    --tf32 True \
    --logging_strategy steps \
    --logging_steps 10 \
    --report_to wandb \
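
As a sanity check on these numbers: with 8 GPUs, this effective batch size, and the stock 52K-example alpaca_data.json (both assumptions on my part), the run should come out to roughly the 82 optimizer steps shown in the log above. A quick sketch of the arithmetic:

    # Rough step-count check for the arguments above. The 52,002-example
    # alpaca_data.json and the 8-GPU world size are assumptions.
    examples = 52002
    world_size = 8                      # 8x A100
    per_device_batch = 2                # --per_device_train_batch_size
    grad_accum = 8                      # --gradient_accumulation_steps
    epochs = 0.2                        # --num_train_epochs

    global_batch = per_device_batch * grad_accum * world_size   # 128
    total_steps = epochs * examples / global_batch              # ~81, consistent with the 82 steps in the log
    print(global_batch, total_steps)

Note that --warmup_steps 50 covers more than half of such a short run.
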
yh0903 commented 1 year ago

Same here. Could you share how you eventually solved this? Thanks.

yxchng commented 1 year ago

Are you able to train with batch size 4, as in the README?

alexgshaw commented 1 year ago

Haven't solved it yet, but switching from the Hugging Face Trainer to PyTorch Lightning might solve the issue. If I can get it to work, I'll post a link to a repo with everything set up.

Also, I switched to a different machine with V100s instead of A100s, and 13b trains fine there. It could also be a version difference, because I can use Docker containers on the V100 machine but only venvs on the A100 machine (the admins are stingy about root access).

Also, yes, I'm able to train with a batch size of 4, but it makes no difference.
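
In case it helps anyone following along, here is a minimal sketch of what that PyTorch Lightning route could look like. It is an illustration, not code from this repo: the model path, dataloader, and Trainer settings are assumptions, and the strategy/precision strings assume a recent Lightning release.

    # Minimal sketch of the PyTorch Lightning route mentioned above (illustrative,
    # not from this repo). Expects a dataloader yielding dicts with
    # input_ids / attention_mask / labels.
    import pytorch_lightning as pl
    import torch
    from transformers import AutoModelForCausalLM


    class CausalLMModule(pl.LightningModule):
        def __init__(self, model_name_or_path, lr=1e-5):
            super().__init__()
            self.model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
            self.lr = lr

        def training_step(self, batch, batch_idx):
            # The HF model returns the causal-LM loss when `labels` is passed.
            out = self.model(**batch)
            self.log("train_loss", out.loss, prog_bar=True)
            return out.loss

        def configure_optimizers(self):
            return torch.optim.AdamW(
                self.model.parameters(), lr=self.lr, betas=(0.9, 0.95), weight_decay=0.1
            )


    # Illustrative usage (strategy/precision strings depend on the Lightning version):
    # trainer = pl.Trainer(accelerator="gpu", devices=8,
    #                      strategy="deepspeed_stage_3", precision="bf16-mixed")
    # trainer.fit(CausalLMModule("/path/to/llama-13b"), train_dataloaders=train_loader)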

alexgshaw commented 1 year ago

It seems like this might be a related issue:

https://github.com/huggingface/transformers/issues/14531

I turned off bf16 and it fixed my issue with 13b and 30b. Without bf16 I can't fit 65b on my GPUs, so I haven't tested that one yet.

Any idea why bf16 is causing this problem? I think it's preventing the optimizer from stepping, but I have no idea why.
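
For anyone debugging the "optimizer not stepping" theory, here is a minimal sketch of a check for non-finite parameters or gradients. Wiring it in as a Trainer callback is my own suggestion, not something from this repo, and under DeepSpeed ZeRO the tensors are sharded, so each rank may only see part of the picture.

    # Minimal sketch: scan for non-finite parameters/gradients to see whether
    # bf16 overflow is corrupting the model. The callback wiring is a suggestion,
    # not from this repo; under DeepSpeed ZeRO this only sees the local shard.
    import torch
    from transformers import TrainerCallback


    def non_finite_tensors(model):
        """Return (name, kind) pairs for any non-finite parameters or gradients."""
        bad = []
        for name, p in model.named_parameters():
            if not torch.isfinite(p).all():
                bad.append((name, "param"))
            if p.grad is not None and not torch.isfinite(p.grad).all():
                bad.append((name, "grad"))
        return bad


    class FiniteCheckCallback(TrainerCallback):
        # Runs after each optimizer step; gradients are usually already cleared
        # by then, so in practice this mostly checks the parameters themselves.
        # Call non_finite_tensors() right after a backward pass to inspect grads.
        def on_step_end(self, args, state, control, model=None, **kwargs):
            if model is not None:
                bad = non_finite_tensors(model)
                if bad:
                    print(f"step {state.global_step}: non-finite tensors: {bad[:5]}")


    # Illustrative usage: trainer.add_callback(FiniteCheckCallback())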