mikefrandsen opened this issue 1 year ago
Fine-tuning Llama 2 should be similar to fine-tuning Vicuna, since Vicuna uses Llama as its base model. You can also check out the scripts folder for more training references.
Will any of the scripts run with minimal changes on V100s, or are 4x A100s required? (8x V100 16GB = 128 GB VRAM; 4x A100 40GB = 160 GB VRAM. I keep running out of CUDA memory.)
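For intuition, some rough back-of-envelope math (my own estimate, not from the FastChat docs): under DeepSpeed ZeRO stage 2 every GPU keeps a full fp16 copy of the base weights, so a 7B model costs ~13 GiB per device before activations, gradients, and optimizer state even touch memory, which nearly fills a 16 GB V100 on its own.

```bash
# Back-of-envelope VRAM math (assumptions: 7e9 params, fp16 = 2 bytes/param,
# ZeRO stage 2 replicates the full weights on every GPU)
PARAMS=7000000000
BYTES_PER_PARAM=2
echo "fp16 base weights: $(( PARAMS * BYTES_PER_PARAM / 1024**3 )) GiB per GPU"  # ~13 GiB
echo "8x V100 16GB: $(( 8 * 16 )) GB aggregate"
echo "4x A100 40GB: $(( 4 * 40 )) GB aggregate"
```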
```bash
#!/bin/bash
FC=../FastChat
DATA_PATH=${FC}/data/dummy_conversation.json
# Guess: Both s2 and s3 are available
PATH_TO_DEEPSPEED_CONFIG=${FC}/playground/deepspeed_config_s2.json

deepspeed ${FC}/fastchat/train/train_lora.py \
    --model_name_or_path lmsys/vicuna-7b-v1.3 \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --data_path $DATA_PATH \
    --output_dir ./checkpoints \
    --num_train_epochs 150 \
    --fp16 True \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --eval_steps 100 \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_strategy "steps" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --q_lora False \
    --deepspeed $PATH_TO_DEEPSPEED_CONFIG \
    --gradient_checkpointing True \
    --flash_attn False
```
Fails quickly with CUDA out of memory.
Is there a "hello world" training run to verify that things are working?
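The closest thing to a hello-world run I can think of is capping the job at a few optimizer steps on the bundled dummy data. A sketch, assuming train_lora.py forwards standard Hugging Face TrainingArguments such as --max_steps:

```bash
#!/bin/bash
# Smoke-test sketch: same entry point as the script above, capped at 10 steps.
# Assumption: --max_steps is forwarded to the HF Trainer unchanged.
FC=../FastChat
deepspeed ${FC}/fastchat/train/train_lora.py \
    --model_name_or_path lmsys/vicuna-7b-v1.3 \
    --data_path ${FC}/data/dummy_conversation.json \
    --output_dir ./smoke_test \
    --max_steps 10 \
    --fp16 True \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 2e-5 \
    --model_max_length 512 \
    --q_lora False \
    --gradient_checkpointing True \
    --deepspeed ${FC}/playground/deepspeed_config_s2.json
```

If this completes, the stack (CUDA, DeepSpeed, tokenizer, data loading) is wired up correctly and any OOM hunting can move on to the real settings.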
I updated some docs for training (https://github.com/lm-sys/FastChat#fine-tuning-vicuna-7b-with-local-gpus). Llama2-7b/13b and Llama1-7b/13b have exactly the same architecture, so all old scripts and configs should just work.
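If that holds, the only edit the script above should need is the base model ID. A sketch, untested here; note the Llama 2 weights are gated on the Hub, so accept Meta's license and run `huggingface-cli login` first:

```bash
#!/bin/bash
# Same flags as the script at the top of the thread; only the model changes.
FC=../FastChat
deepspeed ${FC}/fastchat/train/train_lora.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path ${FC}/data/dummy_conversation.json \
    --output_dir ./checkpoints-llama2 \
    --num_train_epochs 3 \
    --fp16 True \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --learning_rate 2e-5 \
    --model_max_length 2048 \
    --q_lora False \
    --gradient_checkpointing True \
    --deepspeed ${FC}/playground/deepspeed_config_s2.json
```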
I am getting LoRA training to work on a T5 variant on a machine with V100s, and I've seen elsewhere that "--fp16 True" triggers overflow on T5 models. However, that's the only training script I've been able to customize and get to work, given that the others give either:
@mikefrandsen I am running the same task (fine-tuning llama-7b) on the same architecture, 8x V100 (16 GB of GPU memory each), and got the same CUDA out-of-memory error. Did you fix this?
Any solution for CUDA out of memory when training llama-7b?
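The usual knobs, as a hedged sketch (my assumptions, not verified in this thread): drop the per-device batch size to 1 and compensate with gradient accumulation, shorten --model_max_length, and flip --q_lora to True so the frozen base weights are loaded 4-bit-quantized and take roughly a quarter of the fp16 footprint. It's worth double-checking that your bitsandbytes build supports 4-bit kernels on the V100 before relying on this.

```bash
#!/bin/bash
# OOM-mitigation sketch for 16 GB GPUs. Every change from the original
# script is a memory lever; values are starting points, not recommendations.
FC=../FastChat
deepspeed ${FC}/fastchat/train/train_lora.py \
    --model_name_or_path lmsys/vicuna-7b-v1.3 \
    --data_path ${FC}/data/dummy_conversation.json \
    --output_dir ./checkpoints \
    --num_train_epochs 3 \
    --fp16 True \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --model_max_length 1024 \
    --q_lora True \
    --gradient_checkpointing True \
    --deepspeed ${FC}/playground/deepspeed_config_s2.json
```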
The main FastChat README references "Fine-tuning Vicuna-7B with Local GPUs".
Writing this up as an "issue" but it's really more of a documentation request.
I'd like an example that fine-tunes a Llama 2 model -- perhaps with at least a couple of GPU hardware configs -- and it's been hard to find which command-line settings should change and which are fine as-is. Specifically:

- Is Llama 2 similar enough that similar switches should work, or is it fundamentally different in some way?
- Some of the switches are specific to the GPU involved; what's a good reference for this?
- Are there guidelines for memory/resource/time requirements/epochs, or some rules of thumb?
- Libraries and versions needed for FastChat seem only loosely spelled out: the dependencies list torch without a version, and some requirements like ninja and flash-attn weren't found until training time (https://github.com/lm-sys/FastChat/blob/main/pyproject.toml). A setup sketch follows below.
I assume some of this info is spread out here and there, but maybe the README could add more pointers to it?
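On the dependency point, this is the setup incantation I've pieced together (assumption: pyproject.toml defines a `train` extra for the training-only packages; if your version doesn't, install them individually):

```bash
# From inside a FastChat checkout. flash-attn compiles from source,
# so it needs ninja and a CUDA toolchain matching your torch build.
pip3 install -e ".[train]"
pip3 install ninja
pip3 install flash-attn --no-build-isolation
```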