mikefrandsen opened this issue 1 year ago
Fine-tuning Llama 2 should be similar to fine-tuning Vicuna, since Vicuna uses Llama as its base model. You can also check out the scripts folder for more training references.
Will any of the scripts run with minimal changes on V100s, or are 4x A100s required? (8x V100 16GB = 128 GB VRAM; 4x A100 40GB = 160 GB VRAM. I keep running out of CUDA memory.)
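For intuition, some rough back-of-envelope math (my own estimate, not from the FastChat docs): under DeepSpeed ZeRO stage 2 every GPU keeps a full fp16 copy of the base weights, so a 7B model costs ~13 GiB per device before activations, gradients, and optimizer state even touch memory, which nearly fills a 16 GB V100 on its own.

```bash
# Back-of-envelope VRAM math (assumptions: 7e9 params, fp16 = 2 bytes/param,
# ZeRO stage 2 replicates the full weights on every GPU)
PARAMS=7000000000
BYTES_PER_PARAM=2
echo "fp16 base weights: $(( PARAMS * BYTES_PER_PARAM / 1024**3 )) GiB per GPU"  # ~13 GiB
echo "8x V100 16GB: $(( 8 * 16 )) GB aggregate"
echo "4x A100 40GB: $(( 4 * 40 )) GB aggregate"
```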
```bash
#!/bin/bash
FC=../FastChat
DATA_PATH=${FC}/data/dummy_conversation.json
# Guess: Both s2 and s3 are available
PATH_TO_DEEPSPEED_CONFIG=${FC}/playground/deepspeed_config_s2.json

deepspeed ${FC}/fastchat/train/train_lora.py \
    --model_name_or_path lmsys/vicuna-7b-v1.3 \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --data_path $DATA_PATH \
    --output_dir ./checkpoints \
    --num_train_epochs 150 \
    --fp16 True \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --eval_steps 100 \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_strategy "steps" \
    --logging_steps 1 \
    --tf32 False \
    --model_max_length 2048 \
    --q_lora False \
    --deepspeed $PATH_TO_DEEPSPEED_CONFIG \
    --gradient_checkpointing True \
    --flash_attn False
```
Fails quickly with CUDA out of memory.
Is there a "hello world" training run to verify that things are working?
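The closest thing to a hello-world run I can think of is capping the job at a few optimizer steps on the bundled dummy data. A sketch, assuming train_lora.py forwards standard Hugging Face TrainingArguments such as --max_steps:

```bash
#!/bin/bash
# Smoke-test sketch: same entry point as the script above, capped at 10 steps.
# Assumption: --max_steps is forwarded to the HF Trainer unchanged.
FC=../FastChat
deepspeed ${FC}/fastchat/train/train_lora.py \
    --model_name_or_path lmsys/vicuna-7b-v1.3 \
    --data_path ${FC}/data/dummy_conversation.json \
    --output_dir ./smoke_test \
    --max_steps 10 \
    --fp16 True \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 2e-5 \
    --model_max_length 512 \
    --q_lora False \
    --gradient_checkpointing True \
    --deepspeed ${FC}/playground/deepspeed_config_s2.json
```

If this completes, the stack (CUDA, DeepSpeed, tokenizer, data loading) is wired up correctly and any OOM hunting can move on to the real settings.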
I updated some docs for training (https://github.com/lm-sys/FastChat#fine-tuning-vicuna-7b-with-local-gpus). Llama2-7b/13b and Llama1-7b/13b have exactly the same architecture, so all old scripts and configs should just work.
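If that holds, the only edit the script above should need is the base model ID. A sketch, untested here; note the Llama 2 weights are gated on the Hub, so accept Meta's license and run `huggingface-cli login` first:

```bash
#!/bin/bash
# Same flags as the script at the top of the thread; only the model changes.
FC=../FastChat
deepspeed ${FC}/fastchat/train/train_lora.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --data_path ${FC}/data/dummy_conversation.json \
    --output_dir ./checkpoints-llama2 \
    --num_train_epochs 3 \
    --fp16 True \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --learning_rate 2e-5 \
    --model_max_length 2048 \
    --q_lora False \
    --gradient_checkpointing True \
    --deepspeed ${FC}/playground/deepspeed_config_s2.json
```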
I am getting LoRA training to work on a T5 variant on a machine with V100s, and I've seen elsewhere that "--fp16 True" triggers overflow on T5 models. However, that's the only training script I've been able to customize and get to work, given that the others give either:
@mikefrandsen I am running the same task (fine-tuning llama-7b) on the same architecture, 8x V100 (16 GB of GPU memory each), and got the same CUDA out-of-memory error. Did you fix this?
Any solution for CUDA out of memory when training llama-7b?
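The usual knobs, as a hedged sketch (my assumptions, not verified in this thread): drop the per-device batch size to 1 and compensate with gradient accumulation, shorten --model_max_length, and flip --q_lora to True so the frozen base weights are loaded 4-bit-quantized and take roughly a quarter of the fp16 footprint. It's worth double-checking that your bitsandbytes build supports 4-bit kernels on the V100 before relying on this.

```bash
#!/bin/bash
# OOM-mitigation sketch for 16 GB GPUs. Every change from the original
# script is a memory lever; values are starting points, not recommendations.
FC=../FastChat
deepspeed ${FC}/fastchat/train/train_lora.py \
    --model_name_or_path lmsys/vicuna-7b-v1.3 \
    --data_path ${FC}/data/dummy_conversation.json \
    --output_dir ./checkpoints \
    --num_train_epochs 3 \
    --fp16 True \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --model_max_length 1024 \
    --q_lora True \
    --gradient_checkpointing True \
    --deepspeed ${FC}/playground/deepspeed_config_s2.json
```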
The main FastChat README references "Fine-tuning Vicuna-7B with Local GPUs".
Writing this up as an "issue" but it's really more of a documentation request.
I'd like an example that fine-tunes a Llama 2 model -- perhaps with at least a couple of GPU hardware configs -- and it's been hard to find which command-line settings should change and which are fine as-is. Specifically:

- Is Llama 2 similar enough that similar switches should work, or is it fundamentally different in some way?
- Some of the switches are specific to the GPU involved; what's a good reference for this?
- Are there guidelines for memory/resource/time requirements/epochs, or some rules of thumb?
- Libraries and versions needed for FastChat seem only loosely spelled out: the dependencies list torch without a version, and some requirements like ninja and flash-attn weren't found until training time (https://github.com/lm-sys/FastChat/blob/main/pyproject.toml). A setup sketch follows below.
I assume some of this info is spread out here and there, but maybe the README could add more pointers to it?
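On the dependency point, this is the setup incantation I've pieced together (assumption: pyproject.toml defines a `train` extra for the training-only packages; if your version doesn't, install them individually):

```bash
# From inside a FastChat checkout. flash-attn compiles from source,
# so it needs ninja and a CUDA toolchain matching your torch build.
pip3 install -e ".[train]"
pip3 install ninja
pip3 install flash-attn --no-build-isolation
```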