lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

Llama 2 70B QLoRA training not converging #2578

Open alwayshalffull opened 10 months ago

alwayshalffull commented 10 months ago

Hi folks,

I'm running into an issue fine-tuning the Llama 2 70B model with 4-bit QLoRA using the FastChat package, and I'm wondering if anyone else has encountered similar issues or has suggestions for a fix. Briefly, here's my training command, based on the train_lora.sh script included with FastChat:

deepspeed fastchat/train/train_lora.py \
    --model_name_or_path meta-llama/Llama-2-70b-hf  \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --data_path ~/data.json \
    --output_dir ~/.checkpoints \
    --num_train_epochs 3 \
    --bf16 True \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "steps" \
    --eval_steps 100 \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_strategy "steps" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 4096 \
    --q_lora True \
    --deepspeed playground/deepspeed_config_s2.json \
    --gradient_checkpointing True \
    --flash_attn True \
    --lazy_preprocess True
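
For anyone unfamiliar with the q_lora flag: as far as I understand it, passing --q_lora True makes train_lora.py load the base model 4-bit quantized via bitsandbytes and wrap it with a PEFT LoRA adapter, roughly along these lines (a sketch of the standard transformers/peft QLoRA recipe, not the exact FastChat code):

# Rough sketch of what --q_lora True does inside train_lora.py, as I understand
# it (standard transformers + peft QLoRA recipe, not the exact FastChat code):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches --bf16 True
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                                    # --lora_r 8
    lora_alpha=16,                          # --lora_alpha 16
    lora_dropout=0.05,                      # --lora_dropout 0.05
    target_modules=["q_proj", "v_proj"],    # FastChat's defaults, if I recall correctly
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)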

One notable change is that in train_lora.py, I import llama2_flash_attn_monkey_patch instead of llama_flash_attn_monkey_patch.
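
Concretely, the swap looks like this (a sketch of my local edit; replace_llama_attn_with_flash_attn is, as far as I can tell, the function both monkey-patch modules expose):

# In fastchat/train/train_lora.py: swap the Llama flash-attention patch for the
# Llama 2 one (my local edit, sketched from memory):
from fastchat.train.llama2_flash_attn_monkey_patch import (
    replace_llama_attn_with_flash_attn,
)

# Called before the model is loaded so attention uses the flash-attn kernels.
replace_llama_attn_with_flash_attn()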

When I ran the training job, the model's output was quite poor. Among other things, it wasn't properly stopping at the end of Assistant messages at inference time, and would instead continue generating entire conversations after a single User message. The loss also didn't converge as well as in the non-QLoRA jobs I've run: it oscillated around 0.6-1.0 during epochs 2 and 3, whereas it usually decreases to around 0.2-0.3 by the end of 3 epochs.
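
For reference, this is roughly how I check whether the fine-tuned adapter ever emits the end-of-turn token at inference time (a hypothetical diagnostic script, not part of FastChat; the paths and prompt format are placeholders):

# Hypothetical check: does the QLoRA fine-tune ever emit the EOS token after an
# Assistant reply? The paths and prompt format below are placeholders.
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base = "meta-llama/Llama-2-70b-hf"
adapter = os.path.expanduser("~/.checkpoints")  # placeholder: LoRA adapter dir

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)
model = PeftModel.from_pretrained(model, adapter)

prompt = "USER: Hello, who are you?\nASSISTANT:"  # placeholder prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
text = tokenizer.decode(out[0], skip_special_tokens=False)

# If training worked, generation should stop at the EOS token instead of rolling
# straight into a new USER turn.
print(text)
print("contains EOS:", tokenizer.eos_token in text)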

Has anyone encountered similar issues? If so, how did you solve them? Thanks in advance!

alwayshalffull commented 10 months ago

The environment was CUDA 12.1 on 4x H100 GPUs, with these package versions:

accelerate==0.23.0
bitsandbytes==0.41.1
deepspeed==0.10.3
flash-attn==2.3.0
fschat==0.2.25
peft==0.5.0
tokenizers==0.13.3
torch==2.1.0
torchaudio==2.1.0
torchvision==0.16.0
transformers==4.34.dev0