lucasjinreal opened this issue 1 year ago
The model did not converge and always outputs repeated content.
Here are my hyperparameters, using 2 V100s:
```bash
torchrun --nnodes 1 --nproc_per_node 2 train_fschat_lora_bc.py \
    --data_path ./data/train_data.json \
    --model_name_or_path checkpoints/baichuan-7B \
    --deepspeed configs/ds_zero2_offload.json \
    --per_device_train_batch_size 1 \
    --output_dir out/fschat_bc \
    --fp16 \
    --num_train_epochs 4 \
    --lazy_preprocess \
    --gradient_accumulation_steps 16 \
    --learning_rate 5e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --model_max_length 512 \
    --resume=$resume
```
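For reference, here is a minimal sketch of what a comparable LoRA setup looks like with Hugging Face `peft`. This is an assumption about what `train_fschat_lora_bc.py` does internally; the rank, alpha, and dropout values below are hypothetical illustrations, not the script's actual settings:

```python
# Minimal LoRA setup sketch with Hugging Face peft (not the actual
# train_fschat_lora_bc.py code; values below are hypothetical).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "checkpoints/baichuan-7B",
    trust_remote_code=True,  # Baichuan ships custom model code
)

lora_config = LoraConfig(
    r=8,                        # adapter rank (hypothetical value)
    lora_alpha=16,              # scaling factor (hypothetical value)
    lora_dropout=0.05,          # hypothetical value
    target_modules=["W_pack"],  # Baichuan's fused QKV projection (assumption)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Sanity check: only the small adapter matrices should be trainable.
model.print_trainable_parameters()
```

One thing worth checking: common LoRA recipes use a noticeably higher learning rate (roughly 1e-4 to 3e-4) than the full-fine-tuning-style 5e-5 above, since only the small adapter matrices are trained, and `--fp16` on V100s (which lack bf16 support) can also destabilize training. `print_trainable_parameters()` is a quick way to confirm the base weights are actually frozen.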
Does anybody know why? This is very weird, since I looked into many LoRA implementations and their learning rates are almost the same as this.
@lucasjinreal did you manage to make it converge? It's been a while, I hope it all worked out well for you here!