lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

LoRA finetuning model didn't converge #1734

Open lucasjinreal opened 1 year ago

lucasjinreal commented 1 year ago

The model didn't converge and always outputs some repeated content.

Here are my hyperparameters, using 2 V100s:

torchrun --nnodes 1 --nproc_per_node 2  train_fschat_lora_bc.py \
    --data_path ./data/train_data.json \
    --model_name_or_path checkpoints/baichuan-7B \
    --deepspeed configs/ds_zero2_offload.json --per_device_train_batch_size 1 \
    --output_dir out/fschat_bc \
    --fp16 --num_train_epochs 4 --lazy_preprocess \
    --gradient_accumulation_steps 16 \
    --learning_rate 5e-5 --weight_decay 0. \
    --warmup_ratio 0.03 --lr_scheduler_type "cosine" --model_max_length 512 \
    --resume=$resume
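
For reference, the effective batch size these flags give is fairly small; quick arithmetic (nothing FastChat-specific, just the numbers from the command above):

    # Effective batch size implied by the flags above
    per_device_train_batch_size = 1
    num_gpus = 2                      # --nproc_per_node 2
    gradient_accumulation_steps = 16
    effective_batch_size = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
    print(effective_batch_size)       # 32 samples per optimizer step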

Does anybody know why? This is very weird, since I looked into many LoRA implementations and their learning rates are almost the same as this.
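
For comparison, the implementations I looked at mostly attach the adapter with the standard peft pattern, roughly like this (the rank, alpha, and target modules below are illustrative assumptions, not FastChat's exact code):

    # Minimal sketch of a typical peft LoRA setup (illustrative values)
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained(
        "checkpoints/baichuan-7B", trust_remote_code=True
    )

    lora_config = LoraConfig(
        r=8,                        # rank of the low-rank update matrices
        lora_alpha=16,              # scaling factor applied to the update
        lora_dropout=0.05,
        target_modules=["W_pack"],  # fused attention projection in Baichuan; model-dependent
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # sanity check: only adapter weights should be trainable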

surak commented 10 months ago

@lucasjinreal did you manage to make it converge? It's been a while, I hope it all worked out well for you here!