lucasjinreal opened this issue 1 year ago
The model did not converge and always outputs repeated content.
Here are my hyperparameters, using 2 V100s:
```bash
torchrun --nnodes 1 --nproc_per_node 2 train_fschat_lora_bc.py \
    --data_path ./data/train_data.json \
    --model_name_or_path checkpoints/baichuan-7B \
    --deepspeed configs/ds_zero2_offload.json \
    --per_device_train_batch_size 1 \
    --output_dir out/fschat_bc \
    --fp16 \
    --num_train_epochs 4 \
    --lazy_preprocess \
    --gradient_accumulation_steps 16 \
    --learning_rate 5e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --model_max_length 512 \
    --resume=$resume
```
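For reference, here is a minimal sketch of what a comparable LoRA setup looks like with Hugging Face `peft`. This is an assumption about what `train_fschat_lora_bc.py` does internally; the rank, alpha, and dropout values below are hypothetical illustrations, not the script's actual settings:

```python
# Minimal LoRA setup sketch with Hugging Face peft (not the actual
# train_fschat_lora_bc.py code; values below are hypothetical).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "checkpoints/baichuan-7B",
    trust_remote_code=True,  # Baichuan ships custom model code
)

lora_config = LoraConfig(
    r=8,                        # adapter rank (hypothetical value)
    lora_alpha=16,              # scaling factor (hypothetical value)
    lora_dropout=0.05,          # hypothetical value
    target_modules=["W_pack"],  # Baichuan's fused QKV projection (assumption)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Sanity check: only the small adapter matrices should be trainable.
model.print_trainable_parameters()
```

One thing worth checking: common LoRA recipes use a noticeably higher learning rate (roughly 1e-4 to 3e-4) than the full-fine-tuning-style 5e-5 above, since only the small adapter matrices are trained, and `--fp16` on V100s (which lack bf16 support) can also destabilize training. `print_trainable_parameters()` is a quick way to confirm the base weights are actually frozen.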
Does anybody know why? This is very weird, since I looked into many LoRA implementations and their learning rates are almost the same as this.
@lucasjinreal did you manage to make it converge? It's been a while, I hope it all worked out well for you here!