tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

The loss curve exhibits a stair-step pattern of descent. #502

Open s1ghhh opened 1 year ago

s1ghhh commented 1 year ago

I used the following settings to train on my own dataset with LoRA, but I found that the loss curve exhibits a stair-step pattern of descent: the loss drops sharply at the end/start of each epoch. This phenomenon also seems to be common, as I encountered the same issue when using the 7B model as the base model. Is this a problem with my parameter settings? Where should I start investigating this issue?

image

Here are my parameter settings:

base_model: ./vicuna-13b-v1.1
data_path: ***
output_dir: ./vicuna_13b
batch_size: 128
micro_batch_size: 16
num_epochs: 3
learning_rate: 0.0003
cutoff_len: 512
val_set_size: 0.05
lora_r: 128
lora_alpha: 256
lora_dropout: 0.1
lora_target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj', 'down_proj', 'gate_proj', 'up_proj']
train_on_inputs: True
group_by_length: False
wandb_project: 
wandb_run_name: 
wandb_watch: 
wandb_log_model: 
resume_from_checkpoint: False
prompt template: vicuna
trainable params: 500695040 || all params: 13516559360 || trainable%: 3.7043083721566257
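For reference, these hyperparameters correspond roughly to the following PEFT setup (a minimal sketch assuming the standard `peft`/`transformers` APIs, not the exact training script used here; the base-model path is the one listed above):

```python
# Sketch: build the LoRA adapter with the settings listed above.
# Assumes standard peft / transformers APIs; not the exact training code.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("./vicuna-13b-v1.1")

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "down_proj", "gate_proj", "up_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
# Should report roughly the figure above: ~500M trainable params (~3.7% of 13.5B).
model.print_trainable_parameters()
```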

Many thanks!

FHL1998 commented 1 year ago

Same issue here. Have you solved it?

s1ghhh commented 1 year ago

> Same issue here. Have you solved it?

I have not solved this problem, but I have made some new discoveries recently: when I use the default FastChat code, the loss curve looks fine, but when I modify the data-processing function, the loss curve descends in steps. I checked all the parameters and they are identical in both cases, so I think the data processing is what causes the step-down. It is worth mentioning that the models obtained in both cases work very well, but I am still curious what causes the steps. Have you solved this problem yet? Maybe we can discuss it.
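Sharp drops exactly at epoch boundaries are often just the model re-seeing (and having partly memorized) the same examples, so the data pipeline is a plausible suspect. One way to narrow it down is to diff what the two processing functions actually emit on a few samples. A minimal sketch, where `default_process` and `custom_process` are placeholder names for the FastChat default and the modified function:

```python
# Sketch: compare the tokenized outputs of two preprocessing functions on a
# few examples to see where they differ (input_ids, labels, attention_mask).
# `default_process` and `custom_process` are hypothetical placeholders.
def diff_processing(dataset, default_process, custom_process, n=5):
    for example in list(dataset)[:n]:
        a = default_process(example)
        b = custom_process(example)
        for key in ("input_ids", "labels", "attention_mask"):
            if a.get(key) != b.get(key):
                print(f"mismatch in '{key}':")
                print("  default:", a.get(key)[:32])
                print("  custom: ", b.get(key)[:32])
```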

s1ghhh commented 1 year ago

@FHL1998

IshootLaser commented 1 year ago

Hi, has anyone else noticed that the eval loss diverges? I have done many runs and most of them diverge. In some cases, the overfitted checkpoint produces better responses (e.g. dulcet-shape-11 below: epoch 10 performs better than the best epoch by eval loss for some responses, and is actually the best model out of all runs).

image
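If eval loss diverges while later checkpoints still generate better responses, one option is to keep several checkpoints and pick by downstream quality rather than by eval loss alone. A minimal sketch of the relevant `transformers.TrainingArguments` (the values here are illustrative, not the settings from the runs above):

```python
# Sketch: keep multiple checkpoints and optionally restore the one with the
# lowest eval loss; final selection can still be done by manually inspecting
# generations from each saved checkpoint. Values are illustrative only.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./vicuna_13b",
    num_train_epochs=10,
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    save_total_limit=10,            # keep the last 10 checkpoints for comparison
    load_best_model_at_end=True,    # restores the lowest-eval-loss checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```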