openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

Loss reaches 0 when finetuning 7B model using 1xA100 80G #75

Open · rootally opened this issue 12 months ago

rootally commented 12 months ago

I'm using the config below, and I load the base model as torch.float16.

```
--model_name_or_path llama_model \
--data_path data.json \
--bf16 True \
--num_train_epochs $3 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 3 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to tensorboard
```
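For reference, a minimal sketch of how I load the base model, assuming the Hugging Face transformers Auto classes (`llama_model` is the local path passed as `--model_name_or_path` above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base OpenLLaMA 7B weights, loaded in float16 as described above.
# Note that the training command itself passes --bf16 True.
model = AutoModelForCausalLM.from_pretrained(
    "llama_model",              # local path from --model_name_or_path
    torch_dtype=torch.float16,
)
# use_fast=False to avoid the fast-tokenizer issue noted in the OpenLLaMA README.
tokenizer = AutoTokenizer.from_pretrained("llama_model", use_fast=False)
```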

gjmulder commented 12 months ago
  1. Are you talking about eval set loss or training loss?
  2. Plot both as a function of epoch, similar to #63, to see whether you are overfitting or underfitting (see the plotting sketch after this list)
  3. How large is your data set?
  4. How many epochs is $3 set to?
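A minimal sketch of one way to produce such a plot, assuming your training script uses the Hugging Face Trainer (which writes a trainer_state.json into each checkpoint directory); the checkpoint path below is a placeholder:

```python
import json
import matplotlib.pyplot as plt

# trainer_state.json is written by the Hugging Face Trainer into each
# checkpoint directory; point this at one of your actual checkpoints.
with open("output_dir/checkpoint-1200/trainer_state.json") as f:
    state = json.load(f)

# With --logging_steps 1 there is one training-loss entry per step.
train_logs = [e for e in state["log_history"] if "loss" in e]
epochs = [e["epoch"] for e in train_logs]
losses = [e["loss"] for e in train_logs]

plt.plot(epochs, losses, label="training loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("loss_vs_epoch.png")
```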
rootally commented 11 months ago

@gjmulder thanks for getting back to me.

  1. Training loss.
  2. The training loss actually drops to 0 at the second step and never recovers.
  3. The dataset is around 100 MB.
  4. 3 epochs.
gjmulder commented 11 months ago

Without a plot it is difficult to say for certain, but you are probably overfitting. Don't train for more than one epoch.