CrazyElements closed this issue 7 months ago
I am happy to help, @CrazyElements. Can you provide the full training script and the hyperparameters you are using? You can also join our Slack for a quick discussion.
Thanks @jiaweizzhao. For mrpc, I just used the hyperparameters listed in the README:
python run_glue.py \
--model_name_or_path roberta-base \
--task_name mrpc \
--enable_galore \
--lora_all_modules \
--max_length 512 \
--seed=1234 \
--lora_r 4 \
--galore_scale 4 \
--per_device_train_batch_size 16 \
--update_proj_gap 500 \
--learning_rate 3e-5 \
--num_train_epochs 30 \
--output_dir results/ft/roberta_base/mrpc
For the other tasks, I modified learning_rate and num_train_epochs, and trained with run_glue.py.
I tried it and it works as expected. The issue might be that we report the F1 score for mrpc in the paper, which causes the confusion. I will change it back to accuracy in the next revision.
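For intuition on why the two metrics can diverge on mrpc, here is a minimal sketch (the label/prediction arrays are made-up placeholders, and scikit-learn is just one way to compute both):

from sklearn.metrics import accuracy_score, f1_score

# Placeholder predictions; in practice these come from the mrpc eval set.
y_true = [1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 1, 1, 1]

# mrpc is imbalanced toward the positive class, so binary F1 is
# typically a few points above accuracy for the same predictions.
print("accuracy:", accuracy_score(y_true, y_pred))  # 0.75
print("f1:", f1_score(y_true, y_pred))              # ~0.833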
Thank you for your response, but I'm still unable to replicate the results. The final F1 score for mrpc is 91.93, and the matthews_correlation for cola is 59.6.
By the way, did you use the results on the eval dataset at the last epoch as the final outcomes? The results I mentioned above were extracted from all_results.json, which actually corresponds to the eval-dataset results of the last epoch.
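As a sketch of the difference this makes: since all_results.json only stores the final evaluation, comparing last-epoch vs. best-epoch numbers requires logging eval metrics every epoch yourself (eval_log.jsonl below is a hypothetical file, not something run_glue.py writes out of the box):

import json

# Hypothetical per-epoch log: one JSON object per line, e.g.
# {"epoch": 1, "eval_accuracy": 0.87, "eval_f1": 0.91}
with open("eval_log.jsonl") as f:
    history = [json.loads(line) for line in f]

last = history[-1]                               # what all_results.json reflects
best = max(history, key=lambda m: m["eval_f1"])  # best epoch by F1

print(f"last epoch {last['epoch']}: f1={last['eval_f1']:.4f}")
print(f"best epoch {best['epoch']}: f1={best['eval_f1']:.4f}")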
This might be due to the choice of the random seed. I did a quick sweep using my previous setup (based on the config you provided):
python run_glue.py \
--model_name_or_path roberta-base \
--task_name mrpc \
--enable_galore \
--lora_all_modules \
--max_length 512 \
--seed=1234 \
--lora_r 4 \
--galore_scale 16 \
--per_device_train_batch_size 32 \
--update_proj_gap 500 \
--learning_rate 2e-5 \
--num_train_epochs 20 \
--output_dir results/ft/roberta_base/mrpc
This gives {"eval_accuracy": 0.8970588235294118, "eval_f1": 0.925531914893617}
> This might be due to the choice of the random seed
So I wonder if you used a different seed (not 1234)? Maybe I mistakenly assumed that the example script in the README would yield the same results. If you did use a different seed, would it be possible for you to consider open-sourcing the fine-tuning scripts?
> --galore_scale 16 \
> --per_device_train_batch_size 32 \
And here I used the hyperparameters listed in Table 7.
We use the average score over repeated runs. We will release the fine-tuning scripts later, along with a few more fine-tuning experiments.
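For anyone trying to match the reported numbers, a rough sketch of such a multi-seed average (the seed list is my own arbitrary choice, not the authors'; the flags are copied from the command above):

import json
import statistics
import subprocess

# Arbitrary seeds for illustration; the authors have not said which they used.
seeds = [1234, 42, 2024]

base_cmd = [
    "python", "run_glue.py",
    "--model_name_or_path", "roberta-base",
    "--task_name", "mrpc",
    "--enable_galore",
    "--lora_all_modules",
    "--max_length", "512",
    "--lora_r", "4",
    "--galore_scale", "16",
    "--per_device_train_batch_size", "32",
    "--update_proj_gap", "500",
    "--learning_rate", "2e-5",
    "--num_train_epochs", "20",
]

scores = []
for seed in seeds:
    out_dir = f"results/ft/roberta_base/mrpc_seed{seed}"
    # One full fine-tuning run per seed, each writing its own all_results.json.
    subprocess.run(base_cmd + [f"--seed={seed}", "--output_dir", out_dir], check=True)
    with open(f"{out_dir}/all_results.json") as f:
        scores.append(json.load(f)["eval_accuracy"])

print("mean eval_accuracy over seeds:", statistics.mean(scores))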
Has anyone successfully replicated the results of the fine-tuning tasks? I followed the hyperparameters outlined in the README and the paper, and tried the cola and mrpc tasks on a single GPU without gradient accumulation. However, the results I obtained differed from those reported in the paper. Here are the best performances from my runs: