jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Can't reproduce the result of "Benchmark 2: Fine-Tuning RoBERTa on GLUE tasks" #25

Closed CrazyElements closed 3 months ago

CrazyElements commented 3 months ago

Has anyone successfully replicated the results of the fine-tuning tasks? I followed the hyperparameters outlined in the README and the paper, and tried the CoLA and MRPC tasks on a single GPU without gradient accumulation. However, the results I obtained differed from those reported in the paper. Here are the best performances from my runs:

jiaweizzhao commented 3 months ago

I am happy to help, @CrazyElements.
Can you provide the full training script and the hyperparameters you are using? You can also join our Slack for a quick discussion.

CrazyElements commented 3 months ago

Thanks @jiaweizzhao. For MRPC, I just used the hyperparameters listed in the README.

python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --enable_galore \
    --lora_all_modules \
    --max_length 512 \
    --seed=1234 \
    --lora_r 4 \
    --galore_scale 4 \
    --per_device_train_batch_size 16 \
    --update_proj_gap 500 \
    --learning_rate 3e-5 \
    --num_train_epochs 30 \
    --output_dir results/ft/roberta_base/mrpc

For the other tasks, I modified learning_rate and num_train_epochs; I trained with run_glue.py in all cases.

jiaweizzhao commented 3 months ago

I tried it and it works as expected. The issue might be that we report the F1 score for MRPC in the paper, which causes the confusion. I will change it back to accuracy in the new revision.
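
As a quick illustration of why the two numbers differ (a toy example, not GaLore code; assumes scikit-learn is installed):

python -c "
# toy example: the same set of predictions yields different accuracy and F1
from sklearn.metrics import accuracy_score, f1_score
y_true = [1, 1, 1, 0, 0, 1, 1, 0]
y_pred = [1, 1, 1, 1, 0, 1, 0, 0]
print('accuracy:', accuracy_score(y_true, y_pred))  # 0.75
print('f1:', f1_score(y_true, y_pred))  # 0.80
"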

CrazyElements commented 3 months ago

Thank you for your response, but I'm still unable to replicate the results. The final F1 score on MRPC is 91.93, and the matthews_correlation on CoLA is 59.6. By the way, did you use the eval-set results from the last epoch as the final outcomes? The results I mentioned above were extracted from all_results.json, which actually corresponds to the eval-set results of the last epoch.
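
For anyone checking the same thing, the last-epoch eval metrics can be read straight from that file (a minimal sketch; the path assumes the --output_dir from the command above):

python -c "
import json
# all_results.json is written by run_glue.py at the end of training;
# its eval_* entries are the final evaluation-set metrics
with open('results/ft/roberta_base/mrpc/all_results.json') as f:
    results = json.load(f)
for k, v in sorted(results.items()):
    if k.startswith('eval_'):
        print(k, v)
"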

jiaweizzhao commented 3 months ago

This might be due to the choice of the random seed. I did a quick sweep using my previous setup (based on the config you provided):

python run_glue.py \
    --model_name_or_path roberta-base \
    --task_name mrpc \
    --enable_galore \
    --lora_all_modules \
    --max_length 512 \
    --seed=1234 \
    --lora_r 4 \
    --galore_scale 16 \
    --per_device_train_batch_size 32 \
    --update_proj_gap 500 \
    --learning_rate 2e-5 \
    --num_train_epochs 20 \
    --output_dir results/ft/roberta_base/mrpc

This gives {"eval_accuracy": 0.8970588235294118, "eval_f1": 0.925531914893617}.

CrazyElements commented 3 months ago

> This might be due to the choice of the random seed

So I wonder if you used a different seed (not 1234)? Maybe I mistakenly assumed that the example command in the README would yield the same results. If you did use a different seed, would it be possible for you to open-source the fine-tuning script?

> --galore_scale 16 \
> --per_device_train_batch_size 32 \

And here I used the hyperparameters listed in Table 7.

jiaweizzhao commented 3 months ago

We use the average score of repeated runs. We will release the fine-tuning scripts later, along with a few more fine-tuning experiments.
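
Until those scripts are released, a repeated-runs sweep along these lines should reproduce the averaging (a minimal sketch based on the command above; the seed values and output paths are illustrative, not the authors' released script):

for seed in 1234 42 2024; do
    python run_glue.py \
        --model_name_or_path roberta-base \
        --task_name mrpc \
        --enable_galore \
        --lora_all_modules \
        --max_length 512 \
        --seed=$seed \
        --lora_r 4 \
        --galore_scale 16 \
        --per_device_train_batch_size 32 \
        --update_proj_gap 500 \
        --learning_rate 2e-5 \
        --num_train_epochs 20 \
        --output_dir results/ft/roberta_base/mrpc_seed$seed
done

# average the final F1 across the runs
python -c "
import glob, json, statistics
scores = [json.load(open(p))['eval_f1']
          for p in glob.glob('results/ft/roberta_base/mrpc_seed*/all_results.json')]
print('mean eval_f1 over', len(scores), 'runs:', statistics.mean(scores))
"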