princeton-nlp / MeZO

[NeurIPS 2023] MeZO: Fine-Tuning Language Models with Just Forward Passes. https://arxiv.org/abs/2305.17333
MIT License

Cannot reproduce the results for RoBERTa-large on SNLI with MeZO (LoRA) #39

Open Liu-M-H opened 4 weeks ago

Liu-M-H commented 4 weeks ago

Hi, I am using transformers==4.28.1 and torch==2.1.0.

I run the following command:

TASK=SNLI K=512 SEED=42 BS=64 LR=1e-4 EPS=1e-3 STEP=50000 MODEL=roberta-large EXTRA_TAG=lora bash mezo.sh --apply_lora --lora_r 8 --lora_alpha 16

My reproduced result is 72, but the paper reports 84.

I also found that, compared to FT, LoRA converges more slowly and is less stable. Could you provide more details on MeZO (LoRA), in particular how many iterations are needed for it to converge?

gaotianyu1350 commented 3 weeks ago

It seems that you are not using the correct hyperparameters. Please check Table 15 in our paper for the hyperparameters used in all the RoBERTa experiments. Also, all reported results are aggregated via a grid search over the hyperparameter combinations listed in Table 15.
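
For example, a sweep could look like the following. This is only a sketch using the same mezo.sh interface as your command; the LR values and seeds below are illustrative, so substitute the actual grid from Table 15, and the EXTRA_TAG value is just a label to tell runs apart:

for SEED in 42 13 21; do            # illustrative seeds
  for LR in 1e-5 5e-5 1e-4; do      # illustrative grid; use the values from Table 15
    TASK=SNLI K=512 SEED=$SEED BS=64 LR=$LR EPS=1e-3 STEP=100000 MODEL=roberta-large \
      EXTRA_TAG=lora-lr$LR-seed$SEED bash mezo.sh --apply_lora --lora_r 8 --lora_alpha 16
  done
done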

Liu-M-H commented 3 weeks ago

Thanks for your reply! Would you mind telling me the minimum number of iterations required for the RoBERTa experiments with MeZO (LoRA)?

gaotianyu1350 commented 3 weeks ago

All the RoBERTa experiments were run with 100K steps. Using fewer than that may lead to very different results.
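
For instance, your original command with only the step count changed (everything else as you had it, though LR and EPS should still be chosen from the Table 15 grid):

TASK=SNLI K=512 SEED=42 BS=64 LR=1e-4 EPS=1e-3 STEP=100000 MODEL=roberta-large EXTRA_TAG=lora bash mezo.sh --apply_lora --lora_r 8 --lora_alpha 16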