Liu-M-H opened this issue 4 weeks ago
It seems that you are not using the correct hyperparameters. Please check Table 15 in our paper for the hyperparameters used in all the RoBERTa experiments. Also, all reported results are aggregated via a grid search over multiple hyperparameter combinations, as shown in Table 15.
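For readers unfamiliar with this kind of aggregation, here is a minimal sketch (not the authors' code) of what "aggregated via grid search" means: every hyperparameter combination from a table like Table 15 is run, and the best validation score is the one reported. `run_trial` and the grid values are hypothetical placeholders.

```python
import itertools
import random

def run_trial(lr, eps, seed):
    # Placeholder: in practice this would launch a full MeZO(LoRA) run
    # with these hyperparameters and return its validation accuracy.
    random.seed(seed)
    return random.uniform(0.7, 0.9)

grid = {
    "lr": [1e-5, 5e-5, 1e-4],   # illustrative values only; see Table 15 for the real ones
    "eps": [1e-3],
    "seed": [13, 21, 42],
}

best_score, best_cfg = -1.0, None
for lr, eps, seed in itertools.product(*grid.values()):
    score = run_trial(lr, eps, seed)
    if score > best_score:
        best_score, best_cfg = score, {"lr": lr, "eps": eps, "seed": seed}

print(f"best validation score {best_score:.3f} with {best_cfg}")
```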
Thanks for your reply! Would you mind telling me the minimum number of iterations required for RoBERTa experiments with MeZO(LoRA)?
All the RoBERTa experiments were run for 100K steps. Using fewer than that may lead to very different results.
Hi, I'm using transformers==4.28.1 and torch==2.1.0.
I ran the following command:
My reproduced result is 72, but the paper reports 84.
I also found that, compared to FT, MeZO with LoRA converges more slowly and is less stable. Could you provide more details on MeZO(LoRA), especially how many iterations are required for it to converge? (A sketch of the update I'm referring to is below.)
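To make the question concrete, this is roughly the update I mean by a MeZO step restricted to the LoRA parameters: perturb the trainable weights with random noise, evaluate the loss twice, and move along the noise direction scaled by the finite-difference estimate. This is a hedged sketch assuming a PyTorch model with LoRA modules whose parameter names contain "lora"; `loss_fn` and `batch` are hypothetical, and this is not the authors' implementation.

```python
import torch

def mezo_lora_step(model, loss_fn, batch, lr=1e-4, eps=1e-3, seed=0):
    # Only the LoRA matrices are trainable; all other weights stay frozen.
    params = [p for n, p in model.named_parameters() if "lora" in n]

    def perturb(scale):
        # Re-seeding regenerates the same noise z without storing it.
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1)                       # theta + eps * z
        loss_plus = loss_fn(model, batch)
        perturb(-2)                       # theta - eps * z
        loss_minus = loss_fn(model, batch)
        perturb(+1)                       # restore theta

        # Scalar finite-difference estimate of the directional derivative,
        # then an SGD-style update along the same noise direction z.
        grad_est = (loss_plus - loss_minus) / (2 * eps)
        torch.manual_seed(seed)
        for p in params:
            z = torch.randn_like(p)
            p.data.add_(-lr * grad_est * z)

    return loss_plus.item(), loss_minus.item()
```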