princeton-nlp / MeZO

[NeurIPS 2023] MeZO: Fine-Tuning Language Models with Just Forward Passes. https://arxiv.org/abs/2305.17333
MIT License

Full finetuning with Roberta-Large #40

Open aparna-aketi opened 1 month ago

aparna-aketi commented 1 month ago

I want to run full fine-tuning with RoBERTa-large. The README file suggests using the following command:

# Adam fine-tuning
TASK=SST-2 K=16 SEED=42 BS=8 LR=1e-5 MODEL=roberta-large bash finetune.sh

However, the type parameter defaults to TYPE:-"prompt". Shouldn't it be set to "finetune"?
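
(For reference, a minimal sketch of how that default works and how one might override it; the exact variable handling inside finetune.sh may differ, so treat the TYPE=finetune invocation as a hypothetical example rather than a documented option.)

# TYPE:-"prompt" is bash default-value expansion: use $TYPE if it is set, otherwise fall back to "prompt"
TYPE=${TYPE:-"prompt"}

# Hypothetical override from the command line, assuming finetune.sh reads TYPE this way
TASK=SST-2 K=16 SEED=42 BS=8 LR=1e-5 MODEL=roberta-large TYPE=finetune bash finetune.sh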

gaotianyu1350 commented 1 month ago

Hi,

Here "prompt" just means to prompt-based fine-tuning (https://arxiv.org/abs/2012.15723), a very standard way to fine-tuning language models nowadays.

aparna-aketi commented 1 month ago

Hi, thanks for the response. Just for clarification: in Figure 2 of the MeZO paper, does FT correspond to full fine-tuning or prompt-based fine-tuning? I want to reproduce the results in that figure.

gaotianyu1350 commented 1 month ago

Hi, everything we report uses prompt-based fine-tuning, since that gives much better performance.

aparna-aketi commented 1 month ago

Okay, thanks for the clarification. One more question: mezo.sh sets the number of steps to 100k, while run_fewshot.sh uses 1,000 steps. So in Figure 2, is MeZO run for 100x more steps than FT? That doesn't seem like a fair comparison, since MeZO uses 100x as many steps. Even if we consider a backward pass to be 2x as expensive as a forward pass, MeZO should only need 3x the steps for a fair comparison with FT. It would be great if you could provide some insights here.
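
(A rough back-of-the-envelope count of the compute involved, assuming a backward pass costs roughly two forward passes and that MeZO performs two forward passes per step as described in the paper; the step counts are taken from the scripts mentioned above.)

# Approximate compute in forward-pass equivalents (illustrative only)
FT_STEPS=1000        # run_fewshot.sh
MEZO_STEPS=100000    # mezo.sh
FT_COST=$(( FT_STEPS * 3 ))       # forward + backward (~2 forwards) per FT step
MEZO_COST=$(( MEZO_STEPS * 2 ))   # two forward passes per MeZO step
echo "FT:   ${FT_COST} forward-pass equivalents"    # 3000
echo "MeZO: ${MEZO_COST} forward-pass equivalents"  # 200000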

gaotianyu1350 commented 3 weeks ago

Hi,

Yes, MeZO is run with 100x more steps than FT, so it is not a fair comparison in terms of wall-clock time. The RoBERTa-large experiments are mainly meant to showcase that it is possible to train models without backpropagation (which saves a lot of memory).