huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

Confusion about correct learning rate when running contrastive fine-tuning #208

Closed: mfluegge closed this issue 1 year ago

mfluegge commented 1 year ago

Hey,

first off - great work, I really like the idea of SetFit! I like it so much, in fact, that I want to reproduce it :grin:

In the paper you mention that you use a learning rate of 1e-3 for all the few-shot experiments. Looking at scripts/setfit/run_fewshot.py, I can see that this is indeed the argparse default value for the lr parameter:

https://github.com/huggingface/setfit/blob/fa1021d2355f0cb3a2c85732ee7ffe44b0cef0d1/scripts/setfit/run_fewshot.py#L50

However, when the trainer instance is created and train is called, the lr parameter is not actually passed:

https://github.com/huggingface/setfit/blob/fa1021d2355f0cb3a2c85732ee7ffe44b0cef0d1/scripts/setfit/run_fewshot.py#L121-L131

https://github.com/huggingface/setfit/blob/fa1021d2355f0cb3a2c85732ee7ffe44b0cef0d1/scripts/setfit/run_fewshot.py#L143-L144

This means the trainer falls back to its default value for that parameter, which is actually 2e-5:

https://github.com/huggingface/setfit/blob/fa1021d2355f0cb3a2c85732ee7ffe44b0cef0d1/src/setfit/trainer.py#L87

In my local reproduction of SetFit, 2e-5 also works a lot better than 1e-3; with 1e-3 the results can be extremely unstable. Could you confirm that 2e-5 is the recommended value to use rather than 1e-3?
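
For reference, this is roughly how I pass the learning rate explicitly in my own reproduction (a minimal sketch based on the SetFitTrainer signature in src/setfit/trainer.py; the checkpoint and the toy data below are just placeholders, not the paper setup):

```python
from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Tiny placeholder 8-shot binary dataset, only to make the snippet runnable.
train_dataset = Dataset.from_dict({
    "text": ["great movie", "loved it", "terrible film", "waste of time"] * 2,
    "label": [1, 1, 0, 0] * 2,
})

model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    loss_class=CosineSimilarityLoss,
    num_iterations=20,
    learning_rate=2e-5,  # passed explicitly; run_fewshot.py omits this kwarg,
                         # so the trainer's default of 2e-5 is what actually runs, not 1e-3
)
trainer.train()
```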

Thanks in advance!

tomaarsen commented 1 year ago

Hello!

Well spotted! That is indeed a bug in scripts/setfit/run_fewshot.py. In my personal experiments, values around 2e-5 have worked the best; using trainer.apply_hyperparameters, I found that 3e-5 works well for me. My quick tests just now with 1e-3 produced only very poor results: on a binary classification task using the sst2 dataset, 2e-5 reaches an accuracy of approximately 85%, while a learning rate of around 1e-3 gives ~50%, which is obviously useless for a binary classification task.
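
For the curious, that quick check looked roughly like this (a sketch rather than my exact script; the SetFit/sst2 dataset copy, the checkpoint, and the 16-example sample are illustrative choices):

```python
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

# Few-shot sst2 sample; only the learning rate differs between the two runs.
sst2 = load_dataset("SetFit/sst2")
train_dataset = sst2["train"].shuffle(seed=42).select(range(16))
eval_dataset = sst2["test"]

for learning_rate in (2e-5, 1e-3):
    model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
    trainer = SetFitTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss_class=CosineSimilarityLoss,
        num_iterations=20,
        learning_rate=learning_rate,
    )
    trainer.train()
    # Roughly ~0.85 accuracy at 2e-5, near chance (~0.50) at 1e-3 in my runs.
    print(learning_rate, trainer.evaluate())
```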

To reproduce the paper results, I would advise sticking with 2e-5.

@lewtun I believe this implies that the learning rate mentioned in the paper is incorrect, sadly.

tomaarsen commented 1 year ago

I believe this has been cleared up now, so I'll close this :)