[Q] How to ensure reproducibility

huggingface / setfit

Efficient few-shot learning with Sentence Transformers

https://hf.co/docs/setfit

Apache License 2.0


[Q] How to ensure reproducibility #432

Closed youngjin-lee closed 11 months ago

youngjin-lee commented 1 year ago

Can someone explain how to ensure reproducibility of a pre-trained model ("sentence-transformers/paraphrase-mpnet-base-v2")?

I thought the results would be reproducible because SetFitTrainer() has a default random seed in its constructor, but found that this was not the case. The SetFitTrainer source code says that "to ensure reproducibility across runs, I need to use [~SetTrainer.model_init] function to instantiate the model". However, I don't understand what that entails.

Is there an example that I can follow?

Any help would be highly appreciated.

Thanks,

nitish1295 commented 1 year ago

For Hugging Face I usually refer to the following for reproducibility:

I haven't looked into this too much, but I assume some of it also applies to SetFit.
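
As a rough illustration of what the docstring's advice about model_init seems to mean, here is a toy sketch in plain Python. It does not use SetFit at all: ToyModel and toy_train are hypothetical names invented for this example, and the standard random module stands in for the framework's random number generators. The idea, as I understand it, is that a model with randomly initialized parameters should be created after the trainer has set the seed, which is what passing a factory function (rather than an already built model) allows. In SetFit the corresponding pieces would presumably be the seed and model_init arguments of SetFitTrainer, but treat that mapping as an assumption.

```python
import random


class ToyModel:
    """Hypothetical stand-in for a model with a randomly initialized head."""

    def __init__(self, n_weights=4):
        # Weights are drawn from the global RNG at construction time.
        self.weights = [random.gauss(0.0, 1.0) for _ in range(n_weights)]


def toy_train(seed, model=None, model_init=None):
    """Hypothetical trainer: seed first, then build the model if a factory is given."""
    random.seed(seed)
    if model_init is not None:
        # The model is created after seeding, so its initial weights are deterministic.
        model = model_init()
    # "Training": nudge each weight with seeded noise.
    return [w + 0.1 * random.random() for w in model.weights]


# Model built before the trainer seeds the RNG: its weights depend on
# whatever state the RNG happened to be in at that moment.
run_a = toy_train(seed=42, model=ToyModel())
run_b = toy_train(seed=42, model=ToyModel())

# Model built by the trainer after seeding: every run starts from the same weights.
run_c = toy_train(seed=42, model_init=ToyModel)
run_d = toy_train(seed=42, model_init=ToyModel)

print("pre-built model reproducible:", run_a == run_b)
print("model_init reproducible:", run_c == run_d)
```

In the toy version, the two runs that pass model_init=ToyModel produce identical results, while the two runs that pass a pre-built model generally do not, even though the seed is the same in all four calls.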

tomaarsen commented 12 months ago

Hello!

I was able to reproduce your findings, and have applied a fix (https://github.com/huggingface/setfit/pull/439/commits/5b39f062d1f3c4b684703af389c88806931b0681) in preparation for the upcoming SetFit v1.0.0 release, which will resolve this issue. If you want to use it right away, you can install the bleeding-edge development version via:

pip install git+https://github.com/huggingface/setfit.git@v1.0.0-pre

See the preliminary docs for v1.0.0 here.

Tom Aarsen