r-three / t-few

Code for T-Few from "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning"
MIT License
429 stars 59 forks source link

Validation score on WSC decreases with training #18

Closed Raldir closed 2 years ago

Raldir commented 2 years ago

Thank you for the amazing work on t-few! I've noticed strange behavior when I am running superglue's wsc. I've been logging the validation score every 40 epochs using self.eval_epoch_interval = 40 and when running the command: python -m src.pl_train -c ia3.json+wsc.json -k save_model=False exp_name=first_exp the output is as following:

{"accuracy": 0.6730769230769231, "score_gt": 0.5068197436630726, "score_cand": 0.7191649047801127} {"accuracy": 0.49038461538461536, "score_gt": 1.4563168384707892, "score_cand": 1.505529030584372} {"accuracy": 0.47115384615384615, "score_gt": 3.4743554890155792, "score_cand": 2.727144861450562} {"accuracy": 0.46153846153846156, "score_gt": 4.202766236777489, "score_cand": 3.5702959763316007} {"accuracy": 0.40384615384615385, "score_gt": 5.157541000499175, "score_cand": 3.5657502871293287} {"accuracy": 0.3942307692307692, "score_gt": 5.397989429533482, "score_cand": 3.975659689651086} {"accuracy": 0.40384615384615385, "score_gt": 5.073869264469697, "score_cand": 3.995581218542961}

The last accuracy score is reported at 240 epochs out of a total 250 epochs.

Any ideas on what is going on here? Thanks!

HaokunLiu commented 2 years ago

I can try running this experiment maybe later half of this week. Meanwhile, I remember WSC to be a tricky dataset that often produces unstable results. Would you mind running it with a few other seeds and seeing if this behavior persists?

HaokunLiu commented 2 years ago

And btw, is this just WSC? Do other datasets have this problem?

Raldir commented 2 years ago

Hi Haokun, thank you for the response. Indeed, after changing the seed results are more as expected. I have been having similar problems with WiC, but again it appears to be caused by the variability of the seed. RTE seems more stable.