nyu-mll / jiant-v1-legacy

The jiant toolkit for general-purpose text understanding models
MIT License

Eval tasks not fully independent #140

Open jeswan opened 3 years ago

jeswan commented 3 years ago

Issue by iftenney Thursday Jul 12, 2018 at 02:03 GMT. Originally opened as https://github.com/nyu-mll/jiant/issues/140


Running demo.conf with eval_tasks = "sts-b,cola" produces different results on sts-b than running with eval_tasks = "sts-b".

Example commands:

python main.py -c config/demo.conf -o 'exp_name = demo-base'  # evals on sts-b only
python main.py -c config/demo.conf -o 'exp_name = demo-plus, eval_tasks = "sts-b,cola"'

Results are in /nfs/jsalt/home/iftenney/eval_diff, run at commit 1741812b

demo-base gets sts-b_spearmanr: 0.685, while demo-plus gets sts-b_spearmanr: 0.679. Not a large difference, but we should figure out why there's any interaction at all to rule out anything pernicious.

I suspect this is due to RNG seeding causing different initialization (or data feeding?) when multiple models are initialized. We can test this by re-seeding the RNG (deterministically by task) before initializing the model for each task, instead of using the global seed.
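A minimal sketch of the per-task re-seeding idea, not the actual jiant code. `task_seed`, `init_weights_for_task`, and `run` are hypothetical names, and the "model" is just a few Gaussian draws standing in for real parameter initialization. The per-task seed is derived with `hashlib` rather than the built-in `hash()`, because Python salts string hashes per process, which would make the seed differ between runs.

```python
import hashlib
import random


def task_seed(global_seed: int, task_name: str) -> int:
    """Derive a stable 32-bit seed from the global seed and the task name."""
    # sha256 is deterministic across processes, unlike built-in hash() on str.
    digest = hashlib.sha256(f"{global_seed}:{task_name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def init_weights_for_task(global_seed: int, task_name: str, n: int = 3) -> list:
    """Hypothetical stand-in for initializing the model for one task."""
    # A dedicated RNG per task, so draws made for other tasks cannot shift it.
    # In a real setup one would presumably re-seed the framework RNGs here
    # as well (an assumption; the issue does not say which RNGs are involved).
    rng = random.Random(task_seed(global_seed, task_name))
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def run(eval_tasks, global_seed=1234):
    """Initialize every task in eval_tasks and return its weights by name."""
    return {task: init_weights_for_task(global_seed, task) for task in eval_tasks}


alone = run(["sts-b"])
together = run(["sts-b", "cola"])
# sts-b gets the same initialization whether or not cola is also present.
assert alone["sts-b"] == together["sts-b"]
```

If the discrepancy disappears under this kind of seeding, that would point to RNG state as the source of the interaction; if it persists, the cause is likely elsewhere (for example, in data feeding).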

jeswan commented 3 years ago

Comment by sleepinyourhat Thursday Jul 12, 2018 at 15:27 GMT


I'll argue that this is a low priority, since squishing this kind of variation gives only an illusion of stability. If we fix this problem and train a model twice, but with some tiny edit to the training set in one ('the' => 'a' in one example, maybe), we'd get similarly large differences.

Only do it if it's a crucial aid in some other debugging effort.