Open jeswan opened 3 years ago
Comment by sleepinyourhat Thursday Jul 12, 2018 at 15:27 GMT
I'll argue that this is a low priority, since squishing this kind of variation gives only an illusion of stability. If we fix this problem and train a model twice, but with some tiny edit to the training set in one ('the' => 'a' in one example, maybe), we'd get similarly large differences.
Only do it if it's a crucial aid in some other debugging effort.
Issue by iftenney Thursday Jul 12, 2018 at 02:03 GMT Originally opened as https://github.com/nyu-mll/jiant/issues/140
Running demo.conf with eval_tasks = "sts-b,cola" produces different results on sts-b than running with eval_tasks = "sts-b".
Example commands:
Results are in /nfs/jsalt/home/iftenney/eval_diff, run at commit 1741812b. demo-base gets sts-b_spearmanr: 0.685, while demo-plus gets sts-b_spearmanr: 0.679. Not a large difference, but we should figure out why there's any interaction at all to rule out anything pernicious.
I suspect this is due to RNG seeding causing different initialization (or data feeding?) when multiple models are initialized. We can test this by re-seeding the RNG (deterministically by task) before initializing the model for each task, instead of using the global seed.
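A rough sketch of the per-task re-seeding test proposed above, in case it helps. It is not jiant code: task_seed, init_weights, and build_models are hypothetical names, and the stdlib random module stands in for the real model initialization. In this toy the interaction only shows up when another task draws from the RNG before sts-b; where the real interaction comes from is still the open question.

```python
import hashlib
import random


def task_seed(global_seed: int, task_name: str) -> int:
    """Derive a stable 32-bit seed from the global seed and the task name.

    hashlib is used instead of the built-in hash(), because hash() on
    strings is randomized per process and would not be reproducible.
    """
    key = f"{global_seed}:{task_name}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def init_weights(n: int) -> list:
    """Stand-in for model initialization: draws n values from the global RNG."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


def build_models(tasks, global_seed: int, reseed_per_task: bool) -> dict:
    """Initialize one toy 'model' per task, optionally re-seeding before each."""
    random.seed(global_seed)
    models = {}
    for name in tasks:
        if reseed_per_task:
            # Assumption: a real run would also re-seed the framework RNGs
            # here (e.g. torch.manual_seed, numpy.random.seed).
            random.seed(task_seed(global_seed, name))
        models[name] = init_weights(4)
    return models
```

With reseed_per_task=True, the sts-b weights come out the same whether or not other tasks are in the list. If the real runs still disagree after a change like this, initialization order is probably not the (only) cause and data feeding would be the next thing to check.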