Created a mixin so that trainer.evaluate() loops over a list of evaluation sets (sketched below)
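A minimal sketch of the idea, assuming a Hugging Face-style `Trainer` whose `evaluate()` accepts `eval_dataset` and `metric_key_prefix`; the class and argument names here are illustrative, not the repo's exact ones:

```python
from typing import Dict, List, Optional

from torch.utils.data import Dataset


class MultiEvalMixin:
    """Make evaluate() loop over a list of evaluation sets."""

    def evaluate(
        self,
        eval_datasets: Optional[List[Dataset]] = None,
        eval_prefixes: Optional[List[str]] = None,
        **kwargs,
    ) -> Dict[str, float]:
        eval_datasets = eval_datasets or [self.eval_dataset]
        eval_prefixes = eval_prefixes or [f"eval_{i}" for i in range(len(eval_datasets))]
        metrics: Dict[str, float] = {}
        for dataset, prefix in zip(eval_datasets, eval_prefixes):
            # Delegate each set to the parent evaluate(), namespacing the
            # metric keys so the sets do not overwrite each other.
            metrics.update(
                super().evaluate(eval_dataset=dataset, metric_key_prefix=prefix, **kwargs)
            )
        return metrics
```

Mixed in ahead of the base class (e.g. `class MultiEvalTrainer(MultiEvalMixin, Trainer): pass`) so the MRO routes `evaluate()` through the mixin first.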
Refactored and fixed various checks, including new checks to ensure the mixin above is used when fine-tuning MNLI
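A hypothetical shape for such a check, reusing `MultiEvalMixin` from the sketch above; the function name and task-name convention are assumptions:

```python
def check_trainer_supports_task(trainer, task_name: str) -> None:
    # MNLI ships matched and mismatched validation sets, so it needs a
    # trainer that can evaluate on more than one set.
    if task_name.lower() == "mnli" and not isinstance(trainer, MultiEvalMixin):
        raise ValueError(
            f"Task {task_name!r} has multiple evaluation sets; "
            "use a trainer that includes MultiEvalMixin."
        )
```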
Making a good default behavior for handling MNLI required several adjustments, including new run_args and updates to task_hyperparams in run.py
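One way the default could look, assuming the GLUE data is loaded via the `datasets` library (MNLI does ship `validation_matched` and `validation_mismatched` splits); the function name is illustrative:

```python
from datasets import load_dataset


def default_eval_splits(task_name: str):
    """Return the evaluation splits to use for a GLUE task."""
    data = load_dataset("glue", task_name)
    if task_name == "mnli":
        # Evaluate on both MNLI validation sets by default.
        return {
            "validation_matched": data["validation_matched"],
            "validation_mismatched": data["validation_mismatched"],
        }
    return {"validation": data["validation"]}
```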
Added configs for numerous experiments
Removed some configs that are no longer in use to avoid confusion, and added labels
Gathered hyperparameters for small BERTs
Added hyperparameter analysis code so that hyperparameters tuned on one task can serve as a proxy for another (sketched below)
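A minimal sketch of what such an analysis can look like, assuming per-configuration validation scores on two tasks: a high Spearman rank correlation suggests hyperparameters selected on one task transfer to the other. The data layout and function name are assumptions:

```python
from typing import Dict

from scipy.stats import spearmanr


def hp_proxy_correlation(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Rank-correlate validation scores of shared hyperparameter configs."""
    shared = sorted(set(scores_a) & set(scores_b))
    rho, _ = spearmanr([scores_a[k] for k in shared],
                       [scores_b[k] for k in shared])
    return rho
```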
Updated the prediction and link prediction components of run_utils to handle multiple test sets
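A hedged sketch of the multi-test-set handling, again assuming a Hugging Face-style `trainer.predict()`; the helper name and output layout are illustrative:

```python
import os
from typing import Dict

from torch.utils.data import Dataset


def predict_all(trainer, test_datasets: Dict[str, Dataset], output_dir: str) -> None:
    """Run prediction on every test set and write one file per set."""
    for name, dataset in test_datasets.items():
        output = trainer.predict(dataset, metric_key_prefix=f"test_{name}")
        labels = output.predictions.argmax(axis=-1)
        with open(os.path.join(output_dir, f"predictions_{name}.txt"), "w") as f:
            f.writelines(f"{label}\n" for label in labels)
```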