I added a TrackEvalMetrics callback so that every time trainer.evaluate() is called, the callback stores the evaluation metrics.
I set the default callbacks in finetuning.py to [TrackEvalMetrics, EarlyStoppingCallback].
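For reference, here is a minimal sketch of the idea, assuming the standard transformers TrainerCallback/Trainer API; the real TrackEvalMetrics implementation and the way finetuning.py builds its default callback list may differ:

```python
from collections import defaultdict

from transformers import EarlyStoppingCallback, Trainer, TrainerCallback


class TrackEvalMetrics(TrainerCallback):
    """Sketch: record the metrics dict from every trainer.evaluate() call."""

    def __init__(self):
        # metric name -> list of values, one entry per evaluation call
        self.eval_metrics = defaultdict(list)
        self.steps = []

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        # `metrics` is the dict produced by trainer.evaluate(),
        # e.g. {"eval_loss": ..., "eval_accuracy": ...}
        self.steps.append(state.global_step)
        for name, value in (metrics or {}).items():
            self.eval_metrics[name].append(value)


# Registering both callbacks on the Trainer
# (model, training_args, and datasets are assumed to exist):
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[TrackEvalMetrics(), EarlyStoppingCallback(early_stopping_patience=5)],
)
```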
I adjusted the TaskResults class in run_utils.py to handle combinations of the above callbacks while staying backward compatible. If TrackEvalMetrics is used, the reduce method reduces both over runs and over evaluation calls within each run.
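Roughly, the two-level reduction looks like this (a sketch only; the real TaskResults interface isn't shown, and the choice of max over eval calls followed by mean/std over runs is an assumption):

```python
import numpy as np

# One row per run, one column per evaluation call within that run
# (the values TrackEvalMetrics would have collected at each eval_steps interval).
eval_accuracy = np.array([
    [0.71, 0.78, 0.82, 0.81],  # run 0
    [0.70, 0.77, 0.83, 0.80],  # run 1
    [0.72, 0.79, 0.81, 0.82],  # run 2
])

# Reduce over evaluation calls within each run, then over runs.
best_per_run = eval_accuracy.max(axis=1)         # shape: (num_runs,)
print(best_per_run.mean(), best_per_run.std())   # aggregate across runs
```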
Early stopping requires load_best_model_at_end, metric_for_best_model, and evaluation_strategy, so I updated the finetuning configs to include defaults for these. eval_steps and early_stopping_patience should still be set manually.
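For example, the relevant TrainingArguments fields look like this (the concrete values here are placeholders, and the exact shape of the finetuning configs is not shown):

```python
from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    # Defaults now provided by the finetuning configs:
    load_best_model_at_end=True,
    metric_for_best_model="eval_accuracy",
    evaluation_strategy="steps",
    # Still set manually, per experiment:
    eval_steps=500,
)

# early_stopping_patience is also chosen manually, per experiment:
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```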
export_finetuning_results.py works as before, but if you unpickle the task_results.p file, you'll have access to the time series of evaluation metrics.
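Inspecting the time series afterward is just a matter of unpickling; the internal structure of the object depends on TaskResults, so the comment below is only indicative:

```python
import pickle

with open("task_results.p", "rb") as f:
    task_results = pickle.load(f)

# e.g. per-run metric_name -> [value at each trainer.evaluate() call]
print(task_results)
```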
In the special case of mnli finetuning, there are two eval sets. The primitive way to deal with this for now is to track the "matched" eval set metrics during training and evaluate the "mismatched" set once at the end of training. The metric_key_prefix argument is set to "mm" in this case, so TrackEvalMetrics can recognize this one case and handle it with a separate data structure. Note that if early stopping is enabled, load_best_model_at_end is also enabled, so the correct model will be evaluated on the mismatched set either way.
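A sketch of that flow (dataset variable names are placeholders; metric_key_prefix is a standard argument to trainer.evaluate()):

```python
# During training, trainer.eval_dataset is the "matched" validation split, so
# TrackEvalMetrics records the matched metrics at every evaluation call.
trainer.train()

# With load_best_model_at_end=True the best checkpoint is restored after training,
# so this single evaluation of the "mismatched" split uses the right model. The
# "mm" prefix lets TrackEvalMetrics route these metrics to a separate structure.
mm_metrics = trainer.evaluate(
    eval_dataset=mnli_mismatched_dataset,  # placeholder name
    metric_key_prefix="mm",
)
```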