If we have to sample y instead of using an exact expectation, then we should definitely look at the impact of the number of samples (both estimators are sketched after these notes).
Task2Vec used some sort of variational approximation when estimating its Fisher-based embeddings; worth revisiting.
SWAG has something on approximating the loss geometry and the posterior via the training trajectory.
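For concreteness, a minimal sketch of the two estimators in PyTorch; it assumes `model` is a classifier that maps an input tensor to logits, and `diag_fisher` and its arguments are hypothetical names rather than anything from an existing codebase.

```python
import torch
import torch.nn.functional as F

def diag_fisher(model, examples, num_label_samples=None):
    """Diagonal Fisher of a classifier over a set of examples.

    num_label_samples=None takes the exact expectation over the model's
    predictive distribution p(y|x); an integer instead estimates it from
    that many sampled labels y ~ p(y|x) per example.
    """
    fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    for x in examples:  # one example at a time for clarity, not speed
        with torch.no_grad():
            probs = F.softmax(model(x.unsqueeze(0)), dim=-1).squeeze(0)

        if num_label_samples is None:
            # Exact: sum_y p(y|x) * (grad log p(y|x))^2
            labels = list(range(probs.shape[-1]))
            weights = probs.tolist()
        else:
            # Monte Carlo: average over labels sampled from p(y|x)
            labels = torch.multinomial(probs, num_label_samples, replacement=True).tolist()
            weights = [1.0 / num_label_samples] * num_label_samples

        for y, w in zip(labels, weights):
            model.zero_grad()
            log_probs = F.log_softmax(model(x.unsqueeze(0)), dim=-1).squeeze(0)
            log_probs[y].backward()
            for name, p in model.named_parameters():
                if p.grad is not None:
                    fisher[name] += w * p.grad.detach() ** 2

    return {name: f / len(examples) for name, f in fisher.items()}
```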
Regularizers during fine-tuning
None
Isotropic L2 distance from the pretrained checkpoint.
EWC-style (Fisher-weighted) L2 distance from the pretrained checkpoint.
Quick note: I may have seen performance gains from these regularizers in traditional single-task fine-tuning a while back; something to look out for. Both penalties are sketched below.
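A rough sketch of how either penalty could be added to the fine-tuning loss (PyTorch; `pretrained_params` is assumed to be a dict of detached checkpoint tensors, and all names are placeholders). The isotropic penalty is just the EWC-style penalty with the Fisher replaced by ones.

```python
import torch

def distance_penalty(model, pretrained_params, fisher=None, coeff=1e-3):
    """Penalty pulling fine-tuned weights back toward the pretrained checkpoint.

    fisher=None gives the isotropic L2 penalty; a dict of per-parameter
    diagonal Fishers gives the EWC-style weighted penalty.
    """
    penalty = 0.0
    for name, p in model.named_parameters():
        diff = p - pretrained_params[name]
        weight = fisher[name] if fisher is not None else 1.0
        penalty = penalty + (weight * diff ** 2).sum()
    return coeff * penalty

# During fine-tuning (task_loss computed as usual):
#   loss = task_loss + distance_penalty(model, pretrained_params, fisher=pretrained_fisher)
```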
Effect of number of fine-tuning steps on merging performance/boost.
Picking the best checkpoint for each task vs. fine-tuning for a fixed number of steps.
See what happens when we have a large data set (maybe QQP) where we need to fine-tune for a long time for optimal performance. (Probably explore this in concert with regularization.)
Scaling
BERT-base vs BERT-large (and perhaps BERT-small)
Different-sized SimCLR ResNets.
Choosing model weightings when merging.
Merging is very fast, so a brute-force grid search might be enough.
Grid search is exponential in the number of tasks, so possibly explore alternatives when doing a many-way merge (a two-model version of the search is sketched below).
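As a reference point, a minimal sketch of the two-model brute-force search. It assumes each model's parameters and diagonal Fisher are stored as name-to-array dicts, and that `evaluate` is a hypothetical function returning a validation score for a set of merged parameters.

```python
import numpy as np

def fisher_merge(params_list, fishers_list, weightings):
    """Fisher-weighted average of models:
    theta*[p] = sum_i w_i * F_i[p] * theta_i[p] / sum_i w_i * F_i[p]."""
    merged = {}
    for name in params_list[0]:
        numer = sum(w * f[name] * p[name]
                    for w, f, p in zip(weightings, fishers_list, params_list))
        denom = sum(w * f[name] for w, f in zip(weightings, fishers_list))
        merged[name] = numer / (denom + 1e-12)  # guard against zero Fisher entries
    return merged

def grid_search_weighting(params_list, fishers_list, evaluate, num_points=21):
    """Brute-force search over the relative weighting of two models."""
    best_w, best_score = None, -np.inf
    for w in np.linspace(0.0, 1.0, num_points):
        merged = fisher_merge(params_list, fishers_list, [w, 1.0 - w])
        score = evaluate(merged)  # hypothetical: validation metric on the target task
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

With K models the analogous grid has on the order of num_points^(K-1) points, which is where the alternatives mentioned above would come in.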
Choosing which models to merge.
See how task and domain similarity affects the boost and performance.
See if any models tend to be "good" in general to merge with and if any tend to be "bad" in general.
Choosing weightings for a many-model merge could address this.
Initializing fine-tuning with merged weights.
Single-task fine-tuning.
Will need further fine-tuning of the checkpoint (with no merging) as a baseline.
Multi-task fine-tuning
Are we going for single-task or multi-task performance?
Informing the mixing proportions via the merging weightings.
EWC regularizer anchored at the merged checkpoint.
Use the same weightings as were used in the merge?
Try using only the target task's Fisher for the regularizer (both variants are sketched after this list).
Freezing the body and fine-tuning classification heads to get a baseline for what improvement we could get by updating the heads during the merge.
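To make the two regularizer variants above concrete, a small sketch (hypothetical names; name-to-array dicts as in the merging sketch) of the precision that would multiply the squared distance from the merged checkpoint during further fine-tuning.

```python
def ewc_precision_from_merge(fishers_list, weightings, target_idx=None):
    """Per-parameter precision for an EWC penalty anchored at the merged weights.

    target_idx=None combines all constituent task Fishers using the merge
    weightings; an integer uses only that task's Fisher.
    """
    if target_idx is not None:
        return fishers_list[target_idx]
    return {name: sum(w * f[name] for w, f in zip(weightings, fishers_list))
            for name in fishers_list[0]}

# Penalty added to the fine-tuning loss, with theta the current parameters and
# theta_star the merged checkpoint used as the initialization:
#   coeff * sum over names of (precision[name] * (theta[name] - theta_star[name]) ** 2).sum()
```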
Informing mixing proportions
See if optimal task weightings from merging can inform optimal mixing proportions for traditional multi-task fine-tuning.
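A tiny sketch of one way to read this, under the assumption that the merge weightings can simply be normalized into per-task sampling probabilities for multi-task fine-tuning batches (names are placeholders).

```python
import numpy as np

def mixing_proportions(weightings):
    """Normalize merge weightings into sampling proportions over tasks."""
    w = np.asarray(weightings, dtype=np.float64)
    return w / w.sum()

# Example: weightings found by the merging search -> batch-sampling probabilities.
rng = np.random.default_rng(0)
proportions = mixing_proportions([0.7, 0.2, 0.1])
next_task = rng.choice(len(proportions), p=proportions)  # task for the next batch
```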
Undoing catastrophic forgetting
Add more details here, but the basic idea is to merge a model fine-tuned from a checkpoint back with the original checkpoint. We can also try merging with the MLM checkpoint with the goal of a performance boost on the fine-tuned task.
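As a concrete instance, a sketch of the two-way case: a per-parameter, Fisher-weighted interpolation between the fine-tuned model and the original checkpoint, with the coefficient swept to trade off recovering pretrained behavior against keeping fine-tuned performance. All names are placeholders; the checkpoint's Fisher could be computed on, e.g., the MLM objective, or set to ones for an isotropic merge.

```python
def merge_with_original_checkpoint(finetuned, ft_fisher, checkpoint, ckpt_fisher, lam):
    """Per-parameter Fisher-weighted interpolation between a fine-tuned model
    (weight lam) and its original pretrained checkpoint (weight 1 - lam)."""
    merged = {}
    for name in finetuned:
        numer = (lam * ft_fisher[name] * finetuned[name]
                 + (1.0 - lam) * ckpt_fisher[name] * checkpoint[name])
        denom = lam * ft_fisher[name] + (1.0 - lam) * ckpt_fisher[name]
        merged[name] = numer / (denom + 1e-12)
    return merged

# Sweep lam in [0, 1]: lam = 1 recovers the fine-tuned model, lam = 0 the checkpoint.
```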
Baselines
Performance of the model we are merging from (single-task baseline).
Freezing the body? (Needed if looking at multi-task performance. Not needed if looking at single-task boost.)
Multi-task fine-tuning.
Mixing proportions?
Applicability
Transformer on GLUE.
SimCLR on image tasks.
Pretrained ImageNet model? (less likely)
Performance boost on SoTA models
Using T5 as an example, take the SoTA fine-tuned checkpoints, compute their Fishers, and then do our merging procedure to boost performance.
Ideally, we'd have access to the SoTA fine-tuned checkpoints.
Other
Release common checkpoints with Fishers (or just the Fisher if the checkpoint is public)?
Community site for people to release their own?
Create an easy-to-use script for people to try this on their own models.