If we have to sample y instead of using an exact expectation, then we should definitely look at the impact of the number of samples (both estimators are sketched after these notes).
Task2Vec used some sort of variational approximation when estimating its Fisher-based embeddings; worth revisiting.
SWAG has something on approximating the loss geometry and the posterior via the training trajectory.
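For concreteness, a minimal sketch of the two estimators in PyTorch; it assumes `model` is a classifier that maps an input tensor to logits, and `diag_fisher` and its arguments are hypothetical names rather than anything from an existing codebase.

```python
import torch
import torch.nn.functional as F

def diag_fisher(model, examples, num_label_samples=None):
    """Diagonal Fisher of a classifier over a set of examples.

    num_label_samples=None takes the exact expectation over the model's
    predictive distribution p(y|x); an integer instead estimates it from
    that many sampled labels y ~ p(y|x) per example.
    """
    fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    for x in examples:  # one example at a time for clarity, not speed
        with torch.no_grad():
            probs = F.softmax(model(x.unsqueeze(0)), dim=-1).squeeze(0)

        if num_label_samples is None:
            # Exact: sum_y p(y|x) * (grad log p(y|x))^2
            labels = list(range(probs.shape[-1]))
            weights = probs.tolist()
        else:
            # Monte Carlo: average over labels sampled from p(y|x)
            labels = torch.multinomial(probs, num_label_samples, replacement=True).tolist()
            weights = [1.0 / num_label_samples] * num_label_samples

        for y, w in zip(labels, weights):
            model.zero_grad()
            log_probs = F.log_softmax(model(x.unsqueeze(0)), dim=-1).squeeze(0)
            log_probs[y].backward()
            for name, p in model.named_parameters():
                if p.grad is not None:
                    fisher[name] += w * p.grad.detach() ** 2

    return {name: f / len(examples) for name, f in fisher.items()}
```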
Regularizers during fine-tuning
None
Isotropic L2 distance from the pretrained checkpoint.
EWC-style (Fisher-weighted) L2 distance from the pretrained checkpoint.
Quick note: I may have seen performance gains from these regularizers in traditional single-task fine-tuning a while back; something to look out for. Both penalties are sketched below.
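A rough sketch of how either penalty could be added to the fine-tuning loss (PyTorch; `pretrained_params` is assumed to be a dict of detached checkpoint tensors, and all names are placeholders). The isotropic penalty is just the EWC-style penalty with the Fisher replaced by ones.

```python
import torch

def distance_penalty(model, pretrained_params, fisher=None, coeff=1e-3):
    """Penalty pulling fine-tuned weights back toward the pretrained checkpoint.

    fisher=None gives the isotropic L2 penalty; a dict of per-parameter
    diagonal Fishers gives the EWC-style weighted penalty.
    """
    penalty = 0.0
    for name, p in model.named_parameters():
        diff = p - pretrained_params[name]
        weight = fisher[name] if fisher is not None else 1.0
        penalty = penalty + (weight * diff ** 2).sum()
    return coeff * penalty

# During fine-tuning (task_loss computed as usual):
#   loss = task_loss + distance_penalty(model, pretrained_params, fisher=pretrained_fisher)
```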
Effect of number of fine-tuning steps on merging performance/boost.
Picking the best checkpoint for each task vs. fine-tuning for a fixed number of steps.
See what happens when we have a large data set (maybe QQP) where we need to fine-tune for a long time for optimal performance. (Probably explore this in concert with regularization.)
Scaling
BERT-base vs BERT-large (and perhaps BERT-small)
Different-sized SimCLR ResNets.
Choosing model weightings when merging.
Merging is very fast, so a brute-force grid search might be enough.
Grid search is exponential in the number of tasks, so possibly explore alternatives when doing a many-way merge (a two-model version of the search is sketched below).
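As a reference point, a minimal sketch of the two-model brute-force search. It assumes each model's parameters and diagonal Fisher are stored as name-to-array dicts, and that `evaluate` is a hypothetical function returning a validation score for a set of merged parameters.

```python
import numpy as np

def fisher_merge(params_list, fishers_list, weightings):
    """Fisher-weighted average of models:
    theta*[p] = sum_i w_i * F_i[p] * theta_i[p] / sum_i w_i * F_i[p]."""
    merged = {}
    for name in params_list[0]:
        numer = sum(w * f[name] * p[name]
                    for w, f, p in zip(weightings, fishers_list, params_list))
        denom = sum(w * f[name] for w, f in zip(weightings, fishers_list))
        merged[name] = numer / (denom + 1e-12)  # guard against zero Fisher entries
    return merged

def grid_search_weighting(params_list, fishers_list, evaluate, num_points=21):
    """Brute-force search over the relative weighting of two models."""
    best_w, best_score = None, -np.inf
    for w in np.linspace(0.0, 1.0, num_points):
        merged = fisher_merge(params_list, fishers_list, [w, 1.0 - w])
        score = evaluate(merged)  # hypothetical: validation metric on the target task
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

With K models the analogous grid has on the order of num_points^(K-1) points, which is where the alternatives mentioned above would come in.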
Choosing which models to merge.
See how task and domain similarity affects the boost and performance.
See if any models tend to be "good" in general to merge with and if any tend to be "bad" in general.
Choosing weightings for a many-model merge could address this.
Initializing fine-tuning with merged weights.
Single-task fine-tuning.
Will need further fine-tuning of the checkpoint (with no merging) as a baseline.
Multi-task fine-tuning
Are we going for single-task or multi-task performance?
Informing the mixing proportions via the merging weightings.
EWC regularizer anchored at the merged checkpoint.
Use the same weightings as were used in the merge?
Try using only the target task's Fisher for the regularizer (both variants are sketched after this list).
Freezing the body and fine-tuning classification heads to get a baseline for what improvement we could get by updating the heads during the merge.
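To make the two regularizer variants above concrete, a small sketch (hypothetical names; name-to-array dicts as in the merging sketch) of the precision that would multiply the squared distance from the merged checkpoint during further fine-tuning.

```python
def ewc_precision_from_merge(fishers_list, weightings, target_idx=None):
    """Per-parameter precision for an EWC penalty anchored at the merged weights.

    target_idx=None combines all constituent task Fishers using the merge
    weightings; an integer uses only that task's Fisher.
    """
    if target_idx is not None:
        return fishers_list[target_idx]
    return {name: sum(w * f[name] for w, f in zip(weightings, fishers_list))
            for name in fishers_list[0]}

# Penalty added to the fine-tuning loss, with theta the current parameters and
# theta_star the merged checkpoint used as the initialization:
#   coeff * sum over names of (precision[name] * (theta[name] - theta_star[name]) ** 2).sum()
```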
Informing mixing proportions
See if optimal task weightings from merging can inform optimal mixing proportions for traditional multi-task fine-tuning.
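A tiny sketch of one way to read this, under the assumption that the merge weightings can simply be normalized into per-task sampling probabilities for multi-task fine-tuning batches (names are placeholders).

```python
import numpy as np

def mixing_proportions(weightings):
    """Normalize merge weightings into sampling proportions over tasks."""
    w = np.asarray(weightings, dtype=np.float64)
    return w / w.sum()

# Example: weightings found by the merging search -> batch-sampling probabilities.
rng = np.random.default_rng(0)
proportions = mixing_proportions([0.7, 0.2, 0.1])
next_task = rng.choice(len(proportions), p=proportions)  # task for the next batch
```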
Undoing catastrophic forgetting
Add more details here, but the basic idea is to merge a model fine-tuned from a checkpoint back with the original checkpoint. We can also try merging with the MLM checkpoint with the goal of a performance boost on the fine-tuned task.
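As a concrete instance, a sketch of the two-way case: a per-parameter, Fisher-weighted interpolation between the fine-tuned model and the original checkpoint, with the coefficient swept to trade off recovering pretrained behavior against keeping fine-tuned performance. All names are placeholders; the checkpoint's Fisher could be computed on, e.g., the MLM objective, or set to ones for an isotropic merge.

```python
def merge_with_original_checkpoint(finetuned, ft_fisher, checkpoint, ckpt_fisher, lam):
    """Per-parameter Fisher-weighted interpolation between a fine-tuned model
    (weight lam) and its original pretrained checkpoint (weight 1 - lam)."""
    merged = {}
    for name in finetuned:
        numer = (lam * ft_fisher[name] * finetuned[name]
                 + (1.0 - lam) * ckpt_fisher[name] * checkpoint[name])
        denom = lam * ft_fisher[name] + (1.0 - lam) * ckpt_fisher[name]
        merged[name] = numer / (denom + 1e-12)
    return merged

# Sweep lam in [0, 1]: lam = 1 recovers the fine-tuned model, lam = 0 the checkpoint.
```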
Baselines
Performance of the model we are merging from (single-task baseline).
Freezing the body? (Needed if looking at multi-task performance. Not needed if looking at single-task boost.)
Multi-task fine-tuning.
Mixing proportions?
Applicability
Transformer on GLUE.
SimCLR on image tasks.
Pretrained ImageNet model? (less likely)
Performance boost on SoTA models
Using T5 as an example, take the SoTA fine-tuned checkpoints, compute their Fishers, and then do our merging procedure to boost performance.
Ideally, we'd have access to the SoTA fine-tuned checkpoints.
Other
Release common checkpoints with Fishers (or just the Fisher if the checkpoint is public)?
Community site for people to release their own?
Create an easy-to-use script for people to try this on their own models.