First paper experiment logs

mmatena commented 3 years ago

The idea is that I'll create a comment with a description and results from each experiment. I can move experiments into their own issue if so desired and replace their comment here with a link to the new issue.

mmatena commented 3 years ago

Impact of L2 regularization strength on fine-tuning performance

Set up

Done on RoBERTa-large, trained with batch size of 8 for 200k examples. Scores for best checkpoint. Single run.

Caveats

High resource tasks were undertrained.
There appeared to be some instability in training.

Results

λ L2	cola	mnli	mrpc	qnli	qqp	rte	sst2	stsb
*0.0*	65.7	87.5	90.7	91.9	86.9	83.8	95.3	90.8
*0.0003*	65.6	87.5	90.1	92.2	87.3	83.4	96.0	90.3
*0.01*	0.0	87.4	91.4	91.9	75.6	86.6	95.5	90.6

Notes

CoLA's 0.01 score might have been due to instability. (I haven't looked further into this.)
QQP's 0.01 score might have been actual effect or due to instability. (I haven't looked further into this.)

Take-aways

A regularization strength of 0.0003 for isotropic L2 regularization had essentially no impact on performance, positive or negative.
A regularization strength of 0.01 for isotropic L2 regularization may have boosted performance on some low resource tasks. It may have hurt performance on the high-resource QQP and potentially increased training instability.
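For reference, a minimal sketch of what an isotropic L2 penalty of strength λ could look like. It assumes the penalty pulls every parameter toward a fixed set of anchor weights (for example the pre-trained weights) with the same strength, and that no factor of 1/2 is included; neither detail is stated above. The function names are hypothetical and are not taken from the experiment code.

```python
import numpy as np

def isotropic_l2_penalty(params, anchor, strength):
    """Return strength * sum_i (theta_i - anchor_i)^2 over all parameter arrays.

    Assumption: the penalty is measured relative to `anchor` (e.g. the
    pre-trained weights). Pass zero arrays as `anchor` for plain weight decay.
    """
    total = 0.0
    for p, a in zip(params, anchor):
        total += float(np.sum((p - a) ** 2))
    return strength * total

def isotropic_l2_grad(params, anchor, strength):
    """Gradient of the penalty above with respect to each parameter array."""
    return [2.0 * strength * (p - a) for p, a in zip(params, anchor)]
```

With this form, the three rows of the table would correspond to `strength` values of 0.0, 0.0003, and 0.01.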

mmatena commented 3 years ago

Impact of L2 regularization strength on relative performance on merging

Set up

RoBERTa-large
Batch size of 8 for 200k examples.
Compared the best original checkpoint with the best checkpoint obtained by merging it with another checkpoint.

Caveats

Everything was trained for at most 200k examples. Regularizers might become more important in longer training regimes.
There appeared to be some instability in training.

Results

λ L2	cola	mnli	mrpc	qnli	qqp	rte	sst2	stsb	Average
*0.0*	100.1	100.6	101.0	100.8	100.4	102.6	101.1	100.0	100.8
*0.0003*	99.6	100.9	100.7	100.5	100.2	102.6	100.4	99.9	100.6
*0.01*	N/A	100.5	100.0	100.6	100.3	102.1	100.8	100.0	100.6

Take-aways

For regularization strengths of 0.01 or lower, it appears that isotropic L2 regularization has no effect on merging performance.
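The values in the Results table cluster around 100, which suggests they are the merged score expressed as a percentage of the original score. A minimal sketch under that assumption follows; `relative_score` is a hypothetical helper, not part of the experiment code. Returning `None` for a zero original score would be consistent with the N/A entry for CoLA at 0.01, where the original checkpoint scored 0.0.

```python
def relative_score(merged, original):
    """Merged score as a percentage of the original score.

    Assumption: "relative performance" means 100 * merged / original.
    Returns None when the original score is zero, since the ratio is undefined.
    """
    if original == 0:
        return None
    return 100.0 * merged / original
```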

mmatena commented 3 years ago

[PHASE I] Impact of EWC regularization strength on fine-tuning performance

Goal

Preliminary experiment to get an idea for the range of regularization strengths needed for EWC.

Set up

BERT-base
Only SST-2.
Batch size of 16 for 131k examples (about two epochs).

Caveats

The results will depend on the scale of the Fisher matrix itself, so they may not necessarily be valid for other models.
Results may vary for low and very high resource tasks as SST-2 is medium to high resource with 67k train examples.
- Low-resource tasks might benefit from stronger regularization.
- Very high-resource tasks might be more negatively impacted by stronger regularization.
Results might be different when training for more epochs. I trained for about two here.

Results

Accuracy of best checkpoint on SST-2

	0.0	1e-05	0.0001	0.001	0.01	0.1	1.0	10.0	100.0
*train*	96.6	96.3	96.5	96.6	96.5	96.3	96.4	95.3	92.2
*dev*	92.9	92.8	92.8	92.9	92.4	92.9	92.5	92.1	91.7

Take-aways

EWC regularization strengths of 1.0 and below did not affect train or validation accuracy of SST-2.
EWC regularization strengths of 10 and 100 hurt SST-2 validation accuracy slightly to modestly while train accuracy was more significantly impacted, especially with the 100 strength.
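For reference, a minimal sketch of the standard EWC penalty with a diagonal Fisher. It assumes the penalty has the usual form λ Σ F_i (θ_i − θ*_i)² with no factor of 1/2; the exact form used in these runs is not stated above, and `ewc_penalty` is a hypothetical name. The sketch also illustrates the first caveat: rescaling the Fisher is equivalent to rescaling the strength, so a useful range of strengths depends on the Fisher's scale.

```python
import numpy as np

def ewc_penalty(params, anchor, fisher_diag, strength):
    """Return strength * sum_i F_i * (theta_i - anchor_i)^2.

    params, anchor, fisher_diag: lists of arrays with matching shapes.
    Assumption: `fisher_diag` holds a diagonal Fisher estimate at `anchor`.
    """
    total = 0.0
    for p, a, f in zip(params, anchor, fisher_diag):
        total += float(np.sum(f * (p - a) ** 2))
    return strength * total

# Scaling the Fisher by c gives the same penalty as scaling the strength by c.
theta = [np.array([1.0, -2.0, 0.5])]
theta_star = [np.zeros(3)]
fisher = [np.array([0.1, 0.2, 0.3])]
scaled_fisher = [10.0 * fisher[0]]
same = np.isclose(
    ewc_penalty(theta, theta_star, scaled_fisher, 1.0),
    ewc_penalty(theta, theta_star, fisher, 10.0),
)
```

With an all-ones Fisher, this reduces to an isotropic L2 penalty toward the anchor.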

mmatena commented 3 years ago

Impact of regularization on fine-tuning performance

Set-up

BERT-base
Batch size of 16.
Trained for 4 epochs, saved checkpoints every 1/2 epoch.
Results for best checkpoint by validation performance.
Tried EWC and isotropic L2 regularization.

Caveats

TODO

Results

Isotropic L2

	cola	mnli	mrpc	qnli	qqp	rte	sst2	stsb	Average
*None*	57.9	83.8	83.5	90.6	89.7	66.1	92.7	86.0	81.3
*0.0003*	59.4	83.8	84.9	90.6	90.4	63.5	92.3	83.9	81.1
*0.01*	59.5	81.6	83.0	90.0	85.7	60.6	92.0	85.3	79.7
*0.1*	53.6	75.4	83.3	86.0	81.3	61.0	91.7	82.8	76.9

EWC

	cola	mnli	mrpc	qnli	qqp	rte	sst2	stsb	Average
*None*	57.9	83.8	83.5	90.6	89.7	66.1	92.7	86.0	81.3
*0.03*	58.4	83.9	82.2	90.6	90.2	63.5	92.9	85.4	80.9
*1.0*	57.4	83.8	83.2	90.6	89.6	61.7	92.3	83.8	80.3
*10.0*	58.5	82.4	86.4	90.3	87.2	63.2	92.3	84.0	80.5
*100.0*	57.6	76.9	83.3	87.5	83.4	61.0	91.5	83.2	78.1

Takeaways

TODO

mmatena commented 3 years ago

Impact of regularization on informed pairwise merging absolute performance

Results

Isotropic L2

Performance

iso	cola	mnli	mrpc	qnli	qqp	rte	sst2	stsb	Average
*0.0*	59.6	84.2	83.3	90.9	89.5	66.4	93.0	85.6	81.6
*0.0003*	59.3	84.9	N/A	89.5	89.1	70.8	92.1	81.3	N/A
*0.01*	59.9	81.4	83.7	89.5	85.5	69.7	92.2	84.7	80.8
*0.1*	53.4	75.1	83.2	85.8	81.0	63.9	91.3	81.5	76.9

Change from original

iso	cola	mnli	mrpc	qnli	qqp	rte	sst2	stsb
*0.0*	1.7	0.4	-0.2	0.3	-0.2	0.4	0.3	-0.4
*0.0003*	-0.1	0.0	N/A	-1.0	-1.2	7.2	-0.2	-2.6
*0.01*	0.4	-0.2	0.6	-0.5	-0.2	9.0	0.2	-0.5
*0.1*	-0.3	0.0	-0.1	-0.2	-0.3	2.9	-0.5	-1.2

EWC

Performance

ewc	cola	mnli	mrpc	qnli	qqp	rte	sst2	stsb	Average
*0.0*	59.6	84.2	83.3	90.9	89.5	66.4	93.0	85.6	81.6
*0.03*	58.3	83.8	83.9	90.9	90.3	66.1	92.2	85.6	81.4
*1.0*	58.0	84.0	83.8	90.4	89.5	68.6	92.9	83.5	81.3
*10.0*	57.9	82.7	86.8	90.1	86.9	72.2	92.3	84.2	81.6
*100.0*	57.4	76.8	83.7	87.0	82.9	66.4	91.3	84.3	78.7

Change from original

ewc	cola	mnli	mrpc	qnli	qqp	rte	sst2	stsb
*0.0*	1.7	0.4	-0.2	0.3	-0.2	0.4	0.3	-0.4
*0.03*	-0.1	-0.1	1.7	0.3	0.0	2.5	-0.7	0.2
*1.0*	0.6	0.2	0.6	-0.2	-0.1	6.9	0.6	-0.3
*10.0*	-0.6	0.3	0.4	-0.1	-0.3	9.0	0.0	0.1
*100.0*	-0.1	-0.1	0.4	-0.6	-0.5	5.4	-0.2	1.0

mmatena commented 3 years ago

Merging best RTE checkpoint with each MNLI checkpoint from training

Results

Columns are MNLI checkpoint indices. Each index corresponds to training on half an epoch of MNLI.

Isotropic L2

Performance

iso	0	1	2	3	4	5	6	7
*0.0*	66.4	66.8	66.1	66.4	66.4	66.1	66.4	66.1	66.4
*0.0003*	71.1	72.2	70.8	72.6	74.0	71.8	70.8	70.8	70.8
*0.01*	66.8	68.6	70.0	67.9	67.9	67.9	69.7	69.7	70.4
*0.1*	60.6	59.9	60.3	61.4	61.4	62.1	63.9	63.9	63.9

Change from original

iso	0	1	2	3	4	5	6	7
*0.0*	0.4	0.7	0.0	0.4	0.4	0.0	0.4	0.0	0.4
*0.0003*	7.6	8.7	7.2	9.0	10.5	8.3	7.2	7.2	7.2
*0.01*	6.1	7.9	9.4	7.2	7.2	7.2	9.0	9.0	9.7
*0.1*	-0.4	-1.1	-0.7	0.4	0.4	1.1	2.9	2.9	2.9

EWC

Performance

ewc	0	1	2	3	4	5	6	7
*0.0*	66.4	66.8	66.1	66.4	66.4	66.1	66.4	66.1	66.4
*0.03*	63.9	64.3	66.1	66.1	64.6	64.6	62.5	63.2	62.1
*1.0*	68.6	67.5	69.0	66.8	68.6	68.6	67.5	70.0	68.2
*10.0*	68.2	70.0	71.1	70.4	72.2	72.2	71.1	71.1	72.9
*100.0*	65.0	65.7	66.4	66.4	65.0	64.6	65.3	64.3	63.2

Change from original

ewc	0	1	2	3	4	5	6	7
*0.0*	0.4	0.7	0.0	0.4	0.4	0.0	0.4	0.0	0.4
*0.03*	0.4	0.7	2.5	2.5	1.1	1.1	-1.1	-0.4	-1.4
*1.0*	6.9	5.8	7.2	5.1	6.9	6.9	5.8	8.3	6.5
*10.0*	5.1	6.9	7.9	7.2	9.0	9.0	7.9	7.9	9.7
*100.0*	4.0	4.7	5.4	5.4	4.0	3.6	4.3	3.2	2.2

mmatena commented 3 years ago

Merging best MRPC checkpoint with each MNLI checkpoint from training

Results

Columns are MNLI checkpoint indices. Each index corresponds to training on half an epoch of MNLI.

Isotropic L2

Performance

iso	0	1	2	3	4	5	6	7
*0.0*	83.6	83.9	83.4	83.2	83.6	83.6	83.4	83.2
*0.0003*	85.3	85.5	85.1	85.1	84.9	84.7	84.5	84.7
*0.01*	83.1	83.1	82.9	83.1	83.1	83.3	82.7	83.1
*0.1*	83.5	82.9	82.9	83.8	83.2	83.3	83.2	84.0

Change from original

iso	0	1	2	3	4	5	6	7
*0.0*	0.1	0.4	-0.1	-0.3	0.1	0.0	-0.1	-0.4
*0.0003*	0.4	0.6	0.2	0.2	0.0	-0.2	-0.3	-0.2
*0.01*	0.1	0.0	-0.1	0.0	0.0	0.2	-0.3	0.0
*0.1*	0.2	-0.4	-0.4	0.5	-0.1	0.0	-0.1	0.7

EWC

Performance

ewc	0	1	2	3	4	5	6	7
*0.0*	83.6	83.9	83.4	83.2	83.6	83.6	83.4	83.2
*0.03*	83.8	83.7	83.3	83.2	83.3	83.4	83.2	83.3
*1.0*	84.0	84.3	83.5	83.7	83.8	83.7	83.2	83.7
*10.0*	87.3	87.3	86.7	87.2	86.3	87.3	87.1	87.2
*100.0*	82.0	82.4	82.3	83.1	83.1	83.1	83.1	83.1

Change from original

ewc	0	1	2	3	4	5	6	7
*0.0*	0.1	0.4	-0.1	-0.3	0.1	0.0	-0.1	-0.4
*0.03*	1.6	1.4	1.1	1.0	1.0	1.2	1.0	1.0
*1.0*	0.8	1.2	0.3	0.5	0.6	0.5	0.0	0.6
*10.0*	0.9	0.9	0.3	0.9	-0.1	0.9	0.7	0.9
*100.0*	-1.3	-0.9	-1.0	-0.2	-0.2	-0.2	-0.2	-0.2

mmatena commented 3 years ago

Merging all checkpoints of an RTE run with all checkpoints of an MNLI run

Both runs had isotropic L2 regularization with a strength of 0.0003. Each checkpoint index corresponds to another half epoch of training, which consists of a different number of examples for each task.

Results

Performance

v target \ donor >	0	1	2	3	4	5	6	7
0	69.7	70.4	73.3	69.3	73.6	68.2	71.5	70.0
1	71.1	72.6	74.7	71.5	74.4	73.3	72.2	70.0
2	71.5	71.1	74.4	71.8	73.6	72.2	72.2	70.8
3	69.0	69.7	70.4	72.9	73.3	71.8	69.7	69.0
4	71.1	72.2	70.8	72.6	74.0	71.8	70.8	70.8
5	71.1	71.1	71.5	72.9	74.4	72.9	71.1	70.4
6	70.0	70.8	70.8	72.6	74.0	72.2	71.1	70.0
7	71.1	71.8	72.2	72.9	74.4	73.6	71.8	71.1

Change from original for each target checkpoint

v target \ donor >	0	1	2	3	4	5	6	7
0	11.9	12.6	15.5	11.6	15.9	10.5	13.7	12.3
1	8.7	10.1	12.3	9.0	11.9	10.8	9.7	7.6
2	11.6	11.2	14.4	11.9	13.7	12.3	12.3	10.8
3	9.7	10.5	11.2	13.7	14.1	12.6	10.5	9.7
4	7.6	8.7	7.2	9.0	10.5	8.3	7.2	7.2
5	11.2	11.2	11.6	13.0	14.4	13.0	11.2	10.5
6	11.6	12.3	12.3	14.1	15.5	13.7	12.6	11.6
7	8.3	9.0	9.4	10.1	11.6	10.8	9.0	8.3

Change from original target checkpoint with highest score

v target \ donor >	0	1	2	3	4	5	6	7
0	6.1	6.9	9.7	5.8	10.1	4.7	7.9	6.5
1	7.6	9.0	11.2	7.9	10.8	9.7	8.7	6.5
2	7.9	7.6	10.8	8.3	10.1	8.7	8.7	7.2
3	5.4	6.1	6.9	9.4	9.7	8.3	6.1	5.4
4	7.6	8.7	7.2	9.0	10.5	8.3	7.2	7.2
5	7.6	7.6	7.9	9.4	10.8	9.4	7.6	6.9
6	6.5	7.2	7.2	9.0	10.5	8.7	7.6	6.5
7	7.6	8.3	8.7	9.4	10.8	10.1	8.3	7.6
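A minimal sketch of how a target/donor grid like the ones above could be produced, along with the two "change from original" views. It assumes merging is a diagonal-Fisher-weighted average of the two checkpoints' parameters, which is not spelled out in this comment. `fisher_merge`, `sweep`, and the `evaluate` callback are hypothetical names, and each checkpoint is represented as a single flat array for brevity.

```python
import numpy as np

def fisher_merge(target, donor, fisher_target, fisher_donor, eps=1e-12):
    """Per-parameter weighted average, with diagonal Fishers as the weights."""
    numerator = fisher_target * target + fisher_donor * donor
    denominator = fisher_target + fisher_donor + eps  # eps avoids division by zero
    return numerator / denominator

def sweep(targets, donors, fisher_targets, fisher_donors, evaluate):
    """Score every (target, donor) pair; rows are targets, columns are donors."""
    grid = np.empty((len(targets), len(donors)))
    for i, (t, ft) in enumerate(zip(targets, fisher_targets)):
        for j, (d, fd) in enumerate(zip(donors, fisher_donors)):
            grid[i, j] = evaluate(fisher_merge(t, d, ft, fd))
    return grid

def change_per_target(grid, original_scores):
    """Subtract each target checkpoint's own unmerged score from its row."""
    return grid - np.asarray(original_scores)[:, None]

def change_from_best(grid, original_scores):
    """Subtract the single best unmerged target score from every cell."""
    return grid - np.max(original_scores)
```

The second and third tables above differ only in which original score is subtracted, which is why the row for the best target checkpoint is identical in both.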

mmatena commented 3 years ago

Merging best MRPC & RTE checkpoint with the best MNLI checkpoint from the isotropic L2 0.0003 run

Results

Isotropic L2

Performance

iso	mrpc	rte
*0.0*	83.1	64.6
*0.0003*	84.7	70.8
*0.01*	82.9	70.8
*0.1*	82.9	68.6

Change from original

iso	mrpc	rte
*0.0*	-0.4	-1.4
*0.0003*	-0.2	7.2
*0.01*	-0.1	10.1
*0.1*	-0.4	7.6

EWC

Performance

ewc	mrpc	rte
*0.0*	83.1	64.6
*0.03*	83.3	61.7
*1.0*	82.7	67.1
*10.0*	85.1	72.6
*100.0*	82.2	66.4

Change from original

ewc	mrpc	rte
*0.0*	-0.4	-1.4
*0.03*	1.0	-1.8
*1.0*	-0.4	5.4
*10.0*	-1.3	9.4
*100.0*	-1.1	5.4

mmatena commented 3 years ago

Variational diagonal computation

Set-up

BERT-base
Diagonal Fisher
RTE and MNLI checkpoints with best eval score.
Used only the checkpoints with isotropic regularization strength of 0.0003.
An epoch was defined as 4096 examples, regardless of dataset size and the number of Fisher examples.

Results

The merged model for this pair of checkpoints had a score of 70.8 on RTE.

Performance

Examples	4096	4096	4096	4096	4096	4096
Beta	1e-08	1e-08	1e-07	1e-07	1e-06	1e-06
Fisher epoch ⇩\ LR ⇨	0.001	0.01	0.001	0.01	0.001	0.01
0	72.2	71.8	72.2	72.9	72.2	71.5
1	72.2	72.2	72.2	73.3	72.2	72.6
2	72.9	72.2	71.5	72.2	72.2	72.6
3	72.9	72.2	71.8	71.5	72.2	72.6
4	72.9	72.2	72.2	72.6	72.2	72.2
5	72.9	72.2	71.8	71.8	71.5	71.8
6	72.9	72.6	72.2	71.5	71.8	71.8
7	72.9	72.2	72.2	71.5	72.2	71.5
8	72.9	72.6	72.2	72.2	72.2	71.1
9	72.9	72.6	72.6	72.2	72.2	71.1
10	72.9	72.6	72.9	72.2	72.2	70.8
11	72.9	72.2	72.6	71.8	72.2	70.8
12	72.9	72.9	72.6	71.8	71.8	71.1
13	72.6	72.6	72.2	72.2	71.8	71.5
14	72.2	72.2	71.8	72.2	71.8	71.8
15	72.2	72.6	72.2	71.8	71.8	71.8

Difference from direct estimation

Examples	4096	4096	4096	4096	4096	4096
Beta	1e-08	1e-08	1e-07	1e-07	1e-06	1e-06
Fisher epoch ⇩\ LR ⇨	0.001	0.01	0.001	0.01	0.001	0.01
0	1.4	1.0	1.4	2.1	1.4	0.7
1	1.4	1.4	1.4	2.5	1.4	1.8
2	2.1	1.4	0.7	1.4	1.4	1.8
3	2.1	1.4	1.0	0.7	1.4	1.8
4	2.1	1.4	1.4	1.8	1.4	1.4
5	2.1	1.4	1.0	1.0	0.7	1.0
6	2.1	1.8	1.4	0.7	1.0	1.0
7	2.1	1.4	1.4	0.7	1.4	0.7
8	2.1	1.8	1.4	1.4	1.4	0.3
9	2.1	1.8	1.8	1.4	1.4	0.3
10	2.1	1.8	2.1	1.4	1.4	-0.0
11	2.1	1.4	1.8	1.0	1.4	-0.0
12	2.1	2.1	1.8	1.0	1.0	0.3
13	1.8	1.8	1.4	1.4	1.0	0.7
14	1.4	1.4	1.0	1.4	1.0	1.0
15	1.4	1.8	1.4	1.0	1.0	1.0

Takeaways (based only on the 4096 MNLI example Fishers)

Variational Fisher computation is better than direct: it matched or beat direct estimation in every setting tried here, by up to 2.5 points.
I should try again with lower betas and lower learning rates.
We can get away with far fewer than 16 epochs of size 4096 (though this in particular might change when I compute the MNLI Fisher on more examples).
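For context, a minimal sketch of what "direct" diagonal Fisher estimation usually means: averaging the squared gradient of the log-likelihood over examples, with the expectation over labels taken under the model's own predictive distribution. It uses a toy linear softmax classifier so that it stays self-contained. The function names are hypothetical, and whether the runs above took an exact expectation over labels or sampled them is an assumption. The variational procedure is not specified in this thread, so it is not sketched here.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def direct_diagonal_fisher(weights, inputs):
    """Diagonal Fisher of a linear softmax classifier.

    weights: (d, k) array, inputs: (n, d) array. Returns a (d, k) array
    holding E_x E_{y ~ p(y|x)} [ (d log p(y|x) / d W)^2 ].
    """
    probs = softmax(inputs @ weights)          # (n, k)
    n, k = probs.shape
    fisher = np.zeros_like(weights, dtype=float)
    for y in range(k):
        onehot = np.zeros(k)
        onehot[y] = 1.0
        # Gradient of log p(y|x) w.r.t. W[d, j] is x_d * (1[j == y] - p_j).
        grad = inputs[:, :, None] * (onehot - probs)[:, None, :]  # (n, d, k)
        # Weight each squared gradient by the model's probability of label y.
        fisher += np.sum(probs[:, y, None, None] * grad ** 2, axis=0)
    return fisher / n
```

In this picture, the "Examples" row in the tables would correspond to `n`, the number of inputs the estimate is averaged over.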

Todos

Run variational Fisher with more MNLI examples.
Run direct Fisher with more MNLI examples.
Run variational Fisher with lower beta and learning rate.
- For sure with 4096 examples.
- With more than 4096 examples, wait until results come in.
Run experiments to tell whether the variational vs direct approach is particularly important on the target vs donor.
Try best set up for the best MNLI-RTE pair found earlier with 4096 example direct.

mmatena commented 3 years ago

Examples	4096	4096	4096	4096	4096	4096	32768	32768	32768	32768	32768	32768	262144	262144	262144	262144	262144	262144
Beta	1e-08	1e-08	1e-07	1e-07	1e-06	1e-06	1e-08	1e-08	1e-07	1e-07	1e-06	1e-06	1e-08	1e-08	1e-07	1e-07	1e-06	1e-06
LR	0.001	0.01	0.001	0.01	0.001	0.01	0.001	0.01	0.001	0.01	0.001	0.01	0.001	0.01	0.001	0.01	0.001	0.01
0	72.2	71.8	72.2	72.9	72.2	71.5	72.9	72.2	52.7	72.2	71.5	72.2	72.2	73.3	72.2	71.8
1	72.2	72.2	72.2	73.3	72.2	72.6	72.6	72.2	52.7	72.2	72.6	72.2	72.6	72.6	72.2	72.6
2	72.9	72.2	71.5	72.2	72.2	72.6	73.3	71.5	52.7	72.2	72.6	72.6	72.6	72.2	72.2	72.6
3	72.9	72.2	71.8	71.5	72.2	72.6	72.9	71.8	52.7	71.8	72.6	72.9	72.9	72.6	71.8	72.6
4	72.9	72.2	72.2	72.6	72.2	72.2	73.3	72.2	52.7	72.2	71.8	72.9	72.9	72.2	71.8	72.2
5	72.9	72.2	71.8	71.8	71.5	71.8	73.3	71.8	52.7	71.5	71.8	72.6	72.2	72.6	71.5	71.8
6	72.9	72.6	72.2	71.5	71.8	71.8	72.6	72.2	52.7	71.8	71.8	72.6	72.6	71.8	71.8	71.8
7	72.9	72.2	72.2	71.5	72.2	71.5	72.2	72.2	52.7	72.2	71.5	72.6	72.6	71.8	72.2	71.5
8	72.9	72.6	72.2	72.2	72.2	71.1	71.8	72.6	52.7	72.2	71.5	72.6	72.6	72.2	72.2	71.5
9	72.9	72.6	72.6	72.2	72.2	71.1	71.8	72.6	52.7	72.2	71.1	72.6	72.9	72.6	72.2	71.1
10	72.9	72.6	72.9	72.2	72.2	70.8	72.2	72.6	52.7	72.6	71.1	72.9	73.3	72.6	72.2	70.8
11	72.9	72.2	72.6	71.8	72.2	70.8	72.6	72.6	52.7	72.6	71.1	72.6	72.9	71.8	72.2	70.8
12	72.9	72.9	72.6	71.8	71.8	71.1	72.6	72.6	52.7	71.8	71.5	72.6	73.6	71.8	72.2	71.1
13	72.6	72.6	72.2	72.2	71.8	71.5	72.6	72.2	52.7	72.2	71.5	72.6	73.6	71.8	72.6	71.5
14	72.2	72.2	71.8	72.2	71.8	71.8	72.2	71.8	52.7	72.2	71.8	72.6	73.6	72.2	72.2	71.8
15	72.2	72.6	72.2	71.8	71.8	71.8	71.8	72.2	52.7	71.8	71.8	72.2	73.3	72.2	72.2	71.8

mmatena commented 3 years ago

Performance

Examples	8192	8192	8192	8192	8192	8192
Beta	1e-10	1e-10	1e-10	1e-09	1e-09	1e-09
LR	1e-05	0.0001	0.001	1e-05	0.0001	0.001
0	71.1	71.1	72.2	71.1	71.5	72.2
1	71.1	72.2	72.6	71.1	72.2	72.6

Difference from direct estimation

Examples	8192	8192	8192	8192	8192	8192
Beta	1e-10	1e-10	1e-10	1e-09	1e-09	1e-09
LR	1e-05	0.0001	0.001	1e-05	0.0001	0.001
0	0.3	0.3	1.4	0.3	0.7	1.4
1	0.3	1.4	1.8	0.3	1.4	1.8
