mmatena / m251


First paper experiment logs #13

Open mmatena opened 3 years ago

mmatena commented 3 years ago

The idea is that I'll create one comment per experiment, each with a description and results. If needed, I can move an experiment into its own issue and replace its comment here with a link to the new issue.

mmatena commented 3 years ago

Impact of L2 regularization strength on fine-tuning performance

Set up

Fine-tuned RoBERTa-large with a batch size of 8 for 200k examples. Scores are for the best checkpoint of a single run.

Caveats

Results

λ L2 cola mnli mrpc qnli qqp rte sst2 stsb
0.0 65.7 87.5 90.7 91.9 86.9 83.8 95.3 90.8
0.0003 65.6 87.5 90.1 92.2 87.3 83.4 96.0 90.3
0.01 0.0 87.4 91.4 91.9 75.6 86.6 95.5 90.6

Notes

Take-aways

mmatena commented 3 years ago

Impact of L2 regularization strength on relative performance on merging

Set up

Caveats

Results

λ L2 cola mnli mrpc qnli qqp rte sst2 stsb Average
0.0 100.1 100.6 101.0 100.8 100.4 102.6 101.1 100.0 100.8
0.0003 99.6 100.9 100.7 100.5 100.2 102.6 100.4 99.9 100.6
0.01 N/A 100.5 100.0 100.6 100.3 102.1 100.8 100.0 100.6

Take-aways

mmatena commented 3 years ago

[PHASE I] Impact of EWC regularization strength on fine-tuning performance

Goal

Preliminary experiment to get an idea for the range of regularization strengths needed for EWC.
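
The log does not write out the penalty being swept here. As a minimal sketch, assuming the usual quadratic form anchored at the pre-trained weights, EWC weights each squared deviation by a diagonal Fisher entry, and isotropic L2 is the special case where every weight is 1. The function name, the 0.5 factor, and the toy numbers are illustrative assumptions, not taken from the repository.

```python
def quadratic_penalty(params, anchor, strength, fisher=None):
    """Return 0.5 * strength * sum_i w_i * (params_i - anchor_i)**2.

    With fisher=None every weight w_i is 1 (isotropic L2). Otherwise w_i
    is the diagonal Fisher entry for parameter i (EWC-style penalty).
    All inputs are flat lists of floats; this layout is an assumption.
    """
    if fisher is None:
        fisher = [1.0] * len(params)
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher)
    )


# Toy values, chosen only to make the arithmetic easy to follow.
params = [0.5, -1.0, 2.0]   # current fine-tuned weights
anchor = [0.0, -1.0, 1.0]   # weights the penalty pulls toward
fisher = [4.0, 1.0, 0.1]    # hypothetical diagonal Fisher entries

iso = quadratic_penalty(params, anchor, strength=0.01)
ewc = quadratic_penalty(params, anchor, strength=0.01, fisher=fisher)
print(iso, ewc)
```

In this form the same strength value gives different penalties under the two schemes, because the Fisher entries rescale each coordinate.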

Set up

Caveats

Results

Accuracy of best checkpoint on SST-2

0.0 1e-05 0.0001 0.001 0.01 0.1 1.0 10.0 100.0
train 96.6 96.3 96.5 96.6 96.5 96.3 96.4 95.3 92.2
dev 92.9 92.8 92.8 92.9 92.4 92.9 92.5 92.1 91.7

Take-aways

mmatena commented 3 years ago

Impact of regularization on fine-tuning performance

Set-up

Caveats

Results

Isotropic L2

cola mnli mrpc qnli qqp rte sst2 stsb Average
None 57.9 83.8 83.5 90.6 89.7 66.1 92.7 86.0 81.3
0.0003 59.4 83.8 84.9 90.6 90.4 63.5 92.3 83.9 81.1
0.01 59.5 81.6 83.0 90.0 85.7 60.6 92.0 85.3 79.7
0.1 53.6 75.4 83.3 86.0 81.3 61.0 91.7 82.8 76.9

EWC

cola mnli mrpc qnli qqp rte sst2 stsb Average
None 57.9 83.8 83.5 90.6 89.7 66.1 92.7 86.0 81.3
0.03 58.4 83.9 82.2 90.6 90.2 63.5 92.9 85.4 80.9
1.0 57.4 83.8 83.2 90.6 89.6 61.7 92.3 83.8 80.3
10.0 58.5 82.4 86.4 90.3 87.2 63.2 92.3 84.0 80.5
100.0 57.6 76.9 83.3 87.5 83.4 61.0 91.5 83.2 78.1
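
The Average column appears to be the unweighted mean of the eight task scores. A quick check on the unregularized row, with scores copied from the tables above:

```python
TASKS = ["cola", "mnli", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb"]

# "None" row (no regularization), as reported in the tables above.
unregularized = [57.9, 83.8, 83.5, 90.6, 89.7, 66.1, 92.7, 86.0]


def macro_average(scores):
    """Unweighted mean over tasks."""
    return sum(scores) / len(scores)


average = macro_average(unregularized)
print(f"{average:.1f}")  # prints 81.3, matching the Average column
```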

Takeaways

mmatena commented 3 years ago

Impact of regularization on informed pairwise merging absolute performance
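
The log does not spell out the merge rule. One common way to merge two fine-tuned checkpoints is a per-parameter weighted average, where the weights come from a diagonal Fisher estimate and uniform weights reduce to plain parameter averaging. The sketch below assumes that form; `merge`, its arguments, and the toy values are hypothetical.

```python
def merge(target, donor, target_weight=None, donor_weight=None, eps=1e-12):
    """Per-parameter weighted average of two flat parameter lists.

    With no weights given, this is a plain average of the two models.
    With diagonal Fisher estimates as weights, each parameter leans
    toward the model whose Fisher entry for it is larger.
    """
    n = len(target)
    tw = target_weight if target_weight is not None else [1.0] * n
    dw = donor_weight if donor_weight is not None else [1.0] * n
    # eps guards against a zero denominator when both weights are zero.
    return [
        (a * t + b * d) / (a + b + eps)
        for t, d, a, b in zip(target, donor, tw, dw)
    ]


target = [1.0, 2.0]
donor = [3.0, 0.0]

plain = merge(target, donor)
weighted = merge(target, donor, target_weight=[3.0, 1.0], donor_weight=[1.0, 1.0])
print(plain, weighted)
```

In the toy example the first parameter moves from the midpoint 2.0 toward the target's value 1.0, because the target's weight for it is three times the donor's.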

Results

Isotropic L2

Performance
iso cola mnli mrpc qnli qqp rte sst2 stsb Average
0.0 59.6 84.2 83.3 90.9 89.5 66.4 93.0 85.6 81.6
0.0003 59.3 84.9 N/A 89.5 89.1 70.8 92.1 81.3 N/A
0.01 59.9 81.4 83.7 89.5 85.5 69.7 92.2 84.7 80.8
0.1 53.4 75.1 83.2 85.8 81.0 63.9 91.3 81.5 76.9
Change from original
iso cola mnli mrpc qnli qqp rte sst2 stsb
0.0 1.7 0.4 -0.2 0.3 -0.2 0.4 0.3 -0.4
0.0003 -0.1 0.0 N/A -1.0 -1.2 7.2 -0.2 -2.6
0.01 0.4 -0.2 0.6 -0.5 -0.2 9.0 0.2 -0.5
0.1 -0.3 0.0 -0.1 -0.2 -0.3 2.9 -0.5 -1.2
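
The "Change from original" rows look like the merged score minus the fine-tuned score at the same regularization strength. The check below does this for the 0.0 row, taking the "None" row of the fine-tuning table as the original (an assumption based on the numbers lining up).

```python
TASKS = ["cola", "mnli", "mrpc", "qnli", "qqp", "rte", "sst2", "stsb"]

# Fine-tuned scores without regularization ("None" row of the fine-tuning table).
original = dict(zip(TASKS, [57.9, 83.8, 83.5, 90.6, 89.7, 66.1, 92.7, 86.0]))
# Merged scores, 0.0 row of the Performance table.
merged = dict(zip(TASKS, [59.6, 84.2, 83.3, 90.9, 89.5, 66.4, 93.0, 85.6]))
# Reported changes, 0.0 row of the "Change from original" table.
reported = dict(zip(TASKS, [1.7, 0.4, -0.2, 0.3, -0.2, 0.4, 0.3, -0.4]))

change = {t: round(merged[t] - original[t], 1) for t in TASKS}

# Tasks where the recomputed change differs from the reported one.
mismatches = {
    t: (change[t], reported[t])
    for t in TASKS
    if abs(change[t] - reported[t]) > 1e-9
}
print(mismatches)  # {'rte': (0.3, 0.4)}
```

Seven of the eight tasks reproduce exactly. The 0.1 gap on rte is consistent with the reported changes having been computed from unrounded scores, though the log does not confirm this.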

EWC

Performance
ewc cola mnli mrpc qnli qqp rte sst2 stsb Average
0.0 59.6 84.2 83.3 90.9 89.5 66.4 93.0 85.6 81.6
0.03 58.3 83.8 83.9 90.9 90.3 66.1 92.2 85.6 81.4
1.0 58.0 84.0 83.8 90.4 89.5 68.6 92.9 83.5 81.3
10.0 57.9 82.7 86.8 90.1 86.9 72.2 92.3 84.2 81.6
100.0 57.4 76.8 83.7 87.0 82.9 66.4 91.3 84.3 78.7
Change from original
ewc cola mnli mrpc qnli qqp rte sst2 stsb
0.0 1.7 0.4 -0.2 0.3 -0.2 0.4 0.3 -0.4
0.03 -0.1 -0.1 1.7 0.3 0.0 2.5 -0.7 0.2
1.0 0.6 0.2 0.6 -0.2 -0.1 6.9 0.6 -0.3
10.0 -0.6 0.3 0.4 -0.1 -0.3 9.0 0.0 0.1
100.0 -0.1 -0.1 0.4 -0.6 -0.5 5.4 -0.2 1.0

mmatena commented 3 years ago

Merging best RTE checkpoint with each MNLI checkpoint from training

Results

Columns are MNLI checkpoint indices. Each index corresponds to training on half an epoch of MNLI.

Isotropic L2

Performance
iso 0 1 2 3 4 5 6 7
0.0 66.4 66.8 66.1 66.4 66.4 66.1 66.4 66.1 66.4
0.0003 71.1 72.2 70.8 72.6 74.0 71.8 70.8 70.8 70.8
0.01 66.8 68.6 70.0 67.9 67.9 67.9 69.7 69.7 70.4
0.1 60.6 59.9 60.3 61.4 61.4 62.1 63.9 63.9 63.9
Change from original
iso 0 1 2 3 4 5 6 7
0.0 0.4 0.7 0.0 0.4 0.4 0.0 0.4 0.0 0.4
0.0003 7.6 8.7 7.2 9.0 10.5 8.3 7.2 7.2 7.2
0.01 6.1 7.9 9.4 7.2 7.2 7.2 9.0 9.0 9.7
0.1 -0.4 -1.1 -0.7 0.4 0.4 1.1 2.9 2.9 2.9

EWC

Performance
ewc 0 1 2 3 4 5 6 7
0.0 66.4 66.8 66.1 66.4 66.4 66.1 66.4 66.1 66.4
0.03 63.9 64.3 66.1 66.1 64.6 64.6 62.5 63.2 62.1
1.0 68.6 67.5 69.0 66.8 68.6 68.6 67.5 70.0 68.2
10.0 68.2 70.0 71.1 70.4 72.2 72.2 71.1 71.1 72.9
100.0 65.0 65.7 66.4 66.4 65.0 64.6 65.3 64.3 63.2
Change from original
ewc 0 1 2 3 4 5 6 7
0.0 0.4 0.7 0.0 0.4 0.4 0.0 0.4 0.0 0.4
0.03 0.4 0.7 2.5 2.5 1.1 1.1 -1.1 -0.4 -1.4
1.0 6.9 5.8 7.2 5.1 6.9 6.9 5.8 8.3 6.5
10.0 5.1 6.9 7.9 7.2 9.0 9.0 7.9 7.9 9.7
100.0 4.0 4.7 5.4 5.4 4.0 3.6 4.3 3.2 2.2

mmatena commented 3 years ago

Merging best MRPC checkpoint with each MNLI checkpoint from training

Results

Columns are MNLI checkpoint indices. Each index corresponds to training on half an epoch of MNLI.

Isotropic L2

Performance
iso 0 1 2 3 4 5 6 7
0.0 83.6 83.9 83.4 83.2 83.6 83.6 83.4 83.2
0.0003 85.3 85.5 85.1 85.1 84.9 84.7 84.5 84.7
0.01 83.1 83.1 82.9 83.1 83.1 83.3 82.7 83.1
0.1 83.5 82.9 82.9 83.8 83.2 83.3 83.2 84.0
Change from original
iso 0 1 2 3 4 5 6 7
0.0 0.1 0.4 -0.1 -0.3 0.1 0.0 -0.1 -0.4
0.0003 0.4 0.6 0.2 0.2 0.0 -0.2 -0.3 -0.2
0.01 0.1 0.0 -0.1 0.0 0.0 0.2 -0.3 0.0
0.1 0.2 -0.4 -0.4 0.5 -0.1 0.0 -0.1 0.7

EWC

Performance
ewc 0 1 2 3 4 5 6 7
0.0 83.6 83.9 83.4 83.2 83.6 83.6 83.4 83.2
0.03 83.8 83.7 83.3 83.2 83.3 83.4 83.2 83.3
1.0 84.0 84.3 83.5 83.7 83.8 83.7 83.2 83.7
10.0 87.3 87.3 86.7 87.2 86.3 87.3 87.1 87.2
100.0 82.0 82.4 82.3 83.1 83.1 83.1 83.1 83.1
Change from original
ewc 0 1 2 3 4 5 6 7
0.0 0.1 0.4 -0.1 -0.3 0.1 0.0 -0.1 -0.4
0.03 1.6 1.4 1.1 1.0 1.0 1.2 1.0 1.0
1.0 0.8 1.2 0.3 0.5 0.6 0.5 0.0 0.6
10.0 0.9 0.9 0.3 0.9 -0.1 0.9 0.7 0.9
100.0 -1.3 -0.9 -1.0 -0.2 -0.2 -0.2 -0.2 -0.2

mmatena commented 3 years ago

Merging all checkpoints of an RTE run with all checkpoints of an MNLI run

Both runs have isotropic L2 regularization with a strength of 0.0003. Each checkpoint index corresponds to another half epoch of training, which consists of a different number of examples for each task.

Results

Performance

v target \ donor > 0 1 2 3 4 5 6 7
0 69.7 70.4 73.3 69.3 73.6 68.2 71.5 70.0
1 71.1 72.6 74.7 71.5 74.4 73.3 72.2 70.0
2 71.5 71.1 74.4 71.8 73.6 72.2 72.2 70.8
3 69.0 69.7 70.4 72.9 73.3 71.8 69.7 69.0
4 71.1 72.2 70.8 72.6 74.0 71.8 70.8 70.8
5 71.1 71.1 71.5 72.9 74.4 72.9 71.1 70.4
6 70.0 70.8 70.8 72.6 74.0 72.2 71.1 70.0
7 71.1 71.8 72.2 72.9 74.4 73.6 71.8 71.1
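
A short way to read this grid is to look for the best single cell and for the donor column that does best on average across targets. The values below are copied from the Performance table above (rows are target checkpoints, columns are donor checkpoints); the variable names are illustrative.

```python
# Rows: target checkpoint index 0..7. Columns: donor checkpoint index 0..7.
grid = [
    [69.7, 70.4, 73.3, 69.3, 73.6, 68.2, 71.5, 70.0],
    [71.1, 72.6, 74.7, 71.5, 74.4, 73.3, 72.2, 70.0],
    [71.5, 71.1, 74.4, 71.8, 73.6, 72.2, 72.2, 70.8],
    [69.0, 69.7, 70.4, 72.9, 73.3, 71.8, 69.7, 69.0],
    [71.1, 72.2, 70.8, 72.6, 74.0, 71.8, 70.8, 70.8],
    [71.1, 71.1, 71.5, 72.9, 74.4, 72.9, 71.1, 70.4],
    [70.0, 70.8, 70.8, 72.6, 74.0, 72.2, 71.1, 70.0],
    [71.1, 71.8, 72.2, 72.9, 74.4, 73.6, 71.8, 71.1],
]

# Best single (target, donor) pair.
best_score, best_target, best_donor = max(
    (score, t, d) for t, row in enumerate(grid) for d, score in enumerate(row)
)

# Mean score of each donor checkpoint across all target checkpoints.
donor_means = [
    sum(row[d] for row in grid) / len(grid) for d in range(len(grid[0]))
]
best_mean_donor = max(range(len(donor_means)), key=donor_means.__getitem__)

print(best_score, best_target, best_donor)  # 74.7 1 2
print(best_mean_donor, f"{donor_means[best_mean_donor]:.1f}")  # 4 74.0
```

On these numbers, donor checkpoint 4 is the strongest on average, while the single best cell is target 1 merged with donor 2. With a single run per cell, differences of this size may not be significant.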

Change from original for each target checkpoint

v target \ donor > 0 1 2 3 4 5 6 7
0 11.9 12.6 15.5 11.6 15.9 10.5 13.7 12.3
1 8.7 10.1 12.3 9.0 11.9 10.8 9.7 7.6
2 11.6 11.2 14.4 11.9 13.7 12.3 12.3 10.8
3 9.7 10.5 11.2 13.7 14.1 12.6 10.5 9.7
4 7.6 8.7 7.2 9.0 10.5 8.3 7.2 7.2
5 11.2 11.2 11.6 13.0 14.4 13.0 11.2 10.5
6 11.6 12.3 12.3 14.1 15.5 13.7 12.6 11.6
7 8.3 9.0 9.4 10.1 11.6 10.8 9.0 8.3

Change from original target checkpoint with highest score

v target \ donor > 0 1 2 3 4 5 6 7
0 6.1 6.9 9.7 5.8 10.1 4.7 7.9 6.5
1 7.6 9.0 11.2 7.9 10.8 9.7 8.7 6.5
2 7.9 7.6 10.8 8.3 10.1 8.7 8.7 7.2
3 5.4 6.1 6.9 9.4 9.7 8.3 6.1 5.4
4 7.6 8.7 7.2 9.0 10.5 8.3 7.2 7.2
5 7.6 7.6 7.9 9.4 10.8 9.4 7.6 6.9
6 6.5 7.2 7.2 9.0 10.5 8.7 7.6 6.5
7 7.6 8.3 8.7 9.4 10.8 10.1 8.3 7.6

mmatena commented 3 years ago

Merging the best MRPC and RTE checkpoints with the best MNLI checkpoint from the isotropic L2 0.0003 run

Results

Isotropic L2

Performance
iso mrpc rte
0.0 83.1 64.6
0.0003 84.7 70.8
0.01 82.9 70.8
0.1 82.9 68.6
Change from original
iso mrpc rte
0.0 -0.4 -1.4
0.0003 -0.2 7.2
0.01 -0.1 10.1
0.1 -0.4 7.6

EWC

Performance
ewc mrpc rte
0.0 83.1 64.6
0.03 83.3 61.7
1.0 82.7 67.1
10.0 85.1 72.6
100.0 82.2 66.4
Change from original
ewc mrpc rte
0.0 -0.4 -1.4
0.03 1.0 -1.8
1.0 -0.4 5.4
10.0 -1.3 9.4
100.0 -1.1 5.4

mmatena commented 3 years ago

Variational diagonal computation

Set-up

Performance

Examples 4096 4096 4096 4096 4096 4096
Beta 1e-08 1e-08 1e-07 1e-07 1e-06 1e-06
Fisher epoch ⇩\ LR ⇨ 0.001 0.01 0.001 0.01 0.001 0.01
0 72.2 71.8 72.2 72.9 72.2 71.5
1 72.2 72.2 72.2 73.3 72.2 72.6
2 72.9 72.2 71.5 72.2 72.2 72.6
3 72.9 72.2 71.8 71.5 72.2 72.6
4 72.9 72.2 72.2 72.6 72.2 72.2
5 72.9 72.2 71.8 71.8 71.5 71.8
6 72.9 72.6 72.2 71.5 71.8 71.8
7 72.9 72.2 72.2 71.5 72.2 71.5
8 72.9 72.6 72.2 72.2 72.2 71.1
9 72.9 72.6 72.6 72.2 72.2 71.1
10 72.9 72.6 72.9 72.2 72.2 70.8
11 72.9 72.2 72.6 71.8 72.2 70.8
12 72.9 72.9 72.6 71.8 71.8 71.1
13 72.6 72.6 72.2 72.2 71.8 71.5
14 72.2 72.2 71.8 72.2 71.8 71.8
15 72.2 72.6 72.2 71.8 71.8 71.8

Difference from direct estimation

Examples 4096 4096 4096 4096 4096 4096
Beta 1e-08 1e-08 1e-07 1e-07 1e-06 1e-06
Fisher epoch ⇩\ LR ⇨ 0.001 0.01 0.001 0.01 0.001 0.01
0 1.4 1.0 1.4 2.1 1.4 0.7
1 1.4 1.4 1.4 2.5 1.4 1.8
2 2.1 1.4 0.7 1.4 1.4 1.8
3 2.1 1.4 1.0 0.7 1.4 1.8
4 2.1 1.4 1.4 1.8 1.4 1.4
5 2.1 1.4 1.0 1.0 0.7 1.0
6 2.1 1.8 1.4 0.7 1.0 1.0
7 2.1 1.4 1.4 0.7 1.4 0.7
8 2.1 1.8 1.4 1.4 1.4 0.3
9 2.1 1.8 1.8 1.4 1.4 0.3
10 2.1 1.8 2.1 1.4 1.4 -0.0
11 2.1 1.4 1.8 1.0 1.4 -0.0
12 2.1 2.1 1.8 1.0 1.0 0.3
13 1.8 1.8 1.4 1.4 1.0 0.7
14 1.4 1.4 1.0 1.4 1.0 1.0
15 1.4 1.8 1.4 1.0 1.0 1.0

Takeaways (based only on the Fishers computed from 4096 MNLI examples)

Todos

mmatena commented 3 years ago

Examples 4096 4096 4096 4096 4096 4096 32768 32768 32768 32768 32768 32768 262144 262144 262144 262144 262144 262144
Beta 1e-08 1e-08 1e-07 1e-07 1e-06 1e-06 1e-08 1e-08 1e-07 1e-07 1e-06 1e-06 1e-08 1e-08 1e-07 1e-07 1e-06 1e-06
LR 0.001 0.01 0.001 0.01 0.001 0.01 0.001 0.01 0.001 0.01 0.001 0.01 0.001 0.01 0.001 0.01 0.001 0.01
0 72.2 71.8 72.2 72.9 72.2 71.5 72.9 72.2 52.7 72.2 71.5 72.2 72.2 73.3 72.2 71.8
1 72.2 72.2 72.2 73.3 72.2 72.6 72.6 72.2 52.7 72.2 72.6 72.2 72.6 72.6 72.2 72.6
2 72.9 72.2 71.5 72.2 72.2 72.6 73.3 71.5 52.7 72.2 72.6 72.6 72.6 72.2 72.2 72.6
3 72.9 72.2 71.8 71.5 72.2 72.6 72.9 71.8 52.7 71.8 72.6 72.9 72.9 72.6 71.8 72.6
4 72.9 72.2 72.2 72.6 72.2 72.2 73.3 72.2 52.7 72.2 71.8 72.9 72.9 72.2 71.8 72.2
5 72.9 72.2 71.8 71.8 71.5 71.8 73.3 71.8 52.7 71.5 71.8 72.6 72.2 72.6 71.5 71.8
6 72.9 72.6 72.2 71.5 71.8 71.8 72.6 72.2 52.7 71.8 71.8 72.6 72.6 71.8 71.8 71.8
7 72.9 72.2 72.2 71.5 72.2 71.5 72.2 72.2 52.7 72.2 71.5 72.6 72.6 71.8 72.2 71.5
8 72.9 72.6 72.2 72.2 72.2 71.1 71.8 72.6 52.7 72.2 71.5 72.6 72.6 72.2 72.2 71.5
9 72.9 72.6 72.6 72.2 72.2 71.1 71.8 72.6 52.7 72.2 71.1 72.6 72.9 72.6 72.2 71.1
10 72.9 72.6 72.9 72.2 72.2 70.8 72.2 72.6 52.7 72.6 71.1 72.9 73.3 72.6 72.2 70.8
11 72.9 72.2 72.6 71.8 72.2 70.8 72.6 72.6 52.7 72.6 71.1 72.6 72.9 71.8 72.2 70.8
12 72.9 72.9 72.6 71.8 71.8 71.1 72.6 72.6 52.7 71.8 71.5 72.6 73.6 71.8 72.2 71.1
13 72.6 72.6 72.2 72.2 71.8 71.5 72.6 72.2 52.7 72.2 71.5 72.6 73.6 71.8 72.6 71.5
14 72.2 72.2 71.8 72.2 71.8 71.8 72.2 71.8 52.7 72.2 71.8 72.6 73.6 72.2 72.2 71.8
15 72.2 72.6 72.2 71.8 71.8 71.8 71.8 72.2 52.7 71.8 71.8 72.2 73.3 72.2 72.2 71.8

mmatena commented 3 years ago

Performance
Examples 8192 8192 8192 8192 8192 8192
Beta 1e-10 1e-10 1e-10 1e-09 1e-09 1e-09
LR 1e-05 0.0001 0.001 1e-05 0.0001 0.001
0 71.1 71.1 72.2 71.1 71.5 72.2
1 71.1 72.2 72.6 71.1 72.2 72.6
Difference from direct estimation
Examples 8192 8192 8192 8192 8192 8192
Beta 1e-10 1e-10 1e-10 1e-09 1e-09 1e-09
LR 1e-05 0.0001 0.001 1e-05 0.0001 0.001
0 0.3 0.3 1.4 0.3 0.7 1.4
1 0.3 1.4 1.8 0.3 1.4 1.8