tmgrgg / localvsglobaluncertainty

Empirical analysis of recent stochastic gradient methods for approximate inference in Bayesian deep learning, including SWA-Gaussian, MultiSWAG, and deep ensembles. See report_localglobal.pdf.

Experiment 1a (Simple Analysis of SWAG) #1

Open tmgrgg opened 4 years ago

tmgrgg commented 4 years ago

Analysis of performance of SWAG versus approximation rank (1a)

(Dataset: FashionMNIST, Model: DenseNet with depth 10)

I wanted to get an idea of how the SWAG posterior approximation's performance improves with rank. Since I was writing most of the code alongside running the experiments, the code is run in notebooks on a Colab GPU; I intend to standardise these so that parameterised experiments can be run from the command line. For now I'll just link the notebooks.

1. I used SGD to descend to a suitably strong mode which would act as the pretrained solution for the SWAG sampling process.

(Figure: pretrained training graph)

Final pretrained model performance is:

::: Train :::
 {'loss': 0.034561568461060524, 'accuracy': 99.41}
::: Valid :::
 {'loss': 0.23624498672485353, 'accuracy': 92.41}
::: Test :::
 {'loss': 0.2568485828399658, 'accuracy': 92.28}

Note also that I adopted the same learning rate schedule when training the initial solution as in the original paper, namely:

def schedule(lr_init, epoch, max_epochs):
    # Piecewise schedule from the SWAG paper: constant at lr_init for the
    # first half of training, linear decay to FINAL_LR between 50% and 90%
    # of training, then constant at FINAL_LR until the end.
    # FINAL_LR is a module-level constant.
    t = epoch / max_epochs
    lr_ratio = FINAL_LR / lr_init
    if t <= 0.5:
        factor = 1.0
    elif t <= 0.9:
        factor = 1.0 - (1.0 - lr_ratio) * (t - 0.5) / 0.4
    else:
        factor = lr_ratio
    return lr_init * factor
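For reference, a minimal sketch of how this schedule might drive the pretraining run; LR_INIT, MAX_EPOCHS, model, train_loader, and train_one_epoch are illustrative placeholders, not the repo's actual names:

import torch

LR_INIT = 0.1      # illustrative initial learning rate
FINAL_LR = 0.005   # the constant referenced inside schedule()
MAX_EPOCHS = 100

optimizer = torch.optim.SGD(model.parameters(), lr=LR_INIT,
                            momentum=0.9, weight_decay=1e-4)

for epoch in range(MAX_EPOCHS):
    # Recompute the LR once per epoch and push it into every param group.
    lr = schedule(LR_INIT, epoch, MAX_EPOCHS)
    for group in optimizer.param_groups:
        group['lr'] = lr
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper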

2. I then used this pretrained solution to build a SWAG model (more as a test for the Posterior and Sampling classes).

The notebook is: https://colab.research.google.com/drive/1tma5QHPAM8K9dRjBfUV0Qv_C_yiQQtwP?usp=sharing

The model in the notebook above was trained with the following parameters:

SWA_LR = 0.005                 # constant learning rate during SWAG collection
SWA_MOMENTUM = 0.85            # SGD momentum during SWAG collection
L2 = 1e-4                      # weight decay
RANK = 30                      # max rank of the low-rank covariance factor
SAMPLES_PER_EPOCH = 1          # snapshots folded into the posterior per epoch
SAMPLE_FREQ = int((1 / SAMPLES_PER_EPOCH) * len(train_set) / batch_size)
SAMPLING_CONDITION = lambda: True  # always eligible to record a snapshot
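For context, here is a minimal sketch of what the Posterior and Sampling machinery does conceptually: maintain running first and second moments of the SGD iterates plus a buffer of the last RANK deviation vectors, then sample from the resulting Gaussian (as in Maddox et al., 2019). This is an illustration, not the repo's actual classes:

import math
import torch

def flatten(model):
    # Concatenate all model parameters into a single vector.
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

class SWAGSketch:
    # Illustrative SWAG posterior; not the repo's Posterior class.

    def __init__(self, model, rank):
        d = flatten(model).numel()
        self.rank, self.n = rank, 0
        self.mean = torch.zeros(d)      # running first moment (the SWA mean)
        self.sq_mean = torch.zeros(d)   # running second moment
        self.deviations = []            # last `rank` deviation columns

    def update(self, model):
        # Fold one SGD iterate into the moments (called every SAMPLE_FREQ steps).
        w = flatten(model)
        self.n += 1
        self.mean += (w - self.mean) / self.n
        self.sq_mean += (w ** 2 - self.sq_mean) / self.n
        self.deviations.append(w - self.mean)
        if len(self.deviations) > self.rank:
            self.deviations.pop(0)

    def sample(self, k=None):
        # Draw weights from the rank-k SWAG Gaussian.
        k = k or self.rank
        var = torch.clamp(self.sq_mean - self.mean ** 2, min=1e-30)
        diag = var.sqrt() * torch.randn_like(self.mean) / math.sqrt(2.0)
        if k > 1:
            D = torch.stack(self.deviations[-k:], dim=1)   # d x k matrix
            low_rank = D @ torch.randn(k) / math.sqrt(2.0 * (k - 1))
        else:
            low_rank = 0.0  # rank 1: the low-rank term is degenerate, skip it
        return self.mean + diag + low_rank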

I plotted the SWA validation curve (the SWA solution is the expected value of the SWAG posterior) as a proxy for the validation learning curve of the SWAG solution.

(Figure: SWA training curve)

I also plotted the performance of the final SWAG model against the number of samples drawn for Bayesian model averaging:

(Figure: BMA samples vs. performance)

Looks weird. May want to investigate the code.
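For context on what "samples for Bayesian model averaging" means here: predictions are averaged over weight samples drawn from the SWAG posterior, re-fitting BatchNorm statistics after each draw. A minimal sketch, assuming the SWAGSketch class above and hypothetical set_weights/update_bn helpers (not the repo's API):

import torch

@torch.no_grad()
def bma_predict(model, posterior, loader, n_samples=30, k=None):
    # Average softmax outputs (not logits) over n_samples posterior draws.
    avg_probs = None
    for _ in range(n_samples):
        set_weights(model, posterior.sample(k))  # hypothetical helper
        update_bn(model, loader)                 # hypothetical helper: refit BatchNorm stats
        probs = torch.cat([torch.softmax(model(x), dim=-1) for x, _ in loader])
        avg_probs = probs if avg_probs is None else avg_probs + probs
    return avg_probs / n_samples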

3. I then used this pretrained solution to evaluate SWAG models trained with different approximation ranks, from k = 1 to 30 in steps of 2. I trained each for 100 SWAG epochs and drew one sample per epoch (same as above).
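Conceptually, the sweep is just a loop over ranks; a sketch under the same assumptions as the code above. Note that this truncates a single rank-30 posterior per rank, which is a simplification of the actual experiment (a separate SWAG model was trained for each rank); accuracy, valid_loader, and valid_labels are hypothetical placeholders:

# Sweep the approximation rank and record BMA validation accuracy.
results = {}
for k in range(1, 31, 2):
    probs = bma_predict(model, posterior, valid_loader, n_samples=30, k=k)
    results[k] = accuracy(probs, valid_labels)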

The training notebook is here: https://colab.research.google.com/drive/1L2D7aAXxOdrhK-Kk3FxUs9vhsjP_vTf6?usp=sharing
The analysis notebook is here: https://colab.research.google.com/drive/1DHbudUH2BFdlJgmCopv93arsCx9advu6?usp=sharing

What I've found so far is a little surprising (though it could well be down to an implementation error, or some other problem with my experiment, e.g. poor parameter choices):

SWAG rank versus SWAG performance on train and validation data (with N = 30 samples for Bayesian model averaging):

(Figure: swag_rank_v_performance_v1)

I also plotted the SWA performance (we would expect this to be essentially constant, as SWA is independent of rank):

(Figure: SWA_swag_rank_v_performance_v1)

tmgrgg commented 4 years ago

After fixing a bunch of issues and comparing my implementation against a run of the original authors' implementation to verify its correctness, I am now confident that the implementations agree, and have obtained the following results, which replace those above:

This is with 30 samples in the Bayesian model average and 50 "swag samples":

(Figure: results with 50 SWAG samples)

This is with 30 samples in the Bayesian model average and 150 "swag samples":

(Figure: results with 150 SWAG samples)

A much clearer improvement from including local uncertainty!