Open tmgrgg opened 4 years ago
After plotting bands of constant space cost, I identify the top 10% of solutions in each band:
(Red lines mark the bands, each of width 10; the dashed blue line is y = x.)
Visually, I'm not sure that this diagram actually shows anything interesting. (I'll get back to thinking about this.)
The graph below, however, shows the best solution for each fixed cost from 0 to 500 (the size of the star marker represents the cost of the solution). It shows more clearly that with limited resources, we should dedicate them to ensembling. Note, however, that half of this diagram is missing, so it is not yet a true comparison. I do not expect further ensembling to improve predictive performance (this is somewhat clear from how flat the heatmap is), but I'll run the full 30 x 30 once I have access to the cluster.
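As a rough illustration of how the "best solution per fixed cost" plot can be derived from the heatmap data, here is a minimal sketch. It assumes the validation losses sit in a 2D array indexed by (number of ensembled models, rank), both starting at 1, and that cost is the product n * k in units of |theta|. The function name and array layout are hypothetical and not taken from the actual experiment code.

```python
import numpy as np


def best_per_cost(valid_loss, max_cost=500):
    """Return {cost: (n_ensembled, rank, loss)} with the lowest loss per exact cost.

    Assumption: valid_loss[i, j] is the validation loss for
    n_ensembled = i + 1 models, each of rank = j + 1.
    """
    valid_loss = np.asarray(valid_loss, dtype=float)
    n_max, k_max = valid_loss.shape
    n = np.arange(1, n_max + 1)[:, None]
    k = np.arange(1, k_max + 1)[None, :]
    cost = n * k  # assumed storage cost in units of |theta|

    best = {}
    for c in np.unique(cost):  # unique costs, sorted ascending
        if c > max_cost:
            break
        # Hide every cell whose cost differs from c, then take the minimum.
        masked = np.where(cost == c, valid_loss, np.inf)
        i, j = np.unravel_index(np.argmin(masked), masked.shape)
        best[int(c)] = (int(i) + 1, int(j) + 1, float(valid_loss[i, j]))
    return best


# Toy 2 x 3 grid (made-up numbers, purely for illustration).
toy = np.array([[1.0, 0.8, 0.7],
                [0.6, 0.5, 0.45]])
result = best_per_cost(toy)
```

In the toy grid, cost 2 can be reached either as one rank-2 model or as two rank-1 models, and the sketch keeps whichever has the lower loss.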
In the plot below, I have grouped each solution/square into bands (of width 20) and normalised each solution's validation loss by subtracting the band mean and dividing by the band standard deviation.
Below is the same plot with a band width of 10:
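For reference, the band-wise normalisation used in these two plots can be sketched as follows. This is a minimal version under stated assumptions: bands are half-open cost intervals of the form [b * width, (b + 1) * width), the population standard deviation is used, and the function name and inputs are hypothetical rather than taken from the experiment code.

```python
import numpy as np


def normalise_within_bands(cost, valid_loss, band_width=20):
    """Z-score each validation loss against the other solutions in its cost band."""
    cost = np.asarray(cost, dtype=float).ravel()
    loss = np.asarray(valid_loss, dtype=float).ravel()

    # Band index: 0 for cost in [0, width), 1 for [width, 2 * width), ...
    band = (cost // band_width).astype(int)

    z = np.full_like(loss, np.nan)
    for b in np.unique(band):
        members = band == b
        std = loss[members].std()  # population standard deviation
        if std > 0:  # bands with a single or constant value stay NaN
            z[members] = (loss[members] - loss[members].mean()) / std
    return band, z


# Toy data (made-up numbers): three solutions in band 0, two in band 1.
band, z = normalise_within_bands([1, 5, 19, 20, 39], [1.0, 2.0, 3.0, 10.0, 20.0])
```

Calling the same function with `band_width=10` gives the narrower-band variant.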
I reran this experiment, this time training the SWAG solutions for longer (150 training epochs and 150 SWAG epochs, versus 50 training epochs and 100 SWAG epochs), and we get a very different picture.
I trained a MultiSWAG solution consisting of (up to) 15 models on DenseNet10 x FashionMNIST, increasing the rank of each individual SWAG solution incrementally to produce this heatmap demonstrating a broad picture of the complementary benefits of modelling local and global uncertainty:
Observations:
It seems clear that increasing the rank of a unimodal SWAG approximation has a much weaker effect on solution quality than increasing the number of ensembled solutions.
It seems that once a certain threshold of ensembled solutions has been reached, somewhere between 5 and 10, resources are better dedicated to improving the local approximations.
Note that for a fixed MultiSWAG model with n_ensembled = n, rank = k, moving up one cell in the graph incurs an additional storage cost of k (n rank-k modes -> (n + 1) rank-k modes), whereas moving one cell to the right incurs an additional storage cost of n (n rank-k modes -> n rank-(k + 1) modes), both in units of |theta|.
That is, the cost of each solution can be read as kn|theta|. I'm currently playing around with a couple of ideas for when to choose one over the other...
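To make the cost bookkeeping concrete, here is a small sketch of the kn|theta| formula and the two marginal costs. It counts only the low-rank components, as in the formula above, and the function names are hypothetical.

```python
def storage_cost(n_ensembled, rank, n_params=1):
    """Storage cost k * n * |theta|; n_params stands in for |theta|."""
    return n_ensembled * rank * n_params


def marginal_costs(n_ensembled, rank, n_params=1):
    """Extra storage for (one more ensembled model, one more rank per model)."""
    base = storage_cost(n_ensembled, rank, n_params)
    add_model = storage_cost(n_ensembled + 1, rank, n_params) - base  # k * |theta|
    add_rank = storage_cost(n_ensembled, rank + 1, n_params) - base   # n * |theta|
    return add_model, add_rank


# With n = 5 models of rank k = 3, in units of |theta|:
add_model, add_rank = marginal_costs(5, 3)
```

Under this accounting, adding a model is the cheaper step whenever k < n, and increasing the rank is cheaper whenever n < k.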