sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework
https://elasticdl.org
MIT License

Investigating models with different initial values #179

Closed skydoorkai closed 5 years ago

skydoorkai commented 5 years ago

Experiment Setup

- Dataset: CIFAR10
- Model: ResNet18
- Initialize the model with random values, train for 30 epochs, and save the model twice: at epoch 0 (just after initialization) and at epoch 30.
- Run the experiment 8 times with different initial values, giving 8 models (A, B, C, D, E, F, G, H). Each model has two versions (just after init, and at epoch 30); for example, A = (A0, A30).
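
A minimal sketch of this setup, assuming torchvision's `resnet18` and a CIFAR10 train loader; the actual training script is not part of this issue, and `run_experiment` and the checkpoint paths are illustrative names only:

```python
import torch
import torchvision

def run_experiment(run_id, train_loader, epochs=30):
    # Fresh random initialization for each run (A, B, ..., H).
    model = torchvision.models.resnet18(num_classes=10)
    torch.save(model.state_dict(), f"model_{run_id}_epoch0.pt")   # e.g. A0

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), f"model_{run_id}_epoch30.pt")  # e.g. A30
```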

Model Accuracy at epoch 30

All the models reach similar accuracy around 91%.

| A30 | B30 | C30 | D30 | E30 | F30 | G30 | H30 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.9168 | 0.915 | 0.9186 | 0.9157 | 0.9179 | 0.9156 | 0.9157 | 0.9138 |

Distance Among Models

Euclidean distance is used to measure the distance between two models. At epoch 0, the distances among these 8 models are all around 63.

|     | A0 | B0 | C0 | D0 | E0 | F0 | G0 | H0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A0 | 0 | 63.538 | 63.2943 | 63.293 | 63.2954 | 63.2495 | 63.3495 | 63.2896 |
| B0 | 63.538 | 0 | 63.4434 | 63.196 | 63.5102 | 63.4507 | 63.5517 | 63.255 |
| C0 | 63.2943 | 63.4434 | 0 | 63.2875 | 63.4452 | 63.3633 | 63.4893 | 63.4266 |
| D0 | 63.293 | 63.196 | 63.2875 | 0 | 63.2782 | 63.3322 | 63.2968 | 63.3357 |
| E0 | 63.2954 | 63.5102 | 63.4452 | 63.2782 | 0 | 63.447 | 63.5054 | 63.3675 |
| F0 | 63.2495 | 63.4507 | 63.3633 | 63.3322 | 63.447 | 0 | 63.3896 | 63.2151 |
| G0 | 63.3495 | 63.5517 | 63.4893 | 63.2968 | 63.5054 | 63.3896 | 0 | 63.4337 |
| H0 | 63.2896 | 63.255 | 63.4266 | 63.3357 | 63.3675 | 63.2151 | 63.4337 | 0 |
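
A minimal sketch of how such a distance could be computed, treating each checkpoint as one flat parameter vector (assuming the checkpoints are PyTorch state_dicts; `model_distance` and the file names are illustrative, not from this repo):

```python
import torch

def model_distance(state_dict_a, state_dict_b):
    # Flatten every parameter/buffer into one long vector per model,
    # then take the Euclidean (L2) norm of the difference.
    vec_a = torch.cat([p.flatten().float() for p in state_dict_a.values()])
    vec_b = torch.cat([p.flatten().float() for p in state_dict_b.values()])
    return torch.norm(vec_a - vec_b).item()

# Example usage:
# a0 = torch.load("model_A_epoch0.pt")
# b0 = torch.load("model_B_epoch0.pt")
# print(model_distance(a0, b0))  # around 63 in the table above
```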

At epoch 30, the models are still far apart from each other, around 43.

|     | A30 | B30 | C30 | D30 | E30 | F30 | G30 | H30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A30 | 0 | 43.0787 | 42.7444 | 42.8177 | 43.0152 | 42.7708 | 42.7432 | 42.8405 |
| B30 | 43.0787 | 0 | 42.8902 | 42.9404 | 43.2092 | 42.9313 | 42.9368 | 42.9171 |
| C30 | 42.7444 | 42.8902 | 0 | 42.711 | 42.8276 | 42.6194 | 42.6122 | 42.6917 |
| D30 | 42.8177 | 42.9404 | 42.711 | 0 | 42.9537 | 42.6282 | 42.6784 | 42.6779 |
| E30 | 43.0152 | 43.2092 | 42.8276 | 42.9537 | 0 | 42.8523 | 42.8247 | 42.8589 |
| F30 | 42.7708 | 42.9313 | 42.6194 | 42.6282 | 42.8523 | 0 | 42.5544 | 42.6069 |
| G30 | 42.7432 | 42.9368 | 42.6122 | 42.6784 | 42.8247 | 42.5544 | 0 | 42.6817 |
| H30 | 42.8405 | 42.9171 | 42.6917 | 42.6779 | 42.8589 | 42.6069 | 42.6817 | 0 |

Also shown below is the distance between the epoch-0 and epoch-30 versions of each model.

| Model | epoch0-30 distance |
| --- | --- |
| A | 53.2792 |
| B | 53.2698 |
| C | 53.2890 |
| D | 52.9610 |
| E | 53.4125 |
| F | 53.1456 |
| G | 53.2507 |
| H | 53.4660 |

We think of the training process as finding the highest mountain in a high-dimensional model space, with the model parameters as the dimensions of that space. From the distance values above, the 8 models are climbing different mountains, and these mountains have similar altitude. The reason multiple (or even infinitely many) highest mountains exist is that there may be multiple (or even infinitely many) solutions to the optimization problem in model training.

Accuracy Along Model Linear Interpolation Path

Given two models M1 and M2, linear interpolation is used to get the models between them in the model space:

M = (1 - alpha) * M1 + alpha * M2
for alpha between 0 and 1
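
A minimal sketch of the interpolation and of sweeping the accuracy along the path (assuming the epoch-30 checkpoints are PyTorch state_dicts and a test loader is available; the helper names are illustrative, not from this repo):

```python
import torch

def interpolate(sd_1, sd_2, alpha):
    # M = (1 - alpha) * M1 + alpha * M2, parameter by parameter.
    # BatchNorm buffers are interpolated the same way for simplicity.
    return {name: (1.0 - alpha) * sd_1[name] + alpha * sd_2[name] for name in sd_1}

@torch.no_grad()
def accuracy(model, loader):
    # Top-1 accuracy of `model` over `loader`.
    model.eval()
    correct = total = 0
    for images, labels in loader:
        pred = model(images).argmax(dim=1)
        correct += (pred == labels).sum().item()
        total += labels.numel()
    return correct / total

# Accuracy along one path, e.g. from A30 to B30:
# for alpha in torch.linspace(0, 1, 11):
#     model.load_state_dict(interpolate(sd_a30, sd_b30, alpha.item()))
#     print(alpha.item(), accuracy(model, test_loader))
```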

The accuracies along the 7 interpolation paths between model A and each of the other 7 models at epoch 30 are depicted below. (figure: m8)

This verifies our assumption that each training run is climbing a different mountain.

Conclusion

There is no mountain in the model space that is higher than all the other mountains.
There are numerous (or even infinitely many) mountains, all with the same highest altitude. The initial values and the training strategy determine which mountain a training process will eventually climb.

yuyicg commented 5 years ago

maybe the mountain looks like this:

(screenshot: picture from "How neural networks are trained")

and it is consistent with the (empirical) point of view in the paper "The Loss Surfaces of Multilayer Networks".

zou000 commented 5 years ago

This paper is more recent; one of its conclusions is that skip connections make the loss surface more convex. There are more examples here: https://www.cs.umd.edu/~tomg/projects/landscapes/

skydoorkai commented 5 years ago

This 2015 paper has already plotted model accuracy along the interpolation path between two models, and got similar results.

If we assume that models A and B with similar accuracy are on the same mountain, the following must hold: for any model M = A + alpha * (B - A) with alpha in (0, 1), the accuracy of M should not be significantly lower than the accuracy of A or B. So, from the plots above, the models are on different mountains.

The paper Visualizing the Loss Landscape of Neural Nets says: (1) If a model architecture is good enough (such as ResNet), the landscape is nearly convex, so training can converge; if the architecture is bad (such as ResNet without skip connections), the loss has many local minima and training will fail. (2) The approach of the 2015 paper, comparing two models by their parameter values alone, is not a good one, because a model has many invariances, such as scale invariance from BatchNorm, or more complicated invariances from ReLU (the negative input part). So even if the layer values differ a lot between two models, the models may be equivalent. The paper proposes filter normalization to address the scale invariance problem.

A model must have many invariances, and each invariance can produce infinitely many mountains of the same height whose parameter values differ significantly.

Also, some research papers state that there are no spurious local minima in certain DL models, so every local minimum is as good as the global minimum. The paper Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima gives an example where a spurious local minimum does exist, but shows that with a good initialization (weight normalization on random values), the probability of avoiding the spurious local minimum is very high. So spurious local minima are not a problem.

skydoorkai commented 5 years ago

More experiments

We will generate 2D heatmaps from 3 models to show the model accuracy on a cross section of the model space defined by those 3 models.

Given models A, B, C, we define the models on the cross section as M(u, v) = A + u * (B - A) + v * (C - A), so that M(0, 0) = A, M(1, 0) = B, M(0, 1) = C. Here, we use the two vectors (B - A) and (C - A) to create a 2D space. Note that (B - A) and (C - A) are not orthogonal, though we draw the 2D heatmap as if the u and v axes were orthogonal, to simplify the drawing. We draw a patch with u and v in the (-0.5, 1.5) range. In the plot title, Best is the best accuracy in this 2D patch.
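
A minimal sketch of how such a cross-section heatmap could be filled in, reusing the state_dict arithmetic and the `accuracy` helper sketched above (`plane_point` and `accuracy_heatmap` are illustrative names, not from this repo):

```python
import numpy as np

def plane_point(sd_a, sd_b, sd_c, u, v):
    # M(u, v) = A + u * (B - A) + v * (C - A), parameter by parameter.
    u, v = float(u), float(v)
    return {name: sd_a[name] + u * (sd_b[name] - sd_a[name]) + v * (sd_c[name] - sd_a[name])
            for name in sd_a}

def accuracy_heatmap(model, test_loader, sd_a, sd_b, sd_c, n=21):
    # Evaluate accuracy on an n x n grid of (u, v) points in the (-0.5, 1.5) range.
    us = np.linspace(-0.5, 1.5, n)
    vs = np.linspace(-0.5, 1.5, n)
    grid = np.zeros((n, n))
    for i, v in enumerate(vs):
        for j, u in enumerate(us):
            model.load_state_dict(plane_point(sd_a, sd_b, sd_c, u, v))
            grid[i, j] = accuracy(model, test_loader)
    return grid  # grid.max() corresponds to "Best" in the plot titles
```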

  1. A, B, C trained with different random initializations. A, B, C are on different mountains, even though their accuracies are nearly the same. (figure: random)

  2. A, B, C trained with the same random initialization, but with different training data orders. Again, A, B, C are on different mountains, though the ravines between them are not as deep as in 1. (figure: same_init)

  3. A, B, C trained with the same initialization from a model with accuracy 0.339, but with different training data orders. (a) After 0.2 epoch: for the area inside A, B, C, all the models in it have similar or even slightly higher accuracy than A, B, C. So A, B, C are on the same mountain. (figure: same_e0b300)

(b) After 1 epoch: still similar to (a). (figure: same_e1)

(c) After 9 epochs: now, although the area inside A, B, C still has good accuracy, A, B, C themselves are the highest. A, B, C are still on the same mountain, but on different high rocks. (figure: same_e9)

skydoorkai commented 5 years ago

Another experiment

Instead of training for high accuracy to find the highest mountain, we train the model toward the lowest accuracy to find the abyss, by changing the loss function from:

loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(output, target)

to:

softmax_fn = torch.nn.Softmax(dim=1)        # class probabilities p
nllloss_fn = torch.nn.NLLLoss()
softmax_out = softmax_fn(output)
softmax_out = torch.add(softmax_out, -1.0)  # p - 1
log_out = torch.log(-softmax_out)           # log(1 - p)
loss = nllloss_fn(log_out, target)          # -log(1 - p_target); minimizing drives p_target toward 0

A, B, C are trained with different random initializations and different training data orders. (figure: low_heatmap)

Also shown is the accuracy of the models obtained by linear interpolation between A and B, M(x) = A + x * (B - A), with x in the (-1, 2) range. (figure: m)

skydoorkai commented 5 years ago

From the experiments above, we can see that for CIFAR10 with ResNet18, any random initialization lands on the vast plain with accuracy = 0.1 (chance level for 10 classes).

If starting from the plain (acc = 0.1), then with the randomness in the training data order, each worker will climb a different mountain.

If the starting point of all the workers is already on a mountain (such as at acc = 0.339), they will climb the same mountain at a similar pace (the accuracy is determined by the amount of training data processed, until the worker can climb no higher).

There are an infinite number of mountains, and also an infinite number of abysses. From the experiments, all the mountains from ResNet18 training have similar height. Probably ResNet18 is a good model, so there are no spurious local minima.