sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework
https://elasticdl.org
MIT License

Investigating models with different initial values #179

Closed skydoorkai closed 5 years ago

skydoorkai commented 5 years ago

Experiment Setup

- Dataset: CIFAR10
- Model: ResNet18
- Initialize the model with random values, train for 30 epochs, and save the model twice: at epoch 0 (just after initialization) and at epoch 30.
- Run the experiment 8 times with different initial values, giving 8 models (A, B, C, D, E, F, G, H). Each model has two versions (just after init, and at epoch 30); for example, A = (A0, A30).
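
A minimal sketch of this setup, assuming torchvision's `resnet18` and a CIFAR10 train loader; the actual training script is not part of this issue, and `run_experiment` and the checkpoint paths are illustrative names only:

```python
import torch
import torchvision

def run_experiment(run_id, train_loader, epochs=30):
    # Fresh random initialization for each run (A, B, ..., H).
    model = torchvision.models.resnet18(num_classes=10)
    torch.save(model.state_dict(), f"model_{run_id}_epoch0.pt")   # e.g. A0

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), f"model_{run_id}_epoch30.pt")  # e.g. A30
```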

Model Accuracy at epoch 30

All the models reach similar accuracy around 91%.

| A30 | B30 | C30 | D30 | E30 | F30 | G30 | H30 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.9168 | 0.915 | 0.9186 | 0.9157 | 0.9179 | 0.9156 | 0.9157 | 0.9138 |

Distance Among Models

Euclidean distance is used to measure the distance between two models. At epoch 0, the distances among these 8 models are all around 63.

|     | A0 | B0 | C0 | D0 | E0 | F0 | G0 | H0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A0 | 0 | 63.538 | 63.2943 | 63.293 | 63.2954 | 63.2495 | 63.3495 | 63.2896 |
| B0 | 63.538 | 0 | 63.4434 | 63.196 | 63.5102 | 63.4507 | 63.5517 | 63.255 |
| C0 | 63.2943 | 63.4434 | 0 | 63.2875 | 63.4452 | 63.3633 | 63.4893 | 63.4266 |
| D0 | 63.293 | 63.196 | 63.2875 | 0 | 63.2782 | 63.3322 | 63.2968 | 63.3357 |
| E0 | 63.2954 | 63.5102 | 63.4452 | 63.2782 | 0 | 63.447 | 63.5054 | 63.3675 |
| F0 | 63.2495 | 63.4507 | 63.3633 | 63.3322 | 63.447 | 0 | 63.3896 | 63.2151 |
| G0 | 63.3495 | 63.5517 | 63.4893 | 63.2968 | 63.5054 | 63.3896 | 0 | 63.4337 |
| H0 | 63.2896 | 63.255 | 63.4266 | 63.3357 | 63.3675 | 63.2151 | 63.4337 | 0 |
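
A minimal sketch of how such a distance could be computed, treating each checkpoint as one flat parameter vector (assuming the checkpoints are PyTorch state_dicts; `model_distance` and the file names are illustrative, not from this repo):

```python
import torch

def model_distance(state_dict_a, state_dict_b):
    # Flatten every parameter/buffer into one long vector per model,
    # then take the Euclidean (L2) norm of the difference.
    vec_a = torch.cat([p.flatten().float() for p in state_dict_a.values()])
    vec_b = torch.cat([p.flatten().float() for p in state_dict_b.values()])
    return torch.norm(vec_a - vec_b).item()

# Example usage:
# a0 = torch.load("model_A_epoch0.pt")
# b0 = torch.load("model_B_epoch0.pt")
# print(model_distance(a0, b0))  # around 63 in the table above
```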

At epoch 30, the models are still far apart from each other, around 43.

|     | A30 | B30 | C30 | D30 | E30 | F30 | G30 | H30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A30 | 0 | 43.0787 | 42.7444 | 42.8177 | 43.0152 | 42.7708 | 42.7432 | 42.8405 |
| B30 | 43.0787 | 0 | 42.8902 | 42.9404 | 43.2092 | 42.9313 | 42.9368 | 42.9171 |
| C30 | 42.7444 | 42.8902 | 0 | 42.711 | 42.8276 | 42.6194 | 42.6122 | 42.6917 |
| D30 | 42.8177 | 42.9404 | 42.711 | 0 | 42.9537 | 42.6282 | 42.6784 | 42.6779 |
| E30 | 43.0152 | 43.2092 | 42.8276 | 42.9537 | 0 | 42.8523 | 42.8247 | 42.8589 |
| F30 | 42.7708 | 42.9313 | 42.6194 | 42.6282 | 42.8523 | 0 | 42.5544 | 42.6069 |
| G30 | 42.7432 | 42.9368 | 42.6122 | 42.6784 | 42.8247 | 42.5544 | 0 | 42.6817 |
| H30 | 42.8405 | 42.9171 | 42.6917 | 42.6779 | 42.8589 | 42.6069 | 42.6817 | 0 |

Also shown below is the distance between the epoch-0 and epoch-30 versions of each model.

| Model | epoch0-30 distance |
| --- | --- |
| A | 53.2792 |
| B | 53.2698 |
| C | 53.2890 |
| D | 52.9610 |
| E | 53.4125 |
| F | 53.1456 |
| G | 53.2507 |
| H | 53.4660 |

We think of the training process as finding the highest mountain in a high-dimensional model space, with the model parameters as the dimensions of that space. From the distance values above, the 8 models are climbing different mountains, and these mountains have similar altitude. The reason multiple (or even infinitely many) highest mountains exist is that there may be multiple (or even infinitely many) solutions to the optimization problem in model training.

Accuracy Along Model Linear Interpolation Path

Given two models M1 and M2, linear interpolation is used to get the models between them in the model space:

M = (1 - alpha) * M1 + alpha * M2
for alpha between 0 and 1
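
A minimal sketch of the interpolation and of sweeping the accuracy along the path (assuming the epoch-30 checkpoints are PyTorch state_dicts and a test loader is available; the helper names are illustrative, not from this repo):

```python
import torch

def interpolate(sd_1, sd_2, alpha):
    # M = (1 - alpha) * M1 + alpha * M2, parameter by parameter.
    # BatchNorm buffers are interpolated the same way for simplicity.
    return {name: (1.0 - alpha) * sd_1[name] + alpha * sd_2[name] for name in sd_1}

@torch.no_grad()
def accuracy(model, loader):
    # Top-1 accuracy of `model` over `loader`.
    model.eval()
    correct = total = 0
    for images, labels in loader:
        pred = model(images).argmax(dim=1)
        correct += (pred == labels).sum().item()
        total += labels.numel()
    return correct / total

# Accuracy along one path, e.g. from A30 to B30:
# for alpha in torch.linspace(0, 1, 11):
#     model.load_state_dict(interpolate(sd_a30, sd_b30, alpha.item()))
#     print(alpha.item(), accuracy(model, test_loader))
```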

The accuracies along the 7 interpolation paths between model A and each of the other 7 models at epoch 30 are depicted below. (figure: m8)

This verifies our assumption that each training run is climbing a different mountain.

Conclusion

There is no mountain in the model space that is higher than all the other mountains.
There are numerous (or even infinitely many) mountains, all with the same highest altitude. The initial values and the training strategy determine which mountain a training process will eventually climb.

yuyicg commented 5 years ago

maybe the mountain looks like this:

(screenshot: picture from "How neural networks are trained")

and it is consistent with the (empirical) point of view in the paper "The Loss Surfaces of Multilayer Networks".

zou000 commented 5 years ago

This paper is more recent; one of its conclusions is that skip connections make the loss surface more convex. There are more examples here: https://www.cs.umd.edu/~tomg/projects/landscapes/

skydoorkai commented 5 years ago

This 2015 paper has already plotted model accuracy along the interpolation path between two models, and got similar results.

If we assume that models A and B with similar accuracy are on the same mountain, the following must hold: for any model M = A + alpha * (B - A) with alpha in (0, 1), the accuracy of M should not be significantly lower than the accuracy of A or B. So, from the plots above, the models are on different mountains.

The paper Visualizing the Loss Landscape of Neural Nets says: (1) If a model architecture is good enough (such as ResNet), the landscape is nearly convex, so training can converge; if the architecture is bad (such as ResNet without skip connections), the loss has many local minima and training will fail. (2) The approach of the 2015 paper, comparing two models by their parameter values alone, is not a good one, because a model has many invariances, such as scale invariance from BatchNorm, or more complicated invariances from ReLU (the negative input part). So even if the layer values differ a lot between two models, the models may be equivalent. The paper proposes filter normalization to address the scale invariance problem.

A model must have many invariances, and each invariance can produce infinitely many mountains of the same height whose parameter values differ significantly.

Also, some research papers state that there are no spurious local minima in certain DL models, so every local minimum is as good as the global minimum. The paper Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima gives an example where a spurious local minimum does exist, but shows that with a good initialization (weight normalization on random values), the probability of avoiding the spurious local minimum is very high. So spurious local minima are not a problem.

skydoorkai commented 5 years ago

More experiments

We will generate 2D heatmaps from 3 models to show the model accuracy on a cross section of the model space defined by those 3 models.

Given models A, B, C, we define the models on the cross section as M(u, v) = A + u * (B - A) + v * (C - A), so that M(0, 0) = A, M(1, 0) = B, M(0, 1) = C. Here, we use the two vectors (B - A) and (C - A) to create a 2D space. Note that (B - A) and (C - A) are not orthogonal, though we draw the 2D heatmap as if the u and v axes were orthogonal, to simplify the drawing. We draw a patch with u and v in the (-0.5, 1.5) range. In the plot title, Best is the best accuracy in this 2D patch.
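
A minimal sketch of how such a cross-section heatmap could be filled in, reusing the state_dict arithmetic and the `accuracy` helper sketched above (`plane_point` and `accuracy_heatmap` are illustrative names, not from this repo):

```python
import numpy as np

def plane_point(sd_a, sd_b, sd_c, u, v):
    # M(u, v) = A + u * (B - A) + v * (C - A), parameter by parameter.
    u, v = float(u), float(v)
    return {name: sd_a[name] + u * (sd_b[name] - sd_a[name]) + v * (sd_c[name] - sd_a[name])
            for name in sd_a}

def accuracy_heatmap(model, test_loader, sd_a, sd_b, sd_c, n=21):
    # Evaluate accuracy on an n x n grid of (u, v) points in the (-0.5, 1.5) range.
    us = np.linspace(-0.5, 1.5, n)
    vs = np.linspace(-0.5, 1.5, n)
    grid = np.zeros((n, n))
    for i, v in enumerate(vs):
        for j, u in enumerate(us):
            model.load_state_dict(plane_point(sd_a, sd_b, sd_c, u, v))
            grid[i, j] = accuracy(model, test_loader)
    return grid  # grid.max() corresponds to "Best" in the plot titles
```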

  1. A, B, C trained with different random initializations. A, B, C are on different mountains, even though their accuracies are nearly the same. (figure: random)

  2. A, B, C trained with the same random initialization, but with different training data orders. Again, A, B, C are on different mountains, though the ravines between them are not as deep as in 1. (figure: same_init)

  3. A, B, C trained with the same initialization from a model with accuracy 0.339, but with different training data orders. (a) After 0.2 epoch: for the area inside A, B, C, all the models in it have similar or even slightly higher accuracy than A, B, C. So A, B, C are on the same mountain. (figure: same_e0b300)

(b) After 1 epoch: still similar to (a). (figure: same_e1)

(c) After 9 epochs: now, although the area inside A, B, C still has good accuracy, A, B, C themselves are the highest. A, B, C are still on the same mountain, but on different high rocks. (figure: same_e9)

skydoorkai commented 5 years ago

Another experiment

Instead of training for high accuracy to find the highest mountain, we train the model toward the lowest accuracy to find the abyss, by changing the loss function from:

loss_fn = torch.nn.CrossEntropyLoss()
loss = loss_fn(output, target)

to:

softmax_fn = torch.nn.Softmax(dim=1)        # class probabilities p
nllloss_fn = torch.nn.NLLLoss()
softmax_out = softmax_fn(output)
softmax_out = torch.add(softmax_out, -1.0)  # p - 1
log_out = torch.log(-softmax_out)           # log(1 - p)
loss = nllloss_fn(log_out, target)          # -log(1 - p_target); minimizing drives p_target toward 0

A, B, C are trained with different random initializations and different training data orders. (figure: low_heatmap)

Also shown is the accuracy of the models obtained by linear interpolation between A and B, M(x) = A + x * (B - A), with x in the (-1, 2) range. (figure: m)

skydoorkai commented 5 years ago

From the experiments above, we can see that for CIFAR10 with ResNet18, any random initialization lands on the vast plain with accuracy = 0.1 (chance level for 10 classes).

If starting from the plain (acc = 0.1), then with the randomness in the training data order, each worker will climb a different mountain.

If the starting point of all the workers is already on a mountain (such as at acc = 0.339), they will climb the same mountain at a similar pace (the accuracy is determined by the amount of training data processed, until the worker can climb no higher).

There are an infinite number of mountains, and also an infinite number of abysses. From the experiments, all the mountains from ResNet18 training have similar height. Probably ResNet18 is a good model, so there are no spurious local minima.