Open happynear opened 10 years ago
@happynear This is a good question, and probably a common one. One could of course tune gamma on the validation set, but that is tedious. We tried it, but soon came up with another way to implement our formulation and avoid overfitting.
If you look at our experiment configuration files, you can see that we adopted an early-stopping policy during training, i.e., we first train the network with DSN for a number of epochs (determined by validation), then discard all the companion losses and continue training the network with only the output loss.
Gamma is now implicitly and dynamically determined by the loss value reached at the time we stop early. Empirically, this is essential for DSN to achieve good performance.
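To make the schedule described above concrete, here is a minimal sketch in plain Python. It is not taken from the DSN code or configuration files: the names `companion_weight`, `total_loss`, and `stop_epoch` are hypothetical, and the switch epoch is assumed to have been chosen on a validation set beforehand.

```python
def companion_weight(epoch, stop_epoch, alpha):
    """Weight on one companion loss: alpha before the early stop, 0 afterwards.

    `stop_epoch` is a hypothetical, validation-chosen epoch at which all
    companion losses are discarded.
    """
    return alpha if epoch < stop_epoch else 0.0


def total_loss(output_loss, companion_losses, alphas, epoch, stop_epoch):
    """Output loss plus the weighted companion losses still active at `epoch`."""
    total = output_loss
    for loss_m, alpha_m in zip(companion_losses, alphas):
        total += companion_weight(epoch, stop_epoch, alpha_m) * loss_m
    return total


# Before the stop, companion losses contribute; afterwards only the output loss remains.
during = total_loss(1.0, [2.0, 4.0], [0.5, 0.25], epoch=3, stop_epoch=10)
after = total_loss(1.0, [2.0, 4.0], [0.5, 0.25], epoch=10, stop_epoch=10)
```

In a real framework the same effect would more likely be obtained by training in two stages with two network definitions, but the sketch shows the objective each stage optimizes.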
In @happynear's comment, \gamma is said to be set to prevent the hinge loss from being 0. However, from my point of view, \gamma is set to *make* the hinge loss of the hidden layers 0 (i.e., to make the gradient vanish), and \alpha_m does the same thing. But I don't understand the purpose of vanishing the gradient in the paper. Is it to speed up training, since part of the BP algorithm can be skipped? @s9xie
@happynear @zhangliliang Sorry, yes, it is not "preventing the hinge loss from being zero" but making it vanish. I assume that was a typo in the original question?
In our paper we have explained that: "This way, the overall goal of producing good classification of the output layer is not altered and the companion objective just acts as a proxy or regularization." Intuitively, we should emphasize the role of the overall loss during training, and this "early stop" policy can be a good way to avoid overfitting the lower layers to the local loss.
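The vanishing behaviour discussed here can be illustrated with a small sketch. It assumes the companion term has the thresholded form max(loss_m - \gamma, 0), which is how the thread describes it; the weight-norm part of the paper's term is left out for simplicity, and the function name `thresholded_companion` is hypothetical.

```python
def thresholded_companion(loss_m, gamma):
    """Return (value, d value / d loss_m) for max(loss_m - gamma, 0).

    Once the hidden-layer loss drops to gamma or below, both the term and
    its (sub)gradient are zero, so it no longer influences training.
    """
    if loss_m > gamma:
        return loss_m - gamma, 1.0
    return 0.0, 0.0


# Above the threshold the term is active; at or below it, it vanishes.
active = thresholded_companion(0.75, 0.25)
vanished = thresholded_companion(0.1, 0.25)
```

Under this assumption, dropping the companion losses at a chosen epoch plays a role similar to picking \gamma equal to whatever the companion loss happens to be at that moment.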
@s9xie I am working on implementing your method. So you mean that you don't explicitly use gamma, right? Actually, I am also curious about the other hyperparameter, alpha, which requires a search space that grows exponentially as the number of layers increases. In your paper you use a relatively small architecture, i.e., a 3-layer NN. How do you tune this hyperparameter?
Thanks,
In formulation (3), there is a factor \gamma. This parameter is set to prevent the hinge loss from being 0. However, I haven't found this parameter in the code. A loss of 0 is quite common in deep learning, and this phenomenon is usually called "overfitting". In deep learning, people usually use dropout to prevent the loss from reaching zero too early.