Closed njan-creative closed 4 years ago
GAP is an abbreviation of Global Average Pooling, so yes, this fragment corresponds to the GAP layer in the diagram. It is not listed when you print the model, because PyTorch prints only the modules registered in the model (i.e. print ignores the content of the forward function). Instead of x = F.adaptive_avg_pool2d(x, (1, 1)) you can equivalently use nn.AdaptiveAvgPool2d (and then the layer will be printed).
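A minimal check of that equivalence (the feature-map shape here is arbitrary, just for illustration):

```python
import torch
import torch.nn.functional as F
from torch import nn

x = torch.randn(2, 512, 7, 7)  # stand-in backbone output: (batch, channels, H, W)

# Functional form, as used inside forward() -- invisible to print(model)
a = F.adaptive_avg_pool2d(x, (1, 1))

# Module form; registered as a submodule, so it shows up in print(model)
gap = nn.AdaptiveAvgPool2d((1, 1))
b = gap(x)

assert torch.allclose(a, b)        # identical results
flat = a.view(a.size(0), -1)       # flatten to (batch, channels)
print(flat.shape)                  # torch.Size([2, 512])
```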
Backprop goes through all layers whose output depends on at least one trainable parameter, and the backbone is trainable.
Thanks for the reply.
I did not realize that. I was thinking that even though the weights are not changed, they still play a role in backpropagation, like a constant multiplier magnifying or reducing the effect of the loss.
One more question: how are the learning rate parameters chosen?
In the fastai library, they use a learning rate finder. Is it really helpful? Is there an easier way to search for the right learning rate?
The learning rate finder is described in the paper "Cyclical Learning Rates for Training Neural Networks" by Leslie Smith.
https://github.com/davidtvs/pytorch-lr-finder
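The core idea of that finder can be sketched by hand: raise the learning rate exponentially each iteration, record the loss, and look for the region where the loss drops fastest before blowing up. The model, data, and LR bounds below are placeholders, not values from the paper or the linked library:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 1)                    # toy model standing in for a real network
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

start_lr, end_lr, num_iter = 1e-5, 1.0, 100
gamma = (end_lr / start_lr) ** (1 / (num_iter - 1))  # multiplicative LR step

lrs, losses = [], []
for i in range(num_iter):
    lr = start_lr * gamma ** i
    for group in optimizer.param_groups:    # set the LR for this iteration
        group["lr"] = lr
    xb, yb = torch.randn(32, 10), torch.randn(32, 1)  # stand-in batch
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
    lrs.append(lr)
    losses.append(loss.item())

# A common heuristic (not a rule): pick an LR somewhat below the point
# where the loss curve was still decreasing steeply.
best = lrs[losses.index(min(losses))]
```

In practice you would plot `losses` against `lrs` on a log scale and eyeball the curve, which is what the linked library automates.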
I was thinking that even though the weights are not changed, they still play a role in backpropagation, like a constant multiplier magnifying or reducing the effect of the loss.
I don't know if I understand you correctly, but backpropagation is used to calculate gradients. If a layer, and the layers it depends on, don't require gradients, there's no point in calculating them. For example, the input doesn't require a gradient, so backprop doesn't go through it, but the gradients calculated for the layers obviously depend on the input.
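A small illustration (layer sizes arbitrary): with a frozen backbone and no trainable layers before it, autograd never populates the backbone's parameter gradients, even though the backbone's output still determines the head's gradients.

```python
import torch
from torch import nn

backbone = nn.Linear(8, 8)   # stand-in for a frozen feature extractor
head = nn.Linear(8, 2)       # trainable classifier head

# Freeze the backbone: its parameters no longer require gradients.
for p in backbone.parameters():
    p.requires_grad_(False)

x = torch.randn(4, 8)        # the input itself never requires grad
out = head(backbone(x))
out.sum().backward()

# Gradients were computed only for the trainable head.
assert all(p.grad is None for p in backbone.parameters())
assert all(p.grad is not None for p in head.parameters())
```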
How are the parameters for the learning rate chosen?
The learning rate finder can be useful. A similar method would be to try a few learning rates, train for ~100 iterations with each, and compare the loss afterwards. But neither method is perfect. Probably the only way to find a truly optimal learning rate is to launch many full experiments with different learning rates and select the best one, but this is very time consuming and only worth it if you have enough computing power and want to squeeze out the last percent from the model. I didn't do this, as I had many other ideas I wanted to try.

Some optimizers (e.g. Adam) make selecting a learning rate easier, as they are less sensitive to its value (compared to e.g. SGD with momentum). But keep in mind that SGD can be superior to Adam in terms of accuracy: the best training pipelines on e.g. ImageNet or COCO still use SGD because it performs better, so it's a trade-off between time spent on hyperparameter search and final accuracy. However, for practical uses, such an accuracy boost is often not very significant and may not be worth it.
Thanks for the reply. A few more questions. Sorry to disturb you again.
Is there any particular range we should search to get a good learning rate?
I had read that the first layer is Global Average Pooling so that images of any size can be given as input. Similarly, are there any general practices regarding the layers after the backbone? How should they be constructed, and how many layers should there be besides the single fully connected layer that maps the output features to the number of required classes? The one in this solution is slightly different from the previous solution that was quoted. https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/78109
Regarding the size of the embedding layer, is there any reason behind selecting 1024? In the previous solution that you quoted, it was 2048 and then 1024. https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/78109 Is there any reason for selecting this value?
Also, if I recall correctly, I had a few models with different neck depths and embedding sizes in my final ensemble (which didn't change anything in the end).
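To make the "neck" idea concrete, here is one common shape for the layers after the backbone, with the 1024 embedding size mentioned above. The exact layer order, dropout values, and channel counts are illustrative placeholders, not the solution's actual head:

```python
import torch
from torch import nn
import torch.nn.functional as F

# Placeholder sizes: a ResNet-style backbone with 2048 output channels,
# an embedding of 1024, and an arbitrary number of classes.
backbone_channels, embedding_size, num_classes = 2048, 1024, 28

head = nn.Sequential(
    nn.BatchNorm1d(backbone_channels),
    nn.Dropout(0.5),
    nn.Linear(backbone_channels, embedding_size),  # the "embedding" layer
    nn.ReLU(inplace=True),
    nn.BatchNorm1d(embedding_size),
    nn.Dropout(0.5),
    nn.Linear(embedding_size, num_classes),        # final classifier
)

features = torch.randn(4, backbone_channels, 7, 7)  # fake backbone output
x = F.adaptive_avg_pool2d(features, (1, 1)).view(features.size(0), -1)
logits = head(x)  # shape: (4, num_classes)
```

The pooled features are flattened to (batch, channels) before the head, so the same head works for any input image size.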
Thanks for the clear explanations.
Regarding the GAP in the image below: is the code below the GAP part in the image?
x = F.adaptive_avg_pool2d(x, (1, 1))
x = x.view(x.size(0), -1)
When I print the model, I think this is not listed there. Is it required to backprop through this one also?