maciej-sypetkowski / kaggle-rcic-1st

1st Place Solution for Kaggle Recursion Cellular Image Classification Challenge -- https://www.kaggle.com/c/recursion-cellular-image-classification/
MIT License

GAP layer #3

Closed njan-creative closed 4 years ago

njan-creative commented 4 years ago

Regarding the GAP layer in the image below: is the code below the GAP part shown in the image?

x = F.adaptive_avg_pool2d(x, (1, 1))
x = x.view(x.size(0), -1)

When I print the model, this part does not seem to be listed. Is it required to backprop through this one as well?

[image: model architecture diagram showing the GAP layer]

maciej-sypetkowski commented 4 years ago

GAP is an abbreviation of Global Average Pooling, so yes, this fragment corresponds to the GAP layer on the diagram. It is not listed when you print the model because PyTorch prints only the modules registered on the model (i.e. print ignores the content of the forward function). Instead of x = F.adaptive_avg_pool2d(x, (1, 1)) you can equivalently use nn.AdaptiveAvgPool2d (and then the layer will be printed). Backprop goes through every layer whose output depends on at least one trainable parameter, and the backbone is trainable.
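As a minimal sketch (not code from this repo), here is how the functional call and the nn.AdaptiveAvgPool2d module are equivalent, and why only the module form shows up in print(model):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two equivalent heads: only the module-based one is visible in print(model).
class HeadFunctional(nn.Module):
    def forward(self, x):
        x = F.adaptive_avg_pool2d(x, (1, 1))   # invisible to print(model)
        return x.view(x.size(0), -1)

class HeadModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((1, 1))  # listed by print(model)

    def forward(self, x):
        return self.pool(x).view(x.size(0), -1)

x = torch.randn(2, 512, 7, 7)
assert torch.allclose(HeadFunctional()(x), HeadModule()(x))
print(HeadModule())  # shows the AdaptiveAvgPool2d layer
```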

njan-creative commented 4 years ago

Thanks for the reply.

I did not realize that. I was thinking that even though the weights are not changed, they still play a role in backpropagation, like a constant multiplier magnifying or reducing the effect of the loss.

One more question: how are the learning rate parameters chosen?

In the fastai library, they use a learning rate finder. Is it really helpful? Is there an easier way to search for the right learning rate?

The learning rate finder is described in the paper "Cyclical Learning Rates for Training Neural Networks" by Leslie Smith.
https://github.com/davidtvs/pytorch-lr-finder

maciej-sypetkowski commented 4 years ago

I was thinking that even though the weights are not changed, they still play a role in backpropagation, like a constant multiplier magnifying or reducing the effect of the loss.

I don't know if I understand you correctly, but backpropagation is used to calculate gradients. If a layer, and the layers it depends on, don't require gradients, there's no point in calculating the gradient for it. For example, the input doesn't require a gradient, so backprop doesn't go through it, but the gradients calculated for the layers obviously depend on the input.
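As a toy illustration (an assumed example, not code from this repo): gradient flows through the parameter-free GAP because the trainable backbone sits before it, while the input itself gets no gradient:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 8, 3)            # trainable "backbone" layer
x = torch.randn(1, 3, 16, 16)        # input: requires_grad is False

feat = conv(x)
pooled = F.adaptive_avg_pool2d(feat, (1, 1)).view(1, -1)  # parameter-free GAP
loss = pooled.sum()
loss.backward()

print(conv.weight.grad is not None)  # True  -- backprop reached the backbone through GAP
print(x.grad is None)                # True  -- input doesn't require a gradient
```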

How are the learning rate parameters chosen?

The learning rate finder can be useful. A similar method is to try a few learning rates, train for ~100 iterations each, and compare the loss afterwards. But neither method is perfect. Probably the only way to find a truly optimal learning rate is to launch many full experiments with different learning rates and select the best one, but this is very time-consuming and only worth it if you have enough computing power and want to squeeze out the last percent from the model. I didn't do this, as I had many other ideas I wanted to try. Some optimizers (e.g. Adam) make selecting the learning rate easier, as they are less sensitive to it (compared to e.g. SGD with momentum). But keep in mind that SGD can be superior to Adam in terms of accuracy: the best training pipelines on e.g. ImageNet or COCO still use SGD because it performs better, so it's a trade-off between time spent on hyperparameter search and final accuracy. However, for practical purposes, such an accuracy boost is often not very significant and may not be worth it.
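A rough sketch of the "try a few learning rates for ~100 iterations" approach; build_model and train_loader are placeholders for your own model factory and DataLoader, not part of this repo:

```python
import itertools
import torch
import torch.nn as nn

# Hypothetical helper: train each candidate learning rate for a fixed number of
# iterations on the same data and compare the final loss.
def quick_lr_sweep(build_model, train_loader, lrs=(1e-4, 3e-4, 1e-3, 3e-3, 1e-2),
                   iters=100, device="cuda"):
    criterion = nn.CrossEntropyLoss()
    results = {}
    for lr in lrs:
        model = build_model().to(device)
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        for x, y in itertools.islice(itertools.cycle(train_loader), iters):
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            opt.step()
        results[lr] = loss.item()  # loss after the last iteration for this lr
    return results

# Usage (illustrative): quick_lr_sweep(lambda: torchvision.models.resnet50(num_classes=1108), loader)
```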

njan-creative commented 4 years ago

Thanks for the reply. A few more questions. Sorry to disturb you again.

  1. Is there any particular range that we should search to get a good learning rate?

  2. I had read that the first layer after the backbone is global average pooling so that images of any size can be given as input. Similarly, are there any general practices regarding the layers after the backbone, i.e. how they should be constructed, or how many layers to use besides the single fully connected layer that maps the output features to the required number of classes? The one in this solution is slightly different from the previous solution that was quoted: https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/78109

  3. Regarding the size of the embedding layer, is there any reason behind selecting 1024? In the previous solution you quoted (https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/78109), it was 2048 and then 1024. Is there any reason for selecting this value?

maciej-sypetkowski commented 4 years ago
  1. No, I don't think so. It depends strongly on the optimizer, the architecture, the data, and the problem. For sure the lower limit is where the model doesn't change at all, and the upper limit is where the model diverges (i.e. the loss increases). The mapping from learning rate to accuracy should be more or less a bitonic function (i.e. it first increases and, after the optimal learning rate, decreases).
  2. Also, one more important reason for global pooling is that it reduces the size of the tensor containing the feature maps, so it's harder to overfit. Instead of global average pooling, global max pooling or generalized mean pooling (GeM) is sometimes used (I also experimented with those during the competition, including combining them at the same time). I'm not aware of any studies about the part after the global pooling (sometimes called the neck). In almost all image classification architectures, right after the pooling layer there is a linear layer that transforms the tensor directly into the output class logits. One of the reasons why I used a neck is that its beginning is the perfect place to concatenate the cell type. During the competition I did a few experiments and figured out that the depth of the neck should be around ~2 (see the sketch at the end of this comment).
  3. The same, just a few experiments. Smaller embeddings (e.g. 128) performed worse, and larger ones performed similarly.

Also, if I recall correctly, I had a few models with different neck depths and embedding sizes in my final ensemble (which didn't change anything in the end).
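For reference, a rough sketch of such a neck (illustrative only; the layer sizes, the cell-type embedding, and the class count are assumptions, not the exact code from this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a neck: global average pooling, concatenation of an embedded cell
# type, and two fully connected layers before the classification head.
class Neck(nn.Module):
    def __init__(self, backbone_channels=2048, num_cell_types=4,
                 embedding_size=1024, num_classes=1108):
        super().__init__()
        self.cell_type_emb = nn.Embedding(num_cell_types, 64)
        self.fc1 = nn.Linear(backbone_channels + 64, embedding_size)
        self.fc2 = nn.Linear(embedding_size, embedding_size)
        self.classifier = nn.Linear(embedding_size, num_classes)

    def forward(self, features, cell_type):
        x = F.adaptive_avg_pool2d(features, (1, 1)).flatten(1)    # GAP
        x = torch.cat([x, self.cell_type_emb(cell_type)], dim=1)  # concat cell type
        x = F.relu(self.fc1(x))
        emb = F.relu(self.fc2(x))                                 # embedding
        return self.classifier(emb), emb

feats = torch.randn(2, 2048, 8, 8)      # backbone output for a batch of 2
cell_type = torch.tensor([0, 2])
logits, emb = Neck()(feats, cell_type)
print(logits.shape, emb.shape)          # torch.Size([2, 1108]) torch.Size([2, 1024])
```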

njan-creative commented 4 years ago

Thanks for the clear explanations.