harshraj22 opened this issue 2 years ago
During the pretraining step, it is possible that one of the 4 crops for a data point contains a large digit, and the model's prediction on this crop dominates the predictions on the other 3 crops. This may inhibit the model from learning features corresponding to the small digits present in the image.
The idea is to calculate the loss for each crop separately (rather than taking the mean of the logits), and update the model using only the largest of these losses, since that is the crop on which the model performed worst. This is quite similar to what Hard Sample Mining does. The update will be noisy when one of the crops contains no digit at all (though such samples should be few). To reduce the noise further, one may instead use the second largest loss, and so on.
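A minimal sketch of what I have in mind, assuming the model returns class logits, a single label per sample, and cross-entropy as the pretraining loss (the function name and tensor shapes below are just illustrative, not from the current code):

```python
import torch
import torch.nn.functional as F

def hardest_crop_loss(model, crops, labels, k=1):
    """Keep only the k-th largest per-crop loss for each sample.

    crops:  tensor of shape (batch, num_crops, C, H, W)
    labels: tensor of shape (batch,)
    k=1 uses the largest loss; k=2 the second largest (less noisy), etc.
    """
    b, n = crops.shape[:2]
    # Run the model on every crop independently.
    logits = model(crops.flatten(0, 1))            # (b * n, num_classes)
    labels_rep = labels.repeat_interleave(n)       # (b * n,)
    per_crop = F.cross_entropy(
        logits, labels_rep, reduction="none"
    ).view(b, n)                                   # loss per (sample, crop)
    # Instead of averaging logits over crops, pick the k-th largest crop loss.
    top_losses, _ = per_crop.topk(k, dim=1)
    return top_losses[:, k - 1].mean()
```

With `k=2` the update ignores the single worst crop per sample, which should help when that crop happens to contain no digit.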