Closed: ethanyanjiali closed this issue 4 years ago
This is true -- the supervision is somewhat "sparse". Empirically, you do see most outputs staying very close to 0 most of the time, especially early in training. As training progresses, optimizing the loss requires pushing predictions closer and closer to 1 on confident pixels. If it helps, classification on ImageNet has some similarity: 1000-way classification has only one correct score, yet as training progresses networks learn to assign very high confidence to the correct predictions!
This paper was not the first to use the heatmap supervision approach, by the way. Earlier work may explain it in more detail.
Thanks for the reply! Multi-class classification might be easier to converge because every class gets activated by at least some images, and cross-entropy punishes misclassification more harshly. The MSE used in the paper doesn't seem to be as sensitive here.
For me, I ended up assigning more weight to foreground pixels and scaling up the Gaussian values. Did you see the loss value drop steadily without adding any tricks to the loss function?
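For concreteness, here is a minimal sketch of the foreground weighting described above, assuming NumPy; the function name and the weight value `fg_weight=10.0` are illustrative, not from the repo:

```python
import numpy as np

def weighted_mse(pred, target, fg_weight=10.0):
    # Pixels where the Gaussian target is nonzero count as foreground
    # and get a larger weight, so an all-zero prediction is penalized
    # more heavily than plain MSE would penalize it.
    weights = np.where(target > 0.0, fg_weight, 1.0)
    return np.mean(weights * (pred - target) ** 2)
```

With a mostly-zero target, this makes the trivial all-zero heatmap noticeably more expensive than under the unweighted loss.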
No problem. "cross-entropy punishes misclassification more" -- I meant this more as an analogy, and yes, it isn't a perfect comparison. Nevertheless, MSE still penalizes errors in both directions.
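A tiny numeric illustration of that symmetry (the arrays are made up for the example): overshooting a background pixel costs exactly as much as undershooting a foreground pixel by the same amount, so the network cannot escape the loss by collapsing to zero everywhere.

```python
import numpy as np

target = np.array([0.0, 1.0])           # one background, one foreground pixel
false_positive = np.array([0.5, 1.0])   # overshoots the background pixel
under_confident = np.array([0.0, 0.5])  # undershoots the foreground pixel

mse = lambda p, t: np.mean((p - t) ** 2)
print(mse(false_positive, target), mse(under_confident, target))  # both 0.125
```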
This loss function did train well as-is. Of course, it is always possible it could be modified to train faster!
Thanks. Maybe I should try running it longer with the vanilla MSE loss. I'm closing this issue now.
In the predicted heatmap, say 64x64, you have 4096 pixels in total, but only 7x7 pixels (or 9x9, depending on the implementation) are Gaussian foreground, and all the rest are zero. Without any balancing technique, the network could learn the trivial solution very quickly by producing an all-zero heatmap. I don't see anything addressing this in either the code or the paper -- would you mind sharing some thoughts on this? Thanks.
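To make the imbalance concrete, here is a sketch of the target generation being described, assuming a 7x7 patch (radius 3) and sigma 1; the function name and parameter values are illustrative:

```python
import numpy as np

def gaussian_heatmap(size=64, center=(32, 32), sigma=1.0, radius=3):
    # Draw a (2*radius+1)^2 Gaussian patch (7x7 for radius=3) around the
    # keypoint; every other pixel stays exactly zero, as in typical
    # heatmap supervision.
    hm = np.zeros((size, size), dtype=np.float32)
    cy, cx = center
    for y in range(max(0, cy - radius), min(size, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(size, cx + radius + 1)):
            hm[y, x] = np.exp(-((y - cy) ** 2 + (x - cx) ** 2)
                              / (2 * sigma ** 2))
    return hm

hm = gaussian_heatmap()
print((hm > 0).sum(), hm.size)  # 49 foreground pixels out of 4096
```

That is roughly 1.2% foreground, which is the sparsity the question is pointing at.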