No, the idea proposed in the paper is correct. It is the relative loss that matters: a hard example (p < 0.5) should contribute more to the loss, and an easy example (p > 0.5) should contribute less.

For example, with p = 0.3 for a hard example and p = 0.7 for an easy example:

gamma = 0.5:
hard_loss = (1 - 0.3)^0.5 = 0.7^0.5 ≈ 0.84
easy_loss = (1 - 0.7)^0.5 = 0.3^0.5 ≈ 0.55
hard_loss > easy_loss

gamma = 2.0:
hard_loss = (1 - 0.3)^2 = 0.7^2 = 0.49
easy_loss = (1 - 0.7)^2 = 0.3^2 = 0.09
hard_loss > easy_loss
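You can verify the arithmetic with a couple of lines of plain Python:

```python
# Quick numeric check of the two examples above: the modulating
# factor (1 - p)**gamma keeps hard_loss > easy_loss for any gamma > 0.
def focal_weight(p, gamma):
    return (1.0 - p) ** gamma

for gamma in (0.5, 2.0):
    hard = focal_weight(0.3, gamma)   # hard example, p = 0.3
    easy = focal_weight(0.7, gamma)   # easy example, p = 0.7
    print(f"gamma={gamma}: hard={hard:.3f}, easy={easy:.3f}, hard > easy: {hard > easy}")
```

In both cases the hard example keeps the larger weight; gamma only changes how big the gap between the two is.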
Here gamma controls the relative difference between the two losses. If the dataset is highly skewed, choose a higher value of gamma. (Check the figure from the paper referenced below; it shows how different gamma values impact the loss during training.)
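If you want to reproduce such curves yourself, a short script along these lines (assuming numpy and matplotlib are available) plots the focal loss -(1-p)^gamma * log(p) for several gamma values, in the spirit of Figure 1 of the paper:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 0.99, 200)                 # probability of the ground-truth class
for gamma in (0.0, 0.5, 1.0, 2.0, 5.0):
    loss = -((1.0 - p) ** gamma) * np.log(p)     # gamma = 0 is plain cross-entropy
    plt.plot(p, loss, label=f"gamma = {gamma}")

plt.xlabel("p (probability of the ground-truth class)")
plt.ylabel("focal loss")
plt.legend()
plt.show()
```

The larger the gamma, the harder well-classified examples (large p) are pushed toward zero loss.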
In sequence tasks like OCR/OMR we have to distribute p over many symbols, while the sum of probabilities remains 1 (sum(p) = 1). Therefore we select a lower gamma value (but greater than 0), as long as we maintain the relative loss.
In short, gamma (> 0) controls the relative loss we want to impose on easy and hard samples.
If a > b > 0, then a^n > b^n holds as long as n > 0.
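Putting this together for a CTC model: one common recipe is to recover a sequence-level p from the per-sample CTC loss as p = exp(-nll) and scale the loss by (1 - p)^gamma. Below is only a minimal PyTorch sketch under that assumption (argument shapes and the exp(-nll) mapping are assumptions, not code from this repository):

```python
import torch
import torch.nn.functional as F

def focal_ctc_loss(log_probs, targets, input_lengths, target_lengths, gamma=0.5):
    """CTC loss scaled by the focal modulating factor (1 - p)**gamma.

    log_probs:      (T, N, C) log-probabilities over the alphabet (incl. blank)
    targets:        concatenated label sequences
    input_lengths:  lengths of each input sequence, shape (N,)
    target_lengths: lengths of each target sequence, shape (N,)
    """
    # Per-sample negative log-likelihood from standard CTC.
    nll = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     reduction="none", zero_infinity=True)
    # p = exp(-nll) is the sequence probability; detach so the weight
    # itself does not receive gradients.
    p = torch.exp(-nll.detach())
    # Hard samples (small p) keep a large weight; easy samples are down-weighted.
    return ((1.0 - p) ** gamma * nll).mean()
```

Because p here is a product over the whole symbol sequence, it is usually much smaller than in single-label classification, which is exactly why a smaller (but positive) gamma is suggested above.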
Image Source: Focal Loss for Dense Object Detection. (https://arxiv.org/pdf/1708.02002.pdf)
In fact, if gamma is less than one, then for a larger p (which means an easy example) (1-p)^gamma results in a bigger weight for the CTC loss, which is different from the idea of focal loss.