No, it doesn't give equal weight to all the examples.
The focusing parameter γ (gamma) smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, focal loss is equivalent to categorical cross-entropy, and as γ is increased the effect of the modulating factor is likewise increased (γ = 2 works best in experiments).
α (alpha): balances the focal loss; the α-balanced form yields slightly improved accuracy over the non-α-balanced form.
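To make the effect of γ concrete, here is a minimal NumPy sketch of the binary case (illustrative only, not the code in this repo; `focal_term` and the probabilities are made up):

```python
import numpy as np

# Unweighted binary focal loss: FL(p_t) = -(1 - p_t)^gamma * log(p_t)
def focal_term(y_true, p_pred, gamma=2.0, eps=1e-7):
    p_pred = np.clip(p_pred, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, p_pred, 1.0 - p_pred)  # probability of the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

y = np.array([1, 1, 0])
p = np.array([0.95, 0.60, 0.05])    # easy positive, hard positive, easy negative
print(focal_term(y, p, gamma=0.0))  # gamma = 0: plain cross-entropy
print(focal_term(y, p, gamma=2.0))  # gamma = 2: the easy examples are down-weighted far more than the hard one
```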
I suggest you read the paper more carefully ;-)
In the paper, the α-balanced form is FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
I am saying that in the equation it is alpha_t, not alpha, meaning that alpha_t is different for each example rather than a constant. In the section above (balanced cross-entropy), alpha_t was also different for each example. I think they are saying that when we use the weighted focal loss we get slightly better accuracy.
Note: another thing I want to mention is that alpha = 1 and alpha = 0.25 shouldn't make any difference, because you are just scaling the loss function and the optimal weights of the model will be the same in both cases; so how can it give better accuracy?
For example, in the binary case alpha is the weighting factor: alpha for class 1 and 1 - alpha for class 0, so alpha balances the importance of positive/negative examples. So you only have to choose a single alpha value.
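To illustrate, a small sketch (assuming binary 0/1 labels; the variable names are just for illustration) of how a single scalar alpha becomes a per-example weight alpha_t:

```python
import numpy as np

# In the binary case a single scalar alpha gives a per-example weight alpha_t:
# alpha for positive examples, 1 - alpha for negative examples.
alpha = 0.25
y_true = np.array([1, 0, 0, 1])
alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
print(alpha_t)  # [0.25 0.75 0.75 0.25]

# The full alpha-balanced focal loss is then -alpha_t * (1 - p_t)^gamma * log(p_t).
```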
I saw the code and thought that you were multiplying the whole equation by alpha, but you are actually multiplying by alpha and 1 - alpha.
My bad!!
Thanks for the reply.
Can I define multiple alphas for a multi-class problem?
In the focal loss paper, it says:
In practice α may be set by inverse class frequency or treated as a hyperparameter to set by cross validation.
So for each class, I guess you compute its frequency in the training set and take the inverse.
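For example, a sketch with toy labels (assuming y_train is a 1-D array of integer class labels; note that in recent scikit-learn versions classes and y are keyword-only arguments):

```python
import numpy as np
from sklearn.utils import class_weight

y_train = np.array([0, 0, 0, 0, 1, 1, 2])  # toy labels

# Plain inverse class frequency: alpha_c = 1 / count(c)
classes, counts = np.unique(y_train, return_counts=True)
alphas = dict(zip(classes.tolist(), (1.0 / counts).tolist()))
print(alphas)  # {0: 0.25, 1: 0.5, 2: 1.0}

# sklearn's 'balanced' mode computes n_samples / (n_classes * count(c)),
# i.e. the same inverse-frequency weights up to a constant factor.
balanced = class_weight.compute_class_weight('balanced', classes=classes, y=y_train)
print(dict(zip(classes.tolist(), balanced.tolist())))  # {0: 0.583..., 1: 1.166..., 2: 2.333...}
```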
How do you set α by inverse class frequency? Is it like class_weights = dict(zip(np.unique(y_train), class_weight.compute_class_weight('balanced', np.unique(y_train), y_train)))?
From the paper, the alphas are weights for each example. So why is alpha = 0.25 kept fixed? Does this mean giving equal weight to all the examples?
I may be wrong, but this is what I understood from the paper.