jiawei-ren / BalancedMetaSoftmax-Classification

[NeurIPS 2020] Balanced Meta-Softmax for Long-Tailed Visual Recognition
https://github.com/jiawei-ren/BalancedMetaSoftmax

sample_per_class in balanced_softmax_loss #10

Closed milliema closed 3 years ago

milliema commented 3 years ago

Thanks for the awesome work! I have one question about the parameter sample_per_class. According to the paper, this should be the number of images in each class. If sample_per_class for certain classes is very high, then even after the log operation it may still be much larger than the prediction logit. In that case, in the operation "logits = logits + spc.log()", will spc overwhelm the logits? My concern is with face applications, where all the logits are within the range (-1, +1). Is balanced_softmax_loss applicable to my case?
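For context, the operation in question can be sketched as follows. This is a minimal NumPy version of the idea (shift each logit by the log of its class count, then apply cross-entropy); the repository's actual implementation operates on PyTorch tensors:

```python
import numpy as np

def balanced_softmax_loss(logits, labels, sample_per_class):
    """Balanced Softmax sketch: add log(class count) to each class logit,
    then compute the usual softmax cross-entropy."""
    shifted = logits + np.log(sample_per_class)
    # log-softmax with the max subtracted for numerical stability
    shifted = shifted - shifted.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```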

jiawei-ren commented 3 years ago

Thanks for the question.

In that case, in the operation of "logits = logits + spc.log()", will spc overwhelm logits

No. A very large sample_per_class will not overwhelm the logits as long as sample_per_class for the other classes is on a comparable scale. In the subsequent Softmax, e^(logit_0 + spc_0.log()) / (e^(logit + spc.log())).sum() equals e^(logit_0) / (e^(logit + (spc/spc_0).log())).sum(). There will be no overwhelming as long as (spc/spc_0).log(), i.e., the log of the ratio between two class frequencies, stays in a reasonable range.
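The equivalence above is easy to verify numerically: dividing both numerator and denominator by spc_0 leaves the softmax probabilities unchanged, so only the ratios spc/spc_0 matter, not the absolute counts. A quick check with made-up numbers:

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])
spc = np.array([100000.0, 500.0, 20.0])  # hypothetical class counts

# Form 1: shift every logit by log(spc)
p1 = np.exp(logits + np.log(spc))
p1 = p1 / p1.sum()

# Form 2: normalize counts by spc_0 -- only the ratios enter
p2 = np.exp(logits + np.log(spc / spc[0]))
p2 = p2 / p2.sum()

assert np.allclose(p1, p2)
```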

Nonetheless, it is still possible that a class has 1,000,000 times more samples than another class. However, since logits are defined in real number space (usually the output of a linear layer), a network can always learn to match the constant offset.

My concern is with face application, where all the logits are within range (-1, +1). Is balanced_softmax_loss applicable for my case?

Would you mind elaborating on this question, especially on the range of the logits? Is it a cosine similarity?

milliema commented 3 years ago

Thanks for the quick response!

A very large sample_per_class will not overwhelm the logit

I agree with your explanation. And if we divide both the numerator and the denominator by the total number of instances in the entire training set, then spc becomes the class frequency (0~1).

Would you mind elaborating on this question, especially on the range of the logits? Is it a cosine similarity?

Yes, it's actually a cosine classifier: in the FC operation we first normalize the input feature x and the weight matrix w, so that wi and xi have unit norm. Then the i-th output logit = wi·xi = |wi||xi|·cos(theta), which is within the range (-1, +1). But since spc is not a critical issue, as you explained above, I guess it does not matter.
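One practical note (an assumption about the face-recognition setup, not something established in this thread): cosine classifiers are typically followed by a scale factor s (e.g. s = 30 or 64) before the softmax, which restores the dynamic range of the logits relative to the log(spc) offset. A hypothetical illustration:

```python
import numpy as np

s = 30.0                                 # assumed cosine-logit scale factor
cos_logits = np.array([0.8, 0.1, -0.3])  # cosine similarities in (-1, +1)
spc = np.array([10000.0, 100.0, 100.0])  # hypothetical class counts

# Scaled cosine logits dominate the log-count offset:
# s * 0.8 = 24 vs. log(10000) ~ 9.2
probs = np.exp(s * cos_logits + np.log(spc))
probs = probs / probs.sum()
```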

jiawei-ren commented 3 years ago

Great! I will close the issue for now, do feel free to reopen the issue if you have any further questions.