Closed — seekingup closed this issue 3 years ago.
Hello, the main difference between the two kinds of Softmax is whether the logits are modified during the training stage or not. PC Softmax shows that the model can learn good representations even when it is exposed to an imbalanced distribution. Previous SOTA works, including Balanced Softmax, try to modify the training scheme to help representation learning; however, PC Softmax achieves comparable or even better performance on the long-tailed benchmarks simply by modifying the logits appropriately during inference.
Note that the gradient of PC Softmax is the same as that of the vanilla softmax (because the only difference comes in at inference time), but different from the gradient of Balanced Softmax.
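The gradient claim above can be checked with a small sketch (hypothetical numbers; `logS` stands for the log class priors): PC Softmax trains with vanilla CE, so its training gradient matches the vanilla softmax, while Balanced Softmax shifts the logits by `logS` before CE, which changes the gradient.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical log class priors (logS) for a 3-class problem.
logS = np.array([-0.1, -1.0, -3.0])
logits = np.array([0.0, 4.0, 1.0])
target = 1  # ground-truth class index
one_hot = np.eye(3)[target]

# Gradient of CE w.r.t. logits is softmax(z) - one_hot(target).
# PC Softmax trains with vanilla CE: gradient equals the vanilla softmax gradient.
grad_pc = softmax(logits) - one_hot

# Balanced Softmax shifts the logits by logS before CE: the gradient differs.
grad_balanced = softmax(logits + logS) - one_hot

print(np.allclose(grad_pc, grad_balanced))  # False: the training gradients differ
```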
Yes, the gradient is different. But my point is that even if the gradients differ, the output logits should be the same. The output `logit + logS` of Balanced Softmax and the output `logit` of PC Softmax are expected to be the same, given the same network and the same loss function (CE with modified logits).

For example, if `logS = [-0.1, -1.0, -3.0]` and Balanced Softmax learns the logits `[0, 4, 1]`, the output of Balanced Softmax would be `[-0.1, 3.0, -2.0]`. Correspondingly, PC Softmax will directly learn `[-0.1, 3.0, -2.0]`, because it shares the same loss function (CE) with Balanced Softmax. In this situation, Balanced Softmax seems to learn a residual logit of PC Softmax.
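The residual-logit view above can be verified numerically (a quick sketch with the same hypothetical numbers):

```python
import numpy as np

logS = np.array([-0.1, -1.0, -3.0])

# Balanced Softmax learns these raw logits ...
balanced_logits = np.array([0.0, 4.0, 1.0])
# ... so its effective output logits are logit + logS.
balanced_output = balanced_logits + logS

# PC Softmax is expected to learn the shifted values directly.
pc_logits = np.array([-0.1, 3.0, -2.0])

print(np.allclose(balanced_output, pc_logits))  # True: same effective logits
```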
I guess the performance difference comes from:
But anyway, PC Softmax is definitely more flexible than Balanced Softmax, because it only modifies the logits during inference. Thanks for your reply. ^_^
@seekingup I agree with your observation, especially on the third point; we were also a bit confused when the experiment results came out. Apparently, modifying the logits (e.g., Balanced Softmax, LADE) impacts performance more than we expected. I do want to emphasize the strength of the vanilla softmax itself, though! We were quite surprised that PC Softmax's performance was this good, without any bells and whistles in training. I think the research community can benefit from tweaking the vanilla softmax in various ways, like Temperature Scaling. We also encountered some randomness in the experiments, but haven't had the resources to explore it further (https://github.com/hyperconnect/LADE/issues/4). Further stabilizing the training procedure of LADE would be a great extension.
Hi, thanks for your inspiring work. I have a small question about PC Softmax after reading the paper.

In the paper, the logits of PC Softmax are `logit - logS + logT` (Eq. 4; written a bit casually, I hope you can understand it). Thus the model should be trained with standard CE, and `-logS + logT` is added during inference. For the proposed LADE (or Balanced Softmax when alpha = 0), the model is trained with `CrossEntropy(logit + logS)`. I compared Balanced Softmax below, taking the class-balanced test set as an example (where `logT = const`):
In this table, Balanced-Softmax and PC-Softmax should be equivalent. In my opinion, Balanced Softmax learns `logit + logS` and PC-Softmax learns `logit`, so they should be equal; the `logS` in Balanced Softmax acts like a "residual connection". However, there is some performance difference between them, as shown in the paper. My experiments on CIFAR100-LT and ImageNet-LT (ResNet50) also show a difference: Balanced Softmax is 0.9% higher on ImageNet-LT, but PC Softmax is 1.3% higher on CIFAR100-LT. That confused me... Could you please share some thoughts on the difference between the two kinds of Softmax? Looking forward to your reply~