Closed — seekingup closed this issue 3 years ago.
Hello, the main difference between the two kinds of Softmax is whether the logits are modified during the training stage or not. PC Softmax shows that the model can learn good representations even when it is exposed to an imbalanced distribution. Previous SOTA works, including Balanced Softmax, try to modify the training scheme to help representation learning; however, PC Softmax achieves comparable or even better performance on the long-tailed benchmarks simply by modifying the logits appropriately during inference.
Note that the gradient of PC Softmax is the same as that of the vanilla softmax (because the only difference comes in at inference time), but different from the gradient of Balanced Softmax.
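The gradient claim above can be checked with a small sketch (hypothetical numbers; `logS` stands for the log class priors): PC Softmax trains with vanilla CE, so its training gradient matches the vanilla softmax, while Balanced Softmax shifts the logits by `logS` before CE, which changes the gradient.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical log class priors (logS) for a 3-class problem.
logS = np.array([-0.1, -1.0, -3.0])
logits = np.array([0.0, 4.0, 1.0])
target = 1  # ground-truth class index
one_hot = np.eye(3)[target]

# Gradient of CE w.r.t. logits is softmax(z) - one_hot(target).
# PC Softmax trains with vanilla CE: gradient equals the vanilla softmax gradient.
grad_pc = softmax(logits) - one_hot

# Balanced Softmax shifts the logits by logS before CE: the gradient differs.
grad_balanced = softmax(logits + logS) - one_hot

print(np.allclose(grad_pc, grad_balanced))  # False: the training gradients differ
```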
Yes, the gradient is different. But my point is that even if the gradients differ, the output logits should be the same. The output `logit + logS` of Balanced Softmax and the output `logit` of PC Softmax are expected to be the same, given the same network and the same loss function (CE with modified logits).

For example, if `logS = [-0.1, -1.0, -3.0]` and Balanced Softmax learns the logits `[0, 4, 1]`, the output of Balanced Softmax would be `[-0.1, 3.0, -2.0]`. Correspondingly, PC Softmax will directly learn `[-0.1, 3.0, -2.0]`, because it shares the same loss function (CE) with Balanced Softmax. In this situation, Balanced Softmax seems to learn a residual logit of PC Softmax.
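The residual-logit view above can be verified numerically (a quick sketch with the same hypothetical numbers):

```python
import numpy as np

logS = np.array([-0.1, -1.0, -3.0])

# Balanced Softmax learns these raw logits ...
balanced_logits = np.array([0.0, 4.0, 1.0])
# ... so its effective output logits are logit + logS.
balanced_output = balanced_logits + logS

# PC Softmax is expected to learn the shifted values directly.
pc_logits = np.array([-0.1, 3.0, -2.0])

print(np.allclose(balanced_output, pc_logits))  # True: same effective logits
```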
I guess the performance difference comes from:
But anyway, PC Softmax is definitely more flexible than Balanced Softmax, because it only modifies the logits during inference. Thanks for your reply. ^_^
@seekingup I agree with your observation, especially on the third point; we were also a bit confused when the experiment results came out. Apparently, modifying the logits (e.g., Balanced Softmax, LADE) impacts performance more than we expected. I do want to emphasize the strength of the vanilla softmax itself, though! We were quite surprised that PC Softmax's performance was this good, without any bells and whistles in training. I think the research community can benefit from tweaking the vanilla softmax in various ways, like Temperature Scaling. We also encountered some randomness in the experiments, but haven't had the resources to explore it further (https://github.com/hyperconnect/LADE/issues/4). Further stabilizing the training procedure of LADE would be a great extension.
Hi, thanks for your inspiring work. I have a small question about PC Softmax after reading the paper.

In the paper, the logits of PC Softmax are `logit - logS + logT` (Eq. 4; written a bit casually, I hope you can understand it). Thus the model should be trained with standard CE, and `-logS + logT` is added during inference. For the proposed LADE (or Balanced Softmax when alpha = 0), the model is trained with `CrossEntropy(logit + logS)`. I compared Balanced Softmax below, taking the class-balanced test set as an example (where `logT = const`):
In this table, Balanced-Softmax and PC-Softmax should be equivalent. In my opinion, Balanced Softmax learns `logit + logS` and PC-Softmax learns `logit`, so they should be equal; the `logS` in Balanced Softmax acts like a "residual connection". However, there is some performance difference between them, as shown in the paper. My experiments on CIFAR100-LT and ImageNet-LT (ResNet50) also show a difference: Balanced Softmax is 0.9% higher on ImageNet-LT, but PC Softmax is 1.3% higher on CIFAR100-LT. That confused me... Could you please share some thoughts on the difference between the two kinds of Softmax? Looking forward to your reply~