uqzhichen / SDGZSL

[ICCV2021] Official Pytorch implementation for SDGZSL (Semantics Disentangling for Generalized Zero-Shot Learning)

Question about optimizing the discriminator #6

Closed tbw19970424 closed 2 years ago

tbw19970424 commented 2 years ago

I wanted to match the TC and Ldis terms described in your paper against the tc_loss in your code, but I am confused about the discriminator. Why is the output of the discriminator two-dimensional? Also, there is no log operation in your code for tc_loss.

uqzhichen commented 2 years ago

Hi Bowen,

Firstly, thank you so much for being interested in this work!

For the dimension of the discriminator: as you noticed, its output is two-dimensional, which makes the discriminator a two-class classifier, and you can also see that we use a cross-entropy loss to optimize it. To be more specific, we label the latent h as class 0 and the h after permutation as class 1, so when optimizing the discriminator, its output should be close to (1,0) when fed the original h and close to (0,1) when fed the permuted h. When optimizing the disentangling modules, however, we want to fool the discriminator, so the loss is (s_score[:, :1] - s_score[:, 1:]).mean(), which pushes the two outputs together (towards (0.5,0.5)). As for the log, it is omitted in the code.
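Here is a minimal sketch of this adversarial setup (the dimensions, the permute helper, and the discriminator architecture below are illustrative, not the exact code in this repo):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, dim_s, dim_n = 128, 32, 32

# Two-class discriminator over the concatenated latent [hs, hn].
disc = nn.Sequential(nn.Linear(dim_s + dim_n, 256), nn.ReLU(),
                     nn.Linear(256, 2))

def permute(hs, hn):
    # Draw from the product of marginals q(hs)q(hn) by shuffling one
    # factor across the batch, which breaks the hs-hn dependence.
    idx = torch.randperm(hn.size(0))
    return torch.cat([hs, hn[idx]], dim=1)

hs, hn = torch.randn(batch, dim_s), torch.randn(batch, dim_n)
h = torch.cat([hs, hn], dim=1)
h_perm = permute(hs, hn)

# (1) Discriminator step: original h is class 0, permuted h is class 1.
zeros = torch.zeros(batch, dtype=torch.long)
ones = torch.ones(batch, dtype=torch.long)
d_loss = 0.5 * (F.cross_entropy(disc(h.detach()), zeros) +
                F.cross_entropy(disc(h_perm.detach()), ones))

# (2) Disentangling step: fool the discriminator. The logit gap
# approximates the total-correlation term (with the log omitted).
s_score = disc(h)
tc_loss = (s_score[:, :1] - s_score[:, 1:]).mean()
```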

Feel free to contact me if you have more questions about this work!

Cheers, Zhi

tbw19970424 commented 2 years ago

Thank you so much for your meticulous explanation. SDGZSL is really good work and it inspires me a lot. I now understand the adversarial training phase, but I still have some questions about the training details.

First, I find that the MSE criterion in the VAE part uses reduction='sum', while in the disentangling AE part and the TC loss it uses reduction='mean'. As a result, the disentanglement losses are much smaller in scale than the ELBO term during training. Why is it designed this way?

Second, what if we compared against all attribute vectors when learning the relation network, instead of only those that appear in the batch?

Finally, when training the final classifier, why does the seen accuracy keep decreasing while the harmonic mean reaches its best within the first few epochs? Intuitively, if the synthesized unseen features are good enough, the final softmax classifier should converge in the last epochs. I also observed this phenomenon in CADA-VAE, but I don't know how to explain it.

uqzhichen commented 2 years ago

Hi Bowen,

For the MSE loss: training the disentangling modules is quite sensitive to the loss magnitude, so we used the mean reduction there.
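To see the scale gap, note that the two reductions differ by exactly the element count (a toy example with made-up feature sizes):

```python
import torch
import torch.nn.functional as F

x = torch.randn(64, 2048)          # e.g. a batch of visual features
x_rec = torch.randn(64, 2048)

loss_sum = F.mse_loss(x_rec, x, reduction='sum')    # grows with batch * dim
loss_mean = F.mse_loss(x_rec, x, reduction='mean')  # per-element average
print(loss_sum / loss_mean)        # tensor(131072.) = 64 * 2048 elements
```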

For the RelationNet, we actually tried comparing against all semantic vectors instead of only those in a single batch, but it ended up with worse generalization. Being exposed to a different subset of semantic vectors in each batch, the model converges to a more robust solution.
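Roughly, "comparing only within the batch" means something like this sketch (the relation network architecture and names here are illustrative):

```python
import torch
import torch.nn as nn

dim_hs, dim_sem = 32, 85
relation_net = nn.Sequential(nn.Linear(dim_hs + dim_sem, 128), nn.ReLU(),
                             nn.Linear(128, 1))

def batch_relation_scores(hs, semantics, labels):
    # Score each latent hs only against the semantic vectors of the
    # classes that appear in the current batch, not the full class set.
    batch_classes = torch.unique(labels)
    sem = semantics[batch_classes]                          # (C_b, dim_sem)
    pairs = torch.cat([hs.unsqueeze(1).expand(-1, sem.size(0), -1),
                       sem.unsqueeze(0).expand(hs.size(0), -1, -1)], dim=-1)
    return relation_net(pairs).squeeze(-1), batch_classes   # (B, C_b)

scores, classes = batch_relation_scores(torch.randn(64, dim_hs),
                                        torch.randn(50, dim_sem),
                                        torch.randint(0, 50, (64,)))
```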

For the final classifier, you can check the number of synthesized samples: it is much higher than the number of seen-class samples, which causes a class-imbalance issue when training the final classifier. At the beginning, the classifier has mostly been exposed to unseen-class samples, so the unseen metric is usually very high. Over more epochs it is exposed to more seen-class samples and becomes biased towards the seen classes. You may notice the same problem in end-to-end ZSL methods, which use a calibration rate to balance performance between seen and unseen classes.
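The calibration trick I mentioned is usually implemented as calibrated stacking: subtract a constant from the seen-class scores at prediction time. A minimal sketch (the gamma value and class split are made up):

```python
import torch

def calibrated_predict(logits, seen_mask, gamma=0.7):
    # Calibrated stacking: subtract a calibration rate gamma from the
    # seen-class scores so unseen classes are not systematically ignored.
    return (logits - gamma * seen_mask.float()).argmax(dim=1)

num_classes = 50
seen_mask = torch.zeros(num_classes, dtype=torch.bool)
seen_mask[:40] = True                       # e.g. the first 40 are seen
preds = calibrated_predict(torch.randn(8, num_classes), seen_mask)
```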

Feel free to add my WeChat (ThatChenZhi) if you want instant discussion.

Cheers, Zhi