mk-minchul / AdaFace

MIT License

Questions about GST visualization #24

Closed · abcsimple closed this 2 years ago

abcsimple commented 2 years ago

Hi Minchul, thanks for this incredible work! I have a few questions about the GST visualization in Fig. 3 of the main paper.

  1. For the CosFace loss (1st column of Fig. 3), the GST value decreases rapidly near the boundary. How do you adjust the GST value from W_j to the boundary B_1, and what is the value of s? I thought the result might be based on the last graph of Fig. 1 of the supplementary material, shifted by +0.5 (m=0.5) along the x-axis and scaled by -1 via (P-1) along the y-axis to get GST as a function of cos(theta) for CosFace, but that looks different from the 1st column of Fig. 3.

  2. For the ArcFace loss (2nd column of Fig. 3), we can see the GST increases as cos(theta) goes up. But according to Eq. 15, when cos(theta) goes up, |(P-1)| goes down while (cos(m)+...) goes up. How do you make sure the GST value is positively correlated with cos(theta)?

  3. The idea is to emphasize hard samples with high norm and easy samples with low norm. But for the AdaFace loss (7th column of Fig. 3), the white triangle (hard sample, low norm) still has a large GST value, which doesn't seem to make sense.

I'd appreciate it if you could help with these questions :p

mk-minchul commented 2 years ago

Hi abcsimple. Thanks for the question.

  1. Figure 3 of the main paper plots the absolute value of the GST term (Eq. 12). We plot the two-class case, where the logit (cosine of theta) varies for both the non-ground-truth class (j) and the ground-truth class (x). For CosFace, with j fixed at 0.5, s=64, and margin m, it would be the following: a) x_m = x - m, b) prob = np.exp(s * x_m) / (np.exp(s * x_m) + np.exp(s * j)), c) gst = (prob - 1) * s, where a) comes from the CosFace additive margin and c) comes from the gradient equation in supp. A.
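Steps a)-c) can be put together in a minimal NumPy sketch. s = 64 and j = 0.5 follow the reply; m = 0.5 matches the value from the question; the grid over x is purely illustrative.

```python
import numpy as np

# Minimal sketch of steps a)-c) above for CosFace (two-class case).
# s = 64 and j = 0.5 follow the reply; m = 0.5 matches the question.
s, m, j = 64.0, 0.5, 0.5

x = np.linspace(-1.0, 1.0, 201)   # cos(theta) for the ground-truth class
x_m = x - m                       # a) CosFace additive margin
# b) softmax probability of the ground-truth class
prob = np.exp(s * x_m) / (np.exp(s * x_m) + np.exp(s * j))
gst = (prob - 1.0) * s            # c) gradient scaling term (supp. A)
```

Plotting |gst| against x reproduces the CosFace curve: it is largest (near s) for hard samples far inside the boundary and decays toward 0 as x passes the boundary.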

  2. Since the GST term is the product of two terms, one must look at their combined effect. (P-1) makes sure that when P is close to 1, the GST term is close to 0. The second term for ArcFace is responsible for the scaling with respect to cos(theta). I thought an interactive exercise would be helpful here, so I prepared an interactive plot for you to play with: https://www.desmos.com/calculator/6hkxydcrqj
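The interplay of the two terms can also be checked numerically. A sketch under the same two-class setup (non-ground-truth logit fixed at j; s, m, j are illustrative values): the derivative of cos(theta + m) with respect to cos(theta) is cos(m) + sin(m) * cos(theta) / sqrt(1 - cos(theta)^2), i.e. the "(cos(m)+...)" term from Eq. 15.

```python
import numpy as np

# Sketch of the two interacting terms in the ArcFace GST (two-class case,
# non-ground-truth logit fixed at j). s, m, j are illustrative values.
s, m, j = 64.0, 0.5, 0.5

cos_t = np.linspace(-0.99, 0.99, 199)        # avoid the singularity at +/-1
# second term: d cos(theta + m) / d cos(theta), grows with cos(theta)
term2 = np.cos(m) + np.sin(m) * cos_t / np.sqrt(1 - cos_t**2)
logit = np.cos(np.arccos(cos_t) + m)         # ArcFace margin logit
prob = np.exp(s * logit) / (np.exp(s * logit) + np.exp(s * j))
gst = (prob - 1.0) * s * term2               # combined (signed) GST
# (P - 1) -> 0 as P -> 1 near cos(theta) = 1, so |gst| stays small there
# even though term2 keeps growing.
```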

  3. I would interpret the work as adaptively changing the margin function (as opposed to the margin itself). So it is not about emphasizing hard samples with high norm and easy samples with low norm. It is about putting more emphasis on hard samples when the feature norm is high (compare the black circle and black triangle in Fig. 3) and less emphasis on hard samples when the feature norm is low (compare the white circle and white triangle in Fig. 3).

abcsimple commented 2 years ago

Hi Minchul, thank you so much for the detailed explanation!

I've plotted a GST curve for AdaFace (z=1) in the same way you did here: https://www.desmos.com/calculator/kvtsanlbze. It seems the GST value increases rapidly when cos(theta) is close to 1, which can't be found in the last plot of Fig. 3 of the main paper. Can I say that the main idea is putting more emphasis on hard samples when the feature norm is high and less emphasis on hard samples when the feature norm is low, but some special regions can be ignored, such as the region where z=1 and cos(theta) is close to 1?
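For reference, here is the curve from my Desmos plot reproduced in NumPy. This is a sketch under my reading of the AdaFace margins: at z = 1 they reduce to g_angle = -m and g_add = 2m, so the margin logit becomes cos(theta - m) - 2m; the two-class setup (non-ground-truth logit fixed at j) and the values of s, m, j are illustrative.

```python
import numpy as np

# AdaFace GST at z = 1 (my reading: g_angle = -m, g_add = 2m).
# Two-class setup; s, m, j are illustrative values.
s, m, j = 64.0, 0.4, 0.5

cos_t = np.linspace(-0.99, 0.99, 199)
theta = np.arccos(cos_t)
logit = np.cos(theta - m) - 2.0 * m            # AdaFace logit at z = 1
dlogit = np.sin(theta - m) / np.sin(theta)     # d(logit) / d(cos theta)
prob = np.exp(s * logit) / (np.exp(s * logit) + np.exp(s * j))
gst = (prob - 1.0) * s * dlogit
# |gst| grows quickly as cos(theta) approaches 1 and the sign flips,
# matching the steep change visible in the Desmos plot.
```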

One more question: since the feature norm is obtained from a model trained on a face image dataset, the norm value should be related to face recognizability, which amounts to the same definition of image quality as in MagFace. The paper shows the correlation between the feature norm and the IQ score is around 0.5235 at the final epoch, demonstrating the relationship through correlation alone, without further explanation. I guess the correlation between the feature norm and the SER-FIQ score could be higher than 0.5235; did you ever try the correlation with SER-FIQ?

mk-minchul commented 2 years ago

Hi abcsimple.

  1. The steep change near 1 that you observe is interesting and might cause unintended behavior for AdaFace. The intended behavior is for the sign not to change near 1. Our recommended margin of 0.4 has a flat region around 1.
  2. We have not tried the SER-FIQ score. It might be interesting to see what the correlation turns out to be. As for the definition of image quality and how we approximate it with the feature norm, I agree that there is no strong connection other than the correlation. As for the interpretation of the feature norm, MagFace chose to interpret it as difficulty (face recognizability), while in AdaFace we chose to use it as a proxy for image quality. The subsequent algorithms therefore differ: MagFace deliberately enforces the feature space to align difficulty in both the angular-distance and feature-norm spaces, resulting in a cone-like shape. AdaFace views difficulty (defined by angular distance) as different from the feature norm, which is why we do not force the feature-norm space to be linearly related to the angular space. Instead we interpret it as image quality, which we use to control the relative importance of samples during training. Hope this clears up some of the questions :) And if you know of papers that provide a more rigorous explanation of the behavior of the feature norm under the softmax loss, please feel free to share them with me :)