They propose the MixFair Adapter to estimate identity bias (introduced by race, gender, or other individual differences). The method is based on a mixing strategy, though no reference for it is given. They assume each feature map consists of two terms, a bias-free representation and an identity bias: $f_i = r_i + b_i, \quad f_j = r_j + b_j$, where $r_i$ and $r_j$ are bias-free contour representations, while $b_i$ and $b_j$ are their corresponding identity biases.
Then, considering the mixed feature map of $f_i$ and $f_j$, $f_m = \frac{1}{2} (f_i + f_j)$, they show that when $f_i$ is a largely biased feature map (i.e., $|b_i| \gg |b_j|$), the output of a non-linear layer $M$ tends to preserve features more similar to those of $f_i$: $\cos(M(f_m), M(f_i))^2 - \cos(M(f_m), M(f_j))^2 = \epsilon > 0$, where $\cos$ denotes the cosine similarity function and $\epsilon$ is the bias difference. They then infer which of the two feature maps has the larger identity bias according to $\epsilon$.
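For concreteness, here is how I read that computation as a minimal PyTorch sketch. The architecture of `M` (Linear → ReLU → Linear) and the feature dimension are my assumptions; the paper only says $M$ is a non-linear layer.

```python
import torch
import torch.nn as nn

# Stand-in for the paper's non-linear layer M; this architecture is my
# assumption, not the authors' design.
M = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

def bias_difference(f_i: torch.Tensor, f_j: torch.Tensor) -> torch.Tensor:
    """epsilon = cos(M(f_m), M(f_i))^2 - cos(M(f_m), M(f_j))^2."""
    f_m = 0.5 * (f_i + f_j)  # mixed feature map
    cos = nn.functional.cosine_similarity
    return (cos(M(f_m), M(f_i), dim=-1) ** 2
            - cos(M(f_m), M(f_j), dim=-1) ** 2)
```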
They don't show evidence for this observation or inference, but it makes sense.
In the context of face recognition, the bias-free contour representations for a positive pair (same individual) would ideally point in a similar direction. Likewise, in kinship verification, a positive pair (same family) would share a similar direction; however, I think this assumption is harder to satisfy in that context because of the higher intra-class variance.
Nonetheless, here are my justifications for their observation (a quick numerical check follows the list):

- As the feature maps are supposedly similar for positive pairs, $f_m$ will be more similar to the feature map with the higher bias.
- Non-linear layers such as $M$ tend to amplify dominant features in their input while diminishing less significant ones, especially in the presence of activation functions like ReLU, which introduce sparsity. When $M$ is applied, it is likely to retain the orientation of the dominant feature map due to its higher magnitude, which explains why $\epsilon$ would be positive if $f_i$ is a largely biased feature map.
- The squared cosine similarity only accentuates the difference between the similarities, which highlights the bias disparity.
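These justifications can be sanity-checked numerically. The toy example below is mine, not the authors': $f_i$ carries a much larger bias term than $f_j$, and with a random ReLU layer $\epsilon$ comes out positive in most random draws.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 128
r = torch.randn(d)               # shared bias-free representation
f_i = r + 10.0 * torch.randn(d)  # large identity bias b_i
f_j = r + 0.1 * torch.randn(d)   # small identity bias b_j
f_m = 0.5 * (f_i + f_j)          # mixed feature map

M = nn.Sequential(nn.Linear(d, d), nn.ReLU())  # toy non-linear layer
with torch.no_grad():
    cos = nn.functional.cosine_similarity
    eps = cos(M(f_m), M(f_i), dim=0) ** 2 - cos(M(f_m), M(f_j), dim=0) ** 2
print(eps.item())  # positive in most random draws, matching the observation
```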
In the MixFair Adapter paragraph, they mention "two different identities". Is this layer not used for positive pairs?
By making $\epsilon \approx 0$, they claim both feature maps are not dominated by their own identity biases.
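Presumably this is enforced by penalizing $\epsilon$ during training. Below is a minimal sketch of what such a term could look like, assuming a simple absolute-value penalty; this is my guess at its shape, not the paper's exact loss, whose actual definition is linked below.

```python
import torch
import torch.nn as nn

def mixfair_penalty(f_i: torch.Tensor, f_j: torch.Tensor,
                    M: nn.Module) -> torch.Tensor:
    # Hypothetical debiasing term: drive the bias difference toward zero
    # so that neither feature map is dominated by its own identity bias.
    f_m = 0.5 * (f_i + f_j)
    cos = nn.functional.cosine_similarity
    eps = cos(M(f_m), M(f_i), dim=-1) ** 2 - cos(M(f_m), M(f_j), dim=-1) ** 2
    return eps.abs().mean()  # push epsilon toward 0 across the batch
```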
See the Loss Function section here, specifically the definition of $\epsilon$.