lxq1000 / SwinFace

Official PyTorch implementation of the paper, "SwinFace: A Multi-task Transformer for Face Recognition, Facial Expression Recognition, Age Estimation and Face Attribute Estimation"

How to interpret the results? #1

Open vellrya opened 1 year ago

vellrya commented 1 year ago

Thank you for the research you have done. I'm trying to figure out how the outputs of the neural network should be interpreted.

For example, I am processing pre-aligned images 1 (top) and 2 (bottom) via MTCNN and I get the following:

For the top picture: Smiling [ 0.5583181 -0.4514365] // first value > second; first is positive, second is negative
For the bottom picture: Smiling [ 1.9962342 -1.9506302] // same here: first value > second, first positive, second negative (if these were points, they would lie in the same coordinate quadrant)

[attached images: 1_cr, 2_cr]

What should be done with the two values in each group that the neural network produces? For example, is it possible to estimate that a person is smiling more in one photo than in another, or can the network only answer a yes/no question (and how should this be calculated from the two values)?

Also, the L2 distance between these faces (same person) is 1.2541978, which is quite large. For example, the distance between the two photos below (different people) is 1.1456171.

[attached images: 1_cr, 2_cr]

I understand that neural networks cannot be expected to be 100% accurate; I just noted this fact while experimenting.

uozyurt commented 7 months ago

I haven't managed to interpret the expression part yet.

But in my opinion, for the attributes, subtracting the second value from the first value and considering the attribute present if the difference is below a threshold works in my case (when the attribute is absent, this difference will be higher).

Moreover, if you use a different distance function (such as cosine distance) instead of L2 distance, the recognition part should work better.
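A minimal sketch of that comparison, assuming you already have the two recognition embeddings as vectors (the variable names and the 512-d size are assumptions, not the repo's actual API):

```python
# Unofficial sketch: cosine distance between two face embeddings.
# `emb1`/`emb2` stand in for the recognition features the model outputs;
# the 512-d size is an assumption, not taken from the SwinFace code.
import numpy as np

def cosine_distance(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """Cosine distance in [0, 2]; smaller means more similar faces."""
    sim = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
    return 1.0 - float(sim)

emb1 = np.random.rand(512)  # placeholder for a real embedding
emb2 = np.random.rand(512)
print(cosine_distance(emb1, emb2))
```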

vijay-progenesis commented 5 months ago

@ozy5, have you also found a way to interpret these results?

Attractive [ 0.30185086 -0.32830107]
Blurry [ 5.2648306 -5.168624 ]
Chubby [ 1.1931182 -1.1715137]
Heavy Makeup [ 4.1857896 -4.172186 ]
Gender [-2.0993235 3.6529565]
Oval Face [ 1.1746042 -1.1556231]
Pale Skin [ 4.8251357 -4.7965164]
Smiling [ 0.8127083 -0.83256656]
Young [0.39447686 0.88655436]
Bald [ 4.4368887 -4.233571 ]
Bangs [ 1.8084879 -1.8013161]
Black Hair [-1.4930688 1.4686595]
Blond Hair [ 8.256912 -8.360267]
Brown Hair [ 4.6946564 -4.5729666]
Gray Hair [ 6.8807154 -6.7179914]
Receding Hairline [ 2.5174565 -2.5719535]
Straight Hair [ 0.973975 -1.034498]
Wavy Hair [ 1.2737968 -1.3484771]
Wearing Hat [ 0.42250022 -0.4718365 ]
Arched Eyebrows [ 1.0773654 -1.0389736]
Bags Under Eyes [ 0.52596474 -0.35968357]
Bushy Eyebrows [-1.045578 0.99792176]
Eyeglasses [-0.35915434 0.3839317 ]
Narrow Eyes [ 0.965227 -0.9930098]
Big Nose [-0.82610035 0.73127556]
Pointy Nose [-0.83083063 0.8537577 ]
High Cheekbones [ 2.469345 -2.5633147]
Rosy Cheeks [ 3.9973772 -3.9047399]
Wearing Earrings [ 1.7754537 -1.8974233]
Sideburns [-1.7116044 1.7752024]
Five O'Clock Shadow [ 3.2272587 -3.257749 ]
Big Lips [ 0.9422828 -0.9017367]
Mouth Slightly Open [ 0.06529593 -0.02677619]
Mustache [ 2.8427913 -2.8438246]
Wearing Lipstick [-0.01849654 0.06970274]
No Beard [ 0.7101219 -0.72317696]
Double Chin [ 3.1949663 -3.3057327]
Goatee [ 1.6698171 -1.6297672]
Wearing Necklace [ 3.179865 -3.1018584]
Wearing Necktie [ 5.181036 -5.1268277]
Expression [-3.866186 -6.1388497 -7.3162904 -2.2862282 -1.2294844 -2.289443 3.3412638]

uozyurt commented 5 months ago

Warning: this is not the official interpretation of the outputs, and it may be incorrect, but I think you can make use of the model with the approach described below.

@vijay-progenesis I haven't figured out the expression part, but for the other attributes, you can subtract the second value from the first value.

Example:

Gender [-2.0993235 3.6529565]
Black Hair [-1.4930688 1.4686595]
Blond Hair [ 8.256912 -8.360267]

Make the operation (first value minus second value):

Gender: -2.0993235 - 3.6529565 = -5.7522800
Black Hair: -1.4930688 - 1.4686595 = -2.9617283
Blond Hair: 8.256912 - (-8.360267) = 16.617179

I interpret these results as:

Gender: large negative difference, so the second value dominates (Male)
Black Hair: negative difference, so the attribute is likely present
Blond Hair: large positive difference, so the attribute is likely absent

This "high" and "low" terms should be calibrated using a dataset highly related to your task. You should manually observe the outputs and determine the threshold to classify.

Additionally, maybe you can apply a sigmoid function to the difference and treat the result as a probability (just an idea); see the sketch below.
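A small sketch of this unofficial recipe (the attribute names and numbers are just the ones quoted above; nothing here comes from the repo itself):

```python
# Unofficial sketch: subtract the second logit from the first, then
# squash the reversed difference with a sigmoid to get a "Yes" score.
# Note: sigmoid(second - first) equals the 2-class softmax probability.
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

logits = {
    "Gender":     (-2.0993235, 3.6529565),
    "Black Hair": (-1.4930688, 1.4686595),
    "Blond Hair": ( 8.256912, -8.360267),
}

for name, (first, second) in logits.items():
    diff = first - second            # low/negative -> attribute likely present
    p_yes = sigmoid(second - first)  # probability-like "Yes" score
    print(f"{name}: diff={diff:.3f}, p_yes={p_yes:.3f}")
```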

So, if interpreting the outputs in this way is correct, then for the outputs above, the attributes with a negative difference (Gender → Male, Young, Black Hair, Bushy Eyebrows, Eyeglasses, Big Nose, Pointy Nose, Sideburns, Wearing Lipstick) would come out as present, and all the others as absent.

vijay-progenesis commented 5 months ago

Thank you for the detailed explanation, @ozy5.

uozyurt commented 5 months ago

you're welcome ^^ @vijay-progenesis

lxq1000 commented 5 months ago

SwinFace provides logits for 40 binary attribute classification tasks, for example: Young [0.39447686 0.88655436]. These two values represent the logits for 'No' and 'Yes' respectively. During the inference stage, the softmax function needs to be applied to calculate the probabilities for 'No' and 'Yes'. Specifically, in the case of Gender, these two values represent the logits for 'Female' and 'Male' respectively. Similarly, SwinFace also provides logits for 7-class facial expression recognition. The seven facial expressions are Surprise, Fear, Disgust, Happy, Sad, Anger and Neutral. Softmax function is also needed to compute the probabilities for each class.
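For completeness, a minimal sketch of this inference step (reusing the numbers posted earlier in the thread; the variable names are illustrative):

```python
# Sketch of the official interpretation: softmax over each attribute's
# ['No', 'Yes'] logit pair, and over the 7 expression logits.
import torch
import torch.nn.functional as F

young = torch.tensor([0.39447686, 0.88655436])  # logits for ['No', 'Yes']
p_no, p_yes = F.softmax(young, dim=0)
print(f"Young: P(No)={p_no:.3f}, P(Yes)={p_yes:.3f}")

gender = torch.tensor([-2.0993235, 3.6529565])  # logits for ['Female', 'Male']
p_female, p_male = F.softmax(gender, dim=0)
print(f"Gender: P(Female)={p_female:.3f}, P(Male)={p_male:.3f}")

expressions = ["Surprise", "Fear", "Disgust", "Happy", "Sad", "Anger", "Neutral"]
expr_logits = torch.tensor(
    [-3.866186, -6.1388497, -7.3162904, -2.2862282, -1.2294844, -2.289443, 3.3412638])
probs = F.softmax(expr_logits, dim=0)
print(expressions[int(probs.argmax())], f"{probs.max():.3f}")  # -> Neutral
```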