Open vellrya opened 1 year ago
I haven't managed to interpret the expression part yet.
But in my opinion, for the attributes, subtracting the second value from the first and considering the attribute present if the difference is lower than a threshold works in my case (when the attribute is absent, the difference comes out higher).
Moreover, if you use a different distance function (such as cosine distance) instead of L2 distance, the recognition part should work better.
@ozy5, have you found a way to interpret these results also?
```
Attractive [ 0.30185086 -0.32830107]
Blurry [ 5.2648306 -5.168624 ]
Chubby [ 1.1931182 -1.1715137]
Heavy Makeup [ 4.1857896 -4.172186 ]
Gender [-2.0993235 3.6529565]
Oval Face [ 1.1746042 -1.1556231]
Pale Skin [ 4.8251357 -4.7965164]
Smiling [ 0.8127083 -0.83256656]
Young [0.39447686 0.88655436]
Bald [ 4.4368887 -4.233571 ]
Bangs [ 1.8084879 -1.8013161]
Black Hair [-1.4930688 1.4686595]
Blond Hair [ 8.256912 -8.360267]
Brown Hair [ 4.6946564 -4.5729666]
Gray Hair [ 6.8807154 -6.7179914]
Receding Hairline [ 2.5174565 -2.5719535]
Straight Hair [ 0.973975 -1.034498]
Wavy Hair [ 1.2737968 -1.3484771]
Wearing Hat [ 0.42250022 -0.4718365 ]
Arched Eyebrows [ 1.0773654 -1.0389736]
Bags Under Eyes [ 0.52596474 -0.35968357]
Bushy Eyebrows [-1.045578 0.99792176]
Eyeglasses [-0.35915434 0.3839317 ]
Narrow Eyes [ 0.965227 -0.9930098]
Big Nose [-0.82610035 0.73127556]
Pointy Nose [-0.83083063 0.8537577 ]
High Cheekbones [ 2.469345 -2.5633147]
Rosy Cheeks [ 3.9973772 -3.9047399]
Wearing Earrings [ 1.7754537 -1.8974233]
Sideburns [-1.7116044 1.7752024]
Five O'Clock Shadow [ 3.2272587 -3.257749 ]
Big Lips [ 0.9422828 -0.9017367]
Mouth Slightly Open [ 0.06529593 -0.02677619]
Mustache [ 2.8427913 -2.8438246]
Wearing Lipstick [-0.01849654 0.06970274]
No Beard [ 0.7101219 -0.72317696]
Double Chin [ 3.1949663 -3.3057327]
Goatee [ 1.6698171 -1.6297672]
Wearing Necklace [ 3.179865 -3.1018584]
Wearing Necktie [ 5.181036 -5.1268277]
Expression [-3.866186 -6.1388497 -7.3162904 -2.2862282 -1.2294844 -2.289443 3.3412638]
```
Warning: this is not the official interpretation of the outputs and may be incorrect, but I think you can still make use of the model with the approach described below.
@vijay-progenesis I haven't figured out the expression part, but for the other attributes you can subtract the second value from the first.
example:
Gender [-2.0993235 3.6529565]
Black Hair [-1.4930688 1.4686595]
Blond Hair [ 8.256912 -8.360267]

Make the operation (first value minus second value):
Gender: -2.0993235 - 3.6529565 ≈ -5.75
Black Hair: -1.4930688 - 1.4686595 ≈ -2.96
Blond Hair: 8.256912 - (-8.360267) ≈ 16.62

I interpret these results as: a low (strongly negative) difference means the attribute is present (Male, Black Hair), while a high difference means it is absent (not Blond).
These "high" and "low" thresholds should be calibrated on a dataset closely related to your task: manually observe the outputs and choose the threshold for classification.
Additionally, you could apply a sigmoid function to the difference and treat the result as a probability (just an idea).
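The sigmoid idea above can be sketched as follows. This is only a sketch of the heuristic (not an official interpretation); the logit pairs are copied from the output dump earlier in the thread. Note that for a pair of logits, sigmoid(first − second) is exactly the two-class softmax probability of the first class.

```python
# Sketch of the heuristic above (NOT the official interpretation):
# squash the difference (first logit - second logit) with a sigmoid.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# [first, second] logit pairs copied from the output dump above.
outputs = {
    "Gender":     [-2.0993235,  3.6529565],
    "Black Hair": [-1.4930688,  1.4686595],
    "Blond Hair": [ 8.256912,  -8.360267 ],
}

for name, (first, second) in outputs.items():
    p = sigmoid(first - second)
    # Near 0 -> attribute likely present, near 1 -> likely absent,
    # since sigmoid(first - second) equals the two-class softmax
    # probability of the first class.
    print(f"{name}: sigmoid(diff) = {p:.4f}")
```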
So, if interpreting the outputs in this way is correct, then:
Thank you for the detailed explanation, @ozy5.
you're welcome ^^ @vijay-progenesis
SwinFace provides logits for 40 binary attribute classification tasks, for example Young [0.39447686 0.88655436]: the two values are the logits for 'No' and 'Yes' respectively. During the inference stage, the softmax function needs to be applied to calculate the probabilities for 'No' and 'Yes'. In the specific case of Gender, the two values represent the logits for 'Female' and 'Male' respectively.

Similarly, SwinFace provides logits for 7-class facial expression recognition; the seven facial expressions are Surprise, Fear, Disgust, Happy, Sad, Anger and Neutral. The softmax function is likewise needed to compute the probability of each class.
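That inference step can be sketched in plain Python (no framework assumed); the logit values are copied from the output dump earlier in the thread:

```python
# Softmax over each ['No', 'Yes'] logit pair (['Female', 'Male'] for
# Gender) and over the 7 expression logits, per the author's description.
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Gender logits from the dump: [Female, Male]
p_female, p_male = softmax([-2.0993235, 3.6529565])
print(f"P(Male) = {p_male:.4f}")

expressions = ["Surprise", "Fear", "Disgust", "Happy", "Sad", "Anger", "Neutral"]
expr_logits = [-3.866186, -6.1388497, -7.3162904, -2.2862282,
               -1.2294844, -2.289443, 3.3412638]
probs = softmax(expr_logits)
best = max(range(len(probs)), key=lambda i: probs[i])
print(f"Predicted expression: {expressions[best]} (p = {probs[best]:.4f})")
```

For the dump above this predicts Male with high confidence and Neutral as the most likely expression.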
Thank you for the research you have done. I'm trying to figure out how the results of the neural network output should be interpreted.
For example, I am processing pre-aligned images 1 (top) and 2 (bottom) via MTCNN and I get the following:
For top picture: Smiling [ 0.5583181 -0.4514365] // first value > second; first is positive, second is negative
For bottom picture: Smiling [ 1.9962342 -1.9506302] // same here: first value > second; first is positive, second is negative (if these are points, they are in the same coordinate quadrant)
What should be done with the two values in each group that the network produces? Is it possible to estimate that a person is smiling more in one photo than in another, or can the network only answer a yes/no question (and how should that answer be computed from the two values)?
Also, the L2 distance between these faces (same person) is 1.2541978, which is quite large; for comparison, the distance between the two photos below (different people) is 1.1456171.

I understand that neural networks cannot be expected to be 100% accurate; I just noted this fact while experimenting.
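On the distance question, ozy5's suggestion to try cosine distance instead of L2 can be sketched like this; the embedding vectors below are made-up placeholders, not real SwinFace outputs:

```python
# Compare L2 and cosine distance between two face embeddings.
# Cosine distance is scale-invariant: it compares directions only,
# which often suits recognition embeddings better than raw L2.
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

emb1 = [0.1, 0.9, 0.3]  # hypothetical embedding for photo 1
emb2 = [0.2, 1.8, 0.6]  # same direction, twice the magnitude
print(l2_distance(emb1, emb2))      # nonzero: L2 penalizes magnitude
print(cosine_distance(emb1, emb2))  # ~0: identical direction
```

Normalizing embeddings to unit length before taking L2 distance has the same effect, since L2 on unit vectors is a monotonic function of cosine distance.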