timesler / facenet-pytorch

Pretrained Pytorch face detection (MTCNN) and facial recognition (InceptionResnet) models
MIT License

Bad performance for embedding distance - will training help? #138

Open j-adamczyk opened 3 years ago

j-adamczyk commented 3 years ago

I have used MTCNN to extract faces from a news video. I want to identify the face of Donald Trump. My images: https://www.dropbox.com/s/k137md6qyvu2kbf/images.zip?dl=0.

My approach:
1) Detect faces with MTCNN, both in the sample images and in the video.
2) Align the faces with dlib's 5-point landmark detector (since MTCNN "align" is just rotation with no alignment whatsoever).
3) Calculate embeddings with InceptionResNetV1; for the 10 sample images, calculate one average embedding.
4) For each face detected in the video, calculate its embedding and check its distance to the average Donald Trump embedding; if it's close enough, mark the face as D.T.
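Steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code used in this issue: the embeddings are toy 3-D lists standing in for real 512-D InceptionResNetV1 outputs, and the helper names and the `threshold=0.7` value are hypothetical and would need tuning on real data.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_embedding(embeddings):
    """Average several embeddings of one person, then re-normalize."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalize(mean)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_match(reference, candidate, threshold=0.7):
    """Hypothetical decision rule: accept if similarity reaches the threshold."""
    return cosine_similarity(reference, candidate) >= threshold

# Toy stand-ins for embeddings of the sample images (assumed data).
samples = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.1]]
reference = mean_embedding(samples)

same_person = [0.85, 0.15, 0.05]   # toy embedding close to the samples
other_person = [0.0, 0.2, 0.9]     # toy embedding far from the samples

print(is_match(reference, same_person))
print(is_match(reference, other_person))
```

In a real pipeline the threshold would be chosen from similarity scores of known matching and non-matching pairs rather than fixed in advance.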

Problem: really bad performance. For example, compare the TV presenter's face and D.T.'s face in the Dropbox file: the cosine similarity to the average D.T. embedding is ~90-98% for both of them, and the L1 and L2 distances are also almost the same. This means that the embedding does not really distinguish between those faces.
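One possible reason the L2 distance tells the same story as cosine similarity: for unit-length vectors, the squared L2 distance equals 2 - 2 * cos, so both metrics rank faces identically. The sketch below checks that identity on toy vectors. It assumes the embeddings are L2-normalized, which is worth verifying for the model in use; `l2_from_cosine` is a hypothetical helper name.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l2_from_cosine(cos_sim):
    """L2 distance between two unit vectors with the given cosine similarity."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos_sim))

# Toy unit vectors (assumed data, not real face embeddings).
a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([2.0, 1.0, 3.0])

cos_ab = dot(a, b)  # for unit vectors the dot product is the cosine similarity
print(cos_ab, l2_distance(a, b), l2_from_cosine(cos_ab))

# The similarity range reported above, converted to L2 distances.
print(l2_from_cosine(0.90), l2_from_cosine(0.98))
```

Under that assumption, switching between cosine similarity and L2 distance cannot change which face is ranked closer; only a different embedding or a learned decision rule can.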

Questions:
1) Will retraining the network for lower dimensionality, i.e. 128 instead of 512, help? The curse of dimensionality is probably really bad for so many dimensions.
2) Another idea I have: create a dataset with 10 embeddings of D.T. calculated with InceptionResNetV1 and 10 embeddings of random faces (e.g. from casia-webface), and train a classifier on those. Would this help, or is the network simply unable to discriminate between those faces?
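The second idea, a small binary classifier trained on top of fixed embeddings, might look like the sketch below. It is a plain logistic regression written with the standard library only, trained on toy 2-D vectors that stand in for real embeddings; the data, learning rate, and epoch count are all assumptions. Whether this helps on real embeddings is exactly the open question, and a 10-vs-10 training set is small enough that overfitting is a real risk.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability that embedding x belongs to the target person."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the mean logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy stand-ins: label 1 = target person, label 0 = random other faces.
target = [[0.9, 0.1], [0.8, 0.2], [0.95, 0.05]]
others = [[0.1, 0.9], [0.2, 0.8], [0.05, 0.95]]
X = target + others
y = [1] * len(target) + [0] * len(others)

w, b = train_logistic(X, y)
print(predict_proba(w, b, [0.85, 0.15]))  # unseen toy vector near the target
print(predict_proba(w, b, [0.15, 0.85]))  # unseen toy vector near the others
```

Holding out a few labeled faces from the video for evaluation would show whether the classifier separates the presenter from D.T. better than a fixed distance threshold does.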

AGenchev commented 3 years ago

I agree that 512-D is too much and that we need a lower-dimensional option. The InceptionResNets have also been improved since v1; Inception-ResNet-v2, for example, could be tried to improve lower-dimensional matching. Turning off face mirroring during training would probably allow higher accuracy as well, since some faces are not symmetrical.

balance231 commented 3 years ago

I think the author of this repo learned the features only by training on a classification task. The resulting feature logits are not discriminative, so it is hard to tell different classes apart by distance, especially for samples outside the classes seen during training. You would do better to train the network with triplet loss.
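For reference, the triplet loss mentioned here can be sketched for a single (anchor, positive, negative) triple as below. This is only the loss formula on toy 2-D vectors, using squared Euclidean distances; the `margin=0.2` value and the example vectors are illustrative assumptions, and a real training loop would also need triplet mining and backpropagation through the network.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is farther from the anchor than the positive
    by at least `margin` (in squared distance); positive otherwise."""
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )

anchor = [1.0, 0.0]          # toy embedding of the target person
positive = [0.9, 0.1]        # another toy embedding of the same person
easy_negative = [0.0, 1.0]   # different person, already far away
hard_negative = [0.8, 0.2]   # different person, embedded too close

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
print(easy_loss, hard_loss)
```

The hard negative is the case described in this thread: two different people whose embeddings sit almost on top of each other, which is what the loss penalizes.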