I can't achieve the benchmark accuracy, could somebody help?

peteryuX / arcface-tf2

ArcFace unofficial Implemented in Tensorflow 2.0+ (ResNet50, MobileNetV2). "ArcFace: Additive Angular Margin Loss for Deep Face Recognition" Published in CVPR 2019. With Colab.

MIT License


I can't achieve the benchmark accuracy, could somebody help? #38

Open GranMin opened 3 years ago

GranMin commented 3 years ago

[image: output19 42]

I use the same training and test datasets as you proposed, but the best result I've got so far is the one in the picture above. I used the SGD optimizer with lr = 0.1, 0.05, 0.01, 0.0001, 0.00001, one epoch per learning rate. When I saw the loss increasing rather than decreasing, I stopped training. The picture above is the test result at a training loss of 19.42. The picture below is the test result at a training loss of 21.15.

[image: output21 15]

Androsimus commented 3 years ago

@GranMin which backbone do you use? I tried MobileNetV2 and got results similar to the author's. I used a constant learning rate of 0.01 for about 10 epochs and reached a loss of about 10. I think you should first try a schedule like mine and decrease the lr only after that (and maybe not so fast). Otherwise your model doesn't have enough time to use the relatively large gradient steps to bring the loss down.
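The schedule suggested above (hold a constant rate first, decay only later) can be sketched as a plain function. This is a minimal sketch, not the repository's training code: the name `lr_schedule` and the parameters `hold_epochs`, `decay` and `decay_every` are hypothetical, and only the values 0.01 and "about 10 epochs" come from the comment.

```python
def lr_schedule(epoch, lr=None, base_lr=0.01, hold_epochs=10,
                decay=0.1, decay_every=5):
    """Return the learning rate for a zero-based epoch index.

    The rate stays at base_lr for the first hold_epochs epochs and is then
    multiplied by `decay` once every `decay_every` epochs. The unused `lr`
    argument only keeps the (epoch, lr) call shape that
    tf.keras.callbacks.LearningRateScheduler passes to its schedule function.
    """
    if epoch < hold_epochs:
        return base_lr
    # Number of decay steps applied so far (the first one at epoch == hold_epochs).
    steps = (epoch - hold_epochs) // decay_every + 1
    return base_lr * decay ** steps


# Print the rate for the first 20 epochs of this assumed schedule.
for epoch in range(20):
    print(epoch, lr_schedule(epoch))
```

By contrast, the schedule in the opening post drops the rate from 0.1 to 0.00001 within five epochs, which leaves very little time at a rate large enough to move the loss. Whether the repository's training script accepts a schedule function like this is not stated in the thread.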

GranMin commented 3 years ago

@Androsimus I used ResNet50. I will try it and reply as soon as I can. Thanks for your advice.

GranMin commented 3 years ago

Latest update: I used ResNet50 with the SGD optimizer and lr=0.01. After 30 epochs I reached 99.08 on LFW and 92 on AgeDB-30, as shown below.

Androsimus commented 3 years ago

@GranMin Maybe 30 epochs is too many and you got some overfitting? Do you have earlier checkpoints from epochs 10-15? If so, try evaluating them on the validation datasets.

GranMin commented 3 years ago

@Androsimus I also have the feeling that it was too many epochs. But the loss at epochs 10-15 is about 20, and the accuracy on LFW is about 98.

Androsimus commented 3 years ago

@GranMin This is very strange. Maybe you changed some other parameters, such as the ArcFace margin or scale? There are other issues where people reported good results using ResNet50 on this repository's own dataset.

GranMin commented 3 years ago

@Androsimus I didn't change any other parameters. I also tried NormHead for one epoch and then switched to ArcHead. Amazingly, after just one epoch with ArcHead the loss came down to about 11. But then the same thing happened: the loss increases a little at the beginning of each epoch and then decreases, but very slowly. Like this:

Androsimus commented 3 years ago

@GranMin I'm not sure how NormHead is supposed to be used. Maybe as a warmup. But NormHead and ArcHead are completely different. As far as I understand, NormHead is for ordinary classification. If so, the classification problem is much easier, which is why you reach a low loss much faster. With ArcHead, with its margin and scale parameters and its different concept, the classification problem becomes harder. But I don't understand your situation: on the one hand you reported a loss of ~20 at epochs 10-15, on the other hand after the first epoch you had a loss below 11 and then it decreased slowly...

To sum up: I used strictly ArcHead, and I suppose other people did too. So I suggest trying ArcHead only.

GranMin commented 3 years ago

@Androsimus In fact, I trained the model twice recently. The first time, I used only ArcHead and trained with SGD at lr=0.01. After 10-15 epochs the loss was ~20. In the end I trained for about 30 epochs to reach a loss of ~6.5 and an accuracy of 99.06 on LFW. The second time, I used NormHead for one epoch and then changed to ArcHead. After one epoch with ArcHead the loss was down to 10. But as I described in my last comment, progress then slowed down. As for the mathematical difference between the two heads: NormHead just uses softmax to get the classification right, so it does not work so well at the boundary between classes, while ArcHead forces an angle (theta) between two classes so that they do not sit right next to each other.
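The difference described above can be made concrete with the formulation from the ArcFace paper, where a margin m is added to the angle between an embedding and its own class center before the cosine is scaled by s. The following is a minimal sketch under stated assumptions, not the repository's ArcHead: the function names and the example cosines are made up, s = 64 and m = 0.5 are the paper's defaults, and the case where the shifted angle exceeds pi is ignored.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one sample, computed with a stable log-sum-exp."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def cosine_logits(cosines, label, scale=64.0, margin=0.0):
    """Scale per-class cosines; add `margin` (radians) to the target-class angle."""
    logits = []
    for j, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding errors
        if j == label and margin:
            c = math.cos(math.acos(c) + margin)
        logits.append(scale * c)
    return logits


# Made-up cosines between one embedding and three class centers; class 0 is correct.
cosines = [0.7, 0.5, 0.1]
label = 0

plain = softmax_cross_entropy(cosine_logits(cosines, label), label)
arc = softmax_cross_entropy(cosine_logits(cosines, label, margin=0.5), label)
print("loss without margin:", plain)
print("loss with margin:   ", arc)
```

For the same embedding, the loss with the margin is much larger than without it, because the sample now has to beat the other classes by an extra angular gap. That is consistent with the observation in the thread that the loss drops much faster with a plain softmax head than with ArcHead.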

Androsimus commented 3 years ago

@GranMin If you look at this post and https://github.com/peteryuX/arcface-tf2/issues/4#issuecomment-599015569 and the rest of that thread, you will see big differences from your results. Strange. Nevertheless, if you still have your old logs and checkpoints from training with ArcHead only, try to find a checkpoint that corresponds to a loss of about 8-9 and evaluate it on the validation datasets.

GranMin commented 3 years ago

Results: the first is for loss 9.16 and the second for loss 8.00.

Androsimus commented 3 years ago

@GranMin this is some mystery )

Androsimus commented 3 years ago

@GranMin could you post your *.yaml config file? Anyway, if you figure out the reason for that strange training behavior, please write about it.

Androsimus commented 3 years ago

@GranMin Here is one idea. For correct inference the model must be called as model(input, training=False). The author didn't do that for some reason. So you can try adding training=False in the perform_val function in modules/evaluations.py.
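The idea behind this suggestion is that layers such as batch normalization behave differently in training and inference mode. The following toy stand-in illustrates that difference in plain Python. It is an assumption-laden sketch, not Keras code and not the repository's evaluation function: `ToyBatchNorm` and the input values are hypothetical, and the thread does not establish whether the repository's evaluation actually ran in training mode.

```python
class ToyBatchNorm:
    """Toy 1-D batch norm: batch statistics in training, moving averages in inference."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training=False):
        if training:
            # Training mode: normalize with this batch's own statistics
            # and update the moving averages as a side effect.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            self.moving_mean = (self.momentum * self.moving_mean
                                + (1 - self.momentum) * mean)
            self.moving_var = (self.momentum * self.moving_var
                               + (1 - self.momentum) * var)
        else:
            # Inference mode: use the stored moving averages, leave them untouched.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / (var + self.eps) ** 0.5 for x in batch]


bn = ToyBatchNorm()
batch = [10.0, 12.0, 14.0]
train_out = bn(batch, training=True)   # depends on the other samples in the batch
eval_out = bn(batch, training=False)   # depends only on the stored statistics
print(train_out)
print(eval_out)
```

The same input gives different outputs in the two modes, which is why the mode used during evaluation can change the embeddings. In Keras the corresponding call shape is the one quoted in the comment, `model(input, training=False)`.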

GranMin commented 3 years ago

@Androsimus Sorry for the long wait. Haha... I took a vacation to Jiuzhai Gou nature reserve last week. I talked to my teacher and learned that they had broadened my dataset with some Asian faces. I re-downloaded the dataset the author provided and got this result after just 2 epochs, using only the ArcFace head. I've decided to train ResNet152 from scratch next. And the experience of using softmax first may also help. Again, thank you for your advice. Best wishes, my friend!

GranMin commented 3 years ago

And after 5 epochs, this is the final result:

Androsimus commented 3 years ago

@GranMin glad you got nice results :) Best wishes!

xalbertoisorna commented 10 months ago

> @Androsimus In fact, I trained the model twice recently. The first time, I used only ArcHead and trained with SGD at lr=0.01. After 10-15 epochs the loss was ~20. In the end I trained for about 30 epochs to reach a loss of ~6.5 and an accuracy of 99.06 on LFW. The second time, I used NormHead for one epoch and then changed to ArcHead. After one epoch with ArcHead the loss was down to 10. But as I described in my last comment, progress then slowed down. As for the mathematical difference between the two heads: NormHead just uses softmax to get the classification right, so it does not work so well at the boundary between classes, while ArcHead forces an angle (theta) between two classes so that they do not sit right next to each other.

How do you change the head (NormHead to ArcHead)? I tried it but got this error: ValueError: Cannot assign value to variable ' conv2d/bias:0': Shape mismatch.The variable shape (24,), and the assigned value shape (32,) are incompatible.