Open GranMin opened 3 years ago
@GranMin Which backbone do you use? I tried MobileNetV2 and got results similar to the author's. I used a constant learning rate of 0.01 for about 10 epochs and reached a loss of about 10. I think you should first try a schedule similar to mine and decrease the lr only after that (and maybe not so fast). Otherwise your model doesn't have enough time to use relatively large gradient steps to decrease the loss.
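To make the suggestion concrete, here is a minimal, framework-free sketch of a "hold, then decay" schedule. Only the base rate of 0.01 and the roughly 10-epoch hold come from the comment above; the function name and the `decay` and `decay_every` values are hypothetical and would need tuning.

```python
def learning_rate(epoch, base_lr=0.01, hold_epochs=10, decay=0.5, decay_every=5):
    """Return the learning rate for a 0-based epoch index.

    base_lr and hold_epochs follow the comment above; decay and
    decay_every are illustrative assumptions, not tuned values.
    """
    if epoch < hold_epochs:
        # Constant phase: keep the large learning rate unchanged.
        return base_lr
    # Decay phase: one decay step right after the hold, then one more
    # every `decay_every` epochs.
    steps = (epoch - hold_epochs) // decay_every + 1
    return base_lr * decay ** steps


if __name__ == "__main__":
    for epoch in range(0, 20, 5):
        print(epoch, learning_rate(epoch))
```

The value returned for each epoch could be passed to whatever mechanism the training script uses to set the optimizer's learning rate.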
@Androsimus I used ResNet50. I will try it and reply soon. Thanks for your advice.
The latest update: I used ResNet50 with the SGD optimizer and lr=0.01. After 30 epochs I reached 99.08 on LFW and 92 on AgeDB-30, as shown below.
@GranMin Maybe 30 epochs is too many and you got some overfitting? Do you have earlier checkpoints from epochs 10-15? If so, try evaluating them on the validation datasets.
@Androsimus I also feel it was too many epochs. But at epochs 10-15 the loss is about 20, and the accuracy on LFW is about 98.
@GranMin This is very strange. Maybe you changed some other parameters? Maybe the ArcFace parameters, margin or scale? There are other issues where people reported good results using ResNet50 on this repository's default dataset.
@Androsimus I didn't change any other parameters. I also tried NormHead for one epoch and then switched to ArcHead. Amazingly, after just one epoch with ArcHead the loss came down to about 11. But then the same thing happened: the loss increases a little at the beginning of each epoch and then decreases, but very slowly. Like this:
@GranMin I'm not sure how NormHead is supposed to be used. Maybe as a warmup. But NormHead and ArcHead are completely different. As far as I understand, NormHead is for ordinary classification; if so, the classification problem is much easier, which is why you reach a low loss much faster. With ArcHead, with its margin and scale parameters and its different concept, the classification problem becomes harder. But I don't understand your situation: on the one hand you mentioned a loss of ~20 at epochs 10-15, on the other hand after the first epoch you had a loss <11 and then it decreased slowly...
To sum up: I used only ArcHead, and I suppose other people did too. So I suggest trying ArcHead only.
@Androsimus In fact, I recently trained the model twice. The first time, I used only ArcHead and trained at lr=0.01 with SGD; after 10-15 epochs the loss was ~20. In the end I trained for about 30 epochs to get a loss of ~6.5 and an accuracy of 99.06 on LFW. The second time, I used NormHead for one epoch and then changed to ArcHead; after one epoch with ArcHead the loss was down to 10. But as I described above, progress then slowed down. As for the mathematical difference between the two heads: NormHead just uses softmax to get the classification right, so it does not work so well at the boundaries between classes, while ArcHead forces an angular margin (theta) between classes so that two classes do not adjoin each other.
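The effect of the angular margin on the loss can be illustrated with a small, framework-free sketch. It assumes the ArcFace formulation in which the target-class logit is s * cos(theta + m), with m = 0.5 and s = 64 taken from the ArcFace paper (the repository's config may differ). The margin-free `plain_logit` is only a stand-in for a head without margin, not the repository's NormHead, and the cosine values are made up. The sketch also omits the special handling that real implementations apply when theta + m exceeds pi.

```python
import math


def plain_logit(cos_theta, scale=64.0):
    """Target-class logit without any margin: s * cos(theta)."""
    return scale * cos_theta


def arc_logit(cos_theta, margin=0.5, scale=64.0):
    """Target-class logit with an additive angular margin: s * cos(theta + m)."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp for numerical safety
    return scale * math.cos(theta + margin)


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def losses(cosines, target, margin=0.5, scale=64.0):
    """Return (loss without margin, loss with margin) for the same cosines."""
    plain = [plain_logit(c, scale) for c in cosines]
    arc = list(plain)
    arc[target] = arc_logit(cosines[target], margin, scale)  # margin on target only
    return cross_entropy(plain, target), cross_entropy(arc, target)


if __name__ == "__main__":
    # Made-up cosines between one embedding and three class centers.
    loss_plain, loss_arc = losses([0.5, 0.3, 0.2], target=0)
    print(loss_plain, loss_arc)
```

For the same embedding, the margin lowers the target-class logit, so the loss with margin is much larger until the embedding moves well inside its class region. That is consistent with the ArcHead loss staying high for longer than a plain softmax loss.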
@GranMin If you look at this post and https://github.com/peteryuX/arcface-tf2/issues/4#issuecomment-599015569 and the surrounding thread, you will see big differences from your results. Strange. Nevertheless, if you still have your old logs and checkpoints from training with only ArcHead, try to find a checkpoint that corresponds to a loss of about 8-9 and evaluate it on the validation datasets.
Results: the first is for loss 9.16 and the second for loss 8.00.
@GranMin this is some mystery )
@GranMin Could you post your *.yaml config file? Anyway, if you figure out the reason for that strange training behavior, please write about it.
@GranMin Here is one idea. For correct inference, the model should be called as
model(input, training=False)
The author didn't use it for some reason.
So you can try adding training=False in the perform_val function in modules/evaluations.py.
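Why the flag can matter: layers such as batch normalization behave differently in training and inference mode. Below is a toy, framework-free sketch of that difference. `ToyBatchNorm` is a hypothetical class written for illustration, not Keras code, and whether the repository's evaluation is actually affected is not established in this thread.

```python
import math


class ToyBatchNorm:
    """Toy 1-D batch normalization without learnable scale or offset."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, xs, training=False):
        if training:
            # Training mode: normalize with the statistics of this batch
            # and update the moving averages.
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Inference mode: use the stored moving statistics, so the output
            # for one sample does not depend on the rest of the batch.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in xs]


if __name__ == "__main__":
    bn = ToyBatchNorm()
    print(bn([2.0], training=False))            # same sample, alone
    print(bn([1.0, 2.0, 3.0], training=False))  # same sample, inside a batch
    print(bn([1.0, 2.0, 3.0], training=True))   # normalized by batch statistics
```

In inference mode the sample 2.0 gets the same output whether it is evaluated alone or inside a batch; in training mode its output depends on the other samples in the batch.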
@Androsimus Sorry for the long wait. Hahah... I took a vacation to the Jiuzhai Gou nature reserve last week. I talked to my teacher and learned that they had extended my dataset with some Asian faces. I re-downloaded the dataset the author provided and got this result after just 2 epochs, using only the ArcFace head. Next I plan to train ResNet152 from scratch. The trick of using softmax first may also help. Again, thank you for your advice~ Best wishes, my friend!
And after 5 epochs, this is the final result:
@GranMin glad you got nice results :) Best wishes!
How did you switch the head from NormHead to ArcHead? I tried it but got this error:
raise ValueError( ValueError: Cannot assign value to variable ' conv2d/bias:0': Shape mismatch.The variable shape (24,), and the assigned value shape (32,) are incompatible.
I use the same training and test datasets as you proposed, but the best result I've got so far is what the picture shows. I used the SGD optimizer with lr=0.1, 0.05, 0.01, 0.0001, 0.00001, one epoch per lr. When I found the loss increasing rather than decreasing, I stopped training. The upper picture shows the test result at a training loss of 19.42, and the lower picture shows the test result at a training loss of 21.15.