leondgarse / Keras_insightface

Insightface Keras implementation
MIT License

MobileNet model trained on CASIA dataset #116

Open AnujPanthri opened 1 year ago

AnujPanthri commented 1 year ago

I was trying to train a MobileNet model with ArcFace loss on the CASIA dataset, and I am unable to push the LFW accuracy above ~0.9763.

I want to know: is this the best accuracy I can get with this dataset (CASIA)? Training on MS1M is not possible for me, as it is really large.

Also, while checking your training scripts, I saw that you use large batch sizes and have run a lot of experiments. Didn't those take a lot of time? What hardware did you use to train them?

I am mainly using Google Colab and Kaggle for training.

Training code:

# losses, train, models are this repo's own modules
import losses, train, models
from tensorflow import keras

data_path = "faces_webface_112x112_112x112_folders"
eval_paths = ["faces_webface_112x112/lfw.bin"]

# MobileNet backbone producing a 256-d embedding, using the "E" output head
basic_model = models.buildin_models("MobileNet", dropout=0, emb_shape=256, output_layer="E")

tt = train.Train(data_path, save_path='mobilenet_256_adam_E.h5',
    eval_paths=eval_paths,
    basic_model=basic_model,
    batch_size=512, random_status=0,
    lr_base=0.001, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-5)

optimizer = keras.optimizers.Adam(learning_rate=0.001)
sch = [
    {"loss": losses.ArcfaceLoss(scale=16), "epoch": 20, "optimizer": optimizer},
]
tt.train(sch, 0)
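Since lr_decay is below 1, these settings select the repo's cosine learning-rate schedule with restarts, which explains the per-epoch values printed in the logs below, including the jump back to ~5e-4 at epoch 18. The following is only a sketch of that curve: the cycle doubling (t_mul=2) and the one-epoch cooldown at lr_min are assumptions read off the logs, not taken from the repo source.

import math

lr_base, lr_min, lr_decay, lr_decay_steps = 1e-3, 1e-5, 0.5, 16

def epoch_lr(epoch, t_mul=2, cooldown=1):
    # walk forward to the cycle containing `epoch` (0-indexed)
    cycle, start, steps = 0, 0, lr_decay_steps
    while epoch >= start + steps + cooldown:
        start += steps + cooldown
        steps *= t_mul          # assumed: each restart doubles the cycle length
        cycle += 1
    t = epoch - start
    if t >= steps:              # assumed: cooldown epoch held at lr_min
        return lr_min
    peak = (lr_base - lr_min) * lr_decay ** cycle   # each restart halves the peak
    return lr_min + peak * (1 + math.cos(math.pi * t / steps)) / 2

for e in [0, 1, 16, 17, 18]:
    print(f"epoch {e + 1}: lr = {epoch_lr(e):.6g}")
# epoch 1: 0.001, epoch 2: 0.000990489, epoch 17: 1e-05,
# epoch 18: 0.000505, epoch 19: 0.000503808 -- matching the logs below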

Training logs:

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/mobilenet/mobilenet_1_0_224_tf_no_top.h5
17225924/17225924 [==============================] - 0s 0us/step
>>>> L2 regularizer value from basic_model: 0
>>>> Init type by loss function name...
>>>> Train arcface...
>>>> Init softmax dataset...
>>>> Image length: 490623, Image class length: 490623, classes: 10572
>>>> Use specified optimizer: <keras.optimizers.adam.Adam object at 0x78131c6ffb20>
>>>> Add arcface layer, arc_kwargs={'loss_top_k': 1, 'append_norm': False, 'partial_fc_split': 0, 'name': 'arcface'}, vpl_kwargs={'vpl_lambda': 0.15, 'start_iters': -958, 'allowed_delta': 200}...
>>>> loss_weights: {'arcface': 1}
Epoch 1/20

Learning rate for iter 1 is 0.0010000000474974513, global_iterNum is 0
958/958 [==============================] - ETA: 0s - loss: 11.6348 - accuracy: 0.3298
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.43it/s]
>>>> lfw evaluation max accuracy: 0.951500, thresh: 0.476799, previous max accuracy: 0.000000
>>>> Improved = 0.951500
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_1_0.951500.h5
Epoch 1: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 413s 401ms/step - loss: 11.6348 - accuracy: 0.3298
Epoch 2/20

Learning rate for iter 2 is 0.000990488799288869, global_iterNum is 958
958/958 [==============================] - ETA: 0s - loss: 8.2079 - accuracy: 0.6521
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.69it/s]
>>>> lfw evaluation max accuracy: 0.964000, thresh: 0.400249, previous max accuracy: 0.951500
>>>> Improved = 0.012500
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_2_0.964000.h5
Epoch 2: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 284s 295ms/step - loss: 8.2079 - accuracy: 0.6521
Epoch 3/20

Learning rate for iter 3 is 0.0009623204241506755, global_iterNum is 1916
958/958 [==============================] - ETA: 0s - loss: 7.1694 - accuracy: 0.7361
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.69it/s]
>>>> lfw evaluation max accuracy: 0.972000, thresh: 0.351204, previous max accuracy: 0.964000
>>>> Improved = 0.008000
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_3_0.972000.h5
Epoch 3: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 7.1694 - accuracy: 0.7361
Epoch 4/20

Learning rate for iter 4 is 0.0009165775263682008, global_iterNum is 2874
958/958 [==============================] - ETA: 0s - loss: 6.5255 - accuracy: 0.7772
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.972000, thresh: 0.326500, previous max accuracy: 0.972000
>>>> Improved = 0.000000
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_4_0.972000.h5
Epoch 4: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 6.5255 - accuracy: 0.7772
Epoch 5/20

Learning rate for iter 5 is 0.0008550179190933704, global_iterNum is 3832
958/958 [==============================] - ETA: 0s - loss: 6.0505 - accuracy: 0.8044
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.974167, thresh: 0.336947, previous max accuracy: 0.972000
>>>> Improved = 0.002167
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_5_0.974167.h5
Epoch 5: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 6.0505 - accuracy: 0.8044
Epoch 6/20

Learning rate for iter 6 is 0.0007800072198733687, global_iterNum is 4790
958/958 [==============================] - ETA: 0s - loss: 5.6642 - accuracy: 0.8246
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.38it/s]
>>>> lfw evaluation max accuracy: 0.974000, thresh: 0.302503, previous max accuracy: 0.974167

Epoch 6: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 5.6642 - accuracy: 0.8246
Epoch 7/20

Learning rate for iter 7 is 0.0006944283377379179, global_iterNum is 5748
958/958 [==============================] - ETA: 0s - loss: 5.3341 - accuracy: 0.8415
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.975500, thresh: 0.289799, previous max accuracy: 0.974167
>>>> Improved = 0.001333
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_7_0.975500.h5
Epoch 7: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 282s 294ms/step - loss: 5.3341 - accuracy: 0.8415
Epoch 8/20

Learning rate for iter 8 is 0.0006015697144903243, global_iterNum is 6706
958/958 [==============================] - ETA: 0s - loss: 5.0437 - accuracy: 0.8557
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.36it/s]
>>>> lfw evaluation max accuracy: 0.976333, thresh: 0.280856, previous max accuracy: 0.975500
>>>> Improved = 0.000833
Saving model to: checkpoints/mobilenet_256_adam_E_basic_lfw_epoch_8_0.976333.h5

Epoch 8: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 5.0437 - accuracy: 0.8557
Epoch 9/20

Learning rate for iter 9 is 0.0005050000036135316, global_iterNum is 7664
958/958 [==============================] - ETA: 0s - loss: 4.7825 - accuracy: 0.8688
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.976000, thresh: 0.273528, previous max accuracy: 0.976333

Epoch 9: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 4.7825 - accuracy: 0.8688
Epoch 10/20

Learning rate for iter 10 is 0.0004084303218405694, global_iterNum is 8622
958/958 [==============================] - ETA: 0s - loss: 4.5492 - accuracy: 0.8802
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  4.93it/s]
>>>> lfw evaluation max accuracy: 0.974500, thresh: 0.257037, previous max accuracy: 0.976333

Epoch 10: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 282s 294ms/step - loss: 4.5492 - accuracy: 0.8802
Epoch 11/20

Learning rate for iter 11 is 0.0003155716694891453, global_iterNum is 9580
958/958 [==============================] - ETA: 0s - loss: 4.3433 - accuracy: 0.8900
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.55it/s]
>>>> lfw evaluation max accuracy: 0.974833, thresh: 0.279876, previous max accuracy: 0.976333

Epoch 11: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 293ms/step - loss: 4.3433 - accuracy: 0.8900
Epoch 12/20

Learning rate for iter 12 is 0.00022999268549028784, global_iterNum is 10538
958/958 [==============================] - ETA: 0s - loss: 4.1669 - accuracy: 0.8983
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.51it/s]
>>>> lfw evaluation max accuracy: 0.975833, thresh: 0.270799, previous max accuracy: 0.976333

Epoch 12: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 4.1669 - accuracy: 0.8983
Epoch 13/20

Learning rate for iter 13 is 0.00015498216089326888, global_iterNum is 11496
958/958 [==============================] - ETA: 0s - loss: 4.0252 - accuracy: 0.9050
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.974667, thresh: 0.240988, previous max accuracy: 0.976333

Epoch 13: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 284s 294ms/step - loss: 4.0252 - accuracy: 0.9050
Epoch 14/20

Learning rate for iter 14 is 9.342252451460809e-05, global_iterNum is 12454
958/958 [==============================] - ETA: 0s - loss: 3.9148 - accuracy: 0.9100
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.974333, thresh: 0.240347, previous max accuracy: 0.976333

Epoch 14: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 284s 295ms/step - loss: 3.9148 - accuracy: 0.9100
Epoch 15/20

Learning rate for iter 15 is 4.767959035234526e-05, global_iterNum is 13412
958/958 [==============================] - ETA: 0s - loss: 3.8422 - accuracy: 0.9131
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.63it/s]
>>>> lfw evaluation max accuracy: 0.974500, thresh: 0.249563, previous max accuracy: 0.976333

Epoch 15: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 282s 293ms/step - loss: 3.8422 - accuracy: 0.9131
Epoch 16/20

Learning rate for iter 16 is 1.95112716028234e-05, global_iterNum is 14370
958/958 [==============================] - ETA: 0s - loss: 3.8022 - accuracy: 0.9150
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.61it/s]
>>>> lfw evaluation max accuracy: 0.974500, thresh: 0.247305, previous max accuracy: 0.976333

Epoch 16: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 3.8022 - accuracy: 0.9150
Epoch 17/20

Learning rate for iter 17 is 1e-05, global_iterNum is 15328
958/958 [==============================] - ETA: 0s - loss: 3.7912 - accuracy: 0.9158
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.71it/s]
>>>> lfw evaluation max accuracy: 0.975000, thresh: 0.243934, previous max accuracy: 0.976333

Epoch 17: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 278s 289ms/step - loss: 3.7912 - accuracy: 0.9158
Epoch 18/20

Learning rate for iter 18 is 0.0005050000036135316, global_iterNum is 16286
958/958 [==============================] - ETA: 0s - loss: 4.3969 - accuracy: 0.8887
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.973667, thresh: 0.249403, previous max accuracy: 0.976333

Epoch 18: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 284s 295ms/step - loss: 4.3969 - accuracy: 0.8887
Epoch 19/20

Learning rate for iter 19 is 0.0005038082017563283, global_iterNum is 17244
958/958 [==============================] - ETA: 0s - loss: 4.3282 - accuracy: 0.8907
Evaluating lfw: 100%|██████████| 24/24 [00:04<00:00,  5.57it/s]
>>>> lfw evaluation max accuracy: 0.974333, thresh: 0.246381, previous max accuracy: 0.976333

Epoch 19: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 294ms/step - loss: 4.3282 - accuracy: 0.8907
Epoch 20/20

Learning rate for iter 20 is 0.0005002443795092404, global_iterNum is 18202
958/958 [==============================] - ETA: 0s - loss: 4.2424 - accuracy: 0.8941
Evaluating lfw: 100%|██████████| 24/24 [00:05<00:00,  4.70it/s]
>>>> lfw evaluation max accuracy: 0.973167, thresh: 0.235816, previous max accuracy: 0.976333

Epoch 20: saving model to checkpoints/mobilenet_256_adam_E.h5
958/958 [==============================] - 283s 295ms/step - loss: 4.2424 - accuracy: 0.8941
>>>> Train arcface DONE!!! epochs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], model.stop_training = False
>>>> My history:
{
  'lr': [0.0009905084734782577, 0.0009623592486605048, 0.000916633871383965, 0.0008550895727239549, 0.0007800916209816933, 0.0006945220520719886, 0.0006016692495904863, 0.0005051014595665038, 0.0004085298569407314, 0.0003156654420308769, 0.00023007707204669714, 0.00015505391638725996, 9.347893501399085e-05, 4.7718443966005e-05, 1.953109858732205e-05, 1.0000000656873453e-05, 9.999999747378752e-06, 0.0005038107046857476, 0.000500249327160418, 0.0004943501553498209],
  'loss': [11.634810447692871, 8.207895278930664, 7.169422149658203, 6.525516510009766, 6.050527572631836, 5.664212226867676, 5.334094047546387, 5.043704509735107, 4.7824506759643555, 4.549181938171387, 4.343286037445068, 4.166921138763428, 4.0251665115356445, 3.914809465408325, 3.842155694961548, 3.8022029399871826, 3.7911524772644043, 4.396862506866455, 4.328181743621826, 4.242437362670898],
  'accuracy': [0.32976415753364563, 0.6521174311637878, 0.7361018061637878, 0.7771826982498169, 0.8043939471244812, 0.8246305584907532, 0.8415114283561707, 0.855713427066803, 0.8687940239906311, 0.8802070021629333, 0.8900052309036255, 0.8982886672019958, 0.9050226807594299, 0.9099584817886353, 0.9130941033363342, 0.9150410890579224, 0.9158015847206116, 0.8887085914611816, 0.8907004594802856, 0.8940786719322205],
  'lfw': [0.9515, 0.964, 0.972, 0.972, 0.9741666666666666, 0.974, 0.9755, 0.9763333333333334, 0.976, 0.9745, 0.9748333333333333, 0.9758333333333333, 0.9746666666666667, 0.9743333333333334, 0.9745, 0.9745, 0.975, 0.9736666666666667, 0.9743333333333334, 0.9731666666666666],
  'lfw_thresh': [0.4767988324165344, 0.40024882555007935, 0.35120391845703125, 0.3265003561973572, 0.3369472026824951, 0.30250275135040283, 0.2897985875606537, 0.28085601329803467, 0.273528128862381, 0.2570366859436035, 0.2798755168914795, 0.2707985043525696, 0.24098770320415497, 0.2403470277786255, 0.24956296384334564, 0.24730539321899414, 0.2439337521791458, 0.2494034618139267, 0.24638071656227112, 0.23581618070602417],
}
>>>> Saving latest basic model to: checkpoints/mobilenet_256_adam_E_basic_model_latest.h5
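For reference, the printed history makes it easy to see where LFW accuracy plateaus (epoch 8, 0.976333) and how the epoch-18 restart sets both loss and accuracy back. A minimal matplotlib sketch, assuming hist is bound to the "My history" dict above:

import matplotlib.pyplot as plt

# hist = the "My history" dict printed above
epochs = range(1, len(hist["lfw"]) + 1)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(epochs, hist["loss"], marker="o")
ax1.set(xlabel="epoch", ylabel="train loss")
ax2.plot(epochs, hist["lfw"], marker="o")
ax2.set(xlabel="epoch", ylabel="LFW max accuracy")
fig.tight_layout()
plt.show()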
leondgarse commented 1 year ago

It could be rather hard to reach any satisfactory result using MobileNet + CASIA only, maybe even impossible... I was previously using an RTX 8000 with 46GB of GPU memory. Regarding your script, you may try:

basic_model = models.buildin_models("MobileNet", dropout=0, emb_shape=256, output_layer="E")

tt = train.Train(data_path, save_path='mobilenet_256_adam_E.h5',
    eval_paths=eval_paths,
    basic_model=basic_model,
    batch_size=512, random_status=0,
    lr_base=0.001, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-5)

optimizer = keras.optimizers.Adam(learning_rate=0.001, weight_decay=5e-4)
sch = [
    {"loss": losses.ArcfaceLoss(scale=16), "epoch": 10, "optimizer": optimizer},
    {"loss": losses.ArcfaceLoss(scale=32), "epoch": 10},
    {"loss": losses.ArcfaceLoss(scale=64), "epoch": 30},
]
tt.train(sch, 0)
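The scale schedule here (16 -> 32 -> 64) matters because ArcFace multiplies every cosine logit by the scale s before softmax cross-entropy: a small s early keeps the softmax soft and the gradients stable, while the final s=64 sharpens the decision boundary. A minimal sketch of the ArcFace logit transform (Deng et al.), for illustration only and not the repo's exact losses.ArcfaceLoss implementation:

import tensorflow as tf

def arcface_logits(embeddings, class_weights, labels, scale=64.0, margin=0.5):
    # cos(theta) between L2-normalized embeddings and class centers
    norm_emb = tf.math.l2_normalize(embeddings, axis=1)
    norm_w = tf.math.l2_normalize(class_weights, axis=0)
    cos_theta = tf.matmul(norm_emb, norm_w)                 # [batch, n_classes]
    theta = tf.acos(tf.clip_by_value(cos_theta, -1 + 1e-7, 1 - 1e-7))
    # additive angular margin on the target class only: cos(theta + m)
    one_hot = tf.one_hot(labels, depth=tf.shape(class_weights)[-1])
    logits = tf.where(one_hot > 0, tf.cos(theta + margin), cos_theta)
    # everything is multiplied by `scale` before softmax cross-entropy;
    # a larger scale makes the resulting softmax distribution much sharper
    return scale * logits

In effect, the schedule anneals the softmax temperature over training while the angular margin stays fixed.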

AnujPanthri commented 1 year ago

First of all, this repo has been really helpful to me, so thank you for it, and thank you for replying.

Wow, 46GB of VRAM is impressive; that is probably why you were able to use such large batch sizes.

I also feel the main bottleneck in my case is the CASIA dataset, as you got better results with MobileNet when it was trained on the MS1M dataset.