It's not working well. I've tried Glint360k; it can be trained, it just took longer, much longer. I also tried implementing the partial_fc strategy, but the training results are not satisfying. Will try it again once I've got some spare time.
OK, thank you! Do you suggest using their official implementation (https://github.com/deepinsight/insightface/tree/master/recognition/arcface_torch) to train on large datasets (glint360 or webface12M)? Did you try their PyTorch implementation (recognition/arcface_torch)?
Yes, this repo still hasn't replicated the partial_fc implementation. From some testing, the current partial_fc in this repo also works on a single GPU, but my training result is not good. I haven't tried their PyTorch implementation, just read some result comparisons. Currently, training without partial_fc on the Glint360k dataset takes almost 5 times as long as MS1MV3, while the total number of images is only 3 times larger. So, yeah, this is something that needs to be supported...
The partial_fc implementation is still hard for me, as it needs detailed control of how to shard / aggregate weights over multiple replicas. TensorFlow introduced DTensor in TF 2.9.0, and I think it's a key feature for implementing this, and also a key for training other large models / datasets. It's rather new to me and needs some learning and testing; I will try it and see if it's possible for partial_fc.
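For reference, a minimal DTensor sketch of the kind of sharding partial_fc needs: the classifier weight matrix is split along the class dimension, one shard per device. The mesh axis name `classes` and the shapes here are made-up placeholders, not anything from this repo:

```python
import tensorflow as tf
from tensorflow.experimental import dtensor  # TF >= 2.9

# Hypothetical mesh with 2 devices along an axis named "classes"
# (requires 2 visible devices; names / sizes are placeholders).
mesh = dtensor.create_mesh([("classes", 2)])

# A NormDense-style weight [embedding_dim, num_classes], sharded so
# each device holds only its own slice of the classes:
layout = dtensor.Layout([dtensor.UNSHARDED, "classes"], mesh)
w = dtensor.DVariable(
    dtensor.call_with_layout(tf.zeros, layout, shape=[512, 1000]))
```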
Thank you, but insightface implemented partial_fc in PyTorch; is that because PyTorch is more comfortable for the technique introduced in the partial_fc paper?
Though it's hard for me, I still believe TF has the same potential for implementing it, and it may be just as comfortable for someone more familiar with distribution strategies, writing custom gathering / training steps. As far as I can see, it's the output NormDense layer that may need a concatenate strategy when gathering weights across replicas, while currently only SUM / MEAN reductions are available.
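As a rough illustration of that gap, a minimal sketch assuming a MirroredStrategy and a toy per-replica weight shard (the shapes are placeholders, not this repo's NormDense): tf.distribute.ReduceOp only provides SUM and MEAN, while concatenating per-replica shards needs Strategy.gather:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Each replica holds a toy shard of a classifier weight matrix.
per_replica_w = strategy.run(lambda: tf.random.normal([512, 10]))

# Built-in reductions only sum or average across replicas:
summed = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_w, axis=None)

# Concatenating the shards into the full weight matrix instead
# needs Strategy.gather along the class dimension:
full_w = strategy.gather(per_replica_w, axis=1)  # [512, 10 * num_replicas]
```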
I'm not sure if you're still looking forward to this, but I just finished some basic training on Glint360k with EfficientNetV2S. Here are some results:
Environment: TF 2.6.3 + GPU RTX8000 with 45G memory.
Batch size 256. Without partialFC it's 680ms/step, 12.6hrs/epoch. With partialFC it's 595ms/step, 8.7hrs/epoch (fewer batches). Results with EfficientNetV2S + Glint360K + MagFace + 25 epochs:
Method | lfw | cfp_fp | agedb_30 | IJBB | IJBC |
---|---|---|---|---|---|
No PartialFC | 0.998500 | 0.992286 | 0.983667 | 0.958909 | 0.971212 |
PartialFC 4 | 0.998167 | 0.993000 | 0.983833 | 0.956378 | 0.969218 |
Method (TAR @ FAR) | 1e-06 | 1e-05 | 1e-04 | 1e-03 | 1e-02 | 1e-01 | AUC |
---|---|---|---|---|---|---|---|
IJBB, No PartialFC | 0.439435 | 0.923856 | 0.958909 | 0.969231 | 0.978286 | 0.985589 | 0.992529 |
IJBC, No PartialFC | 0.89528 | 0.956691 | 0.971212 | 0.978933 | 0.985172 | 0.99008 | 0.994988 |
IJBB, PartialFC 4 | 0.404284 | 0.92483 | 0.956378 | 0.970204 | 0.97887 | 0.987634 | 0.993442 |
IJBC, PartialFC 4 | 0.889042 | 0.955003 | 0.969218 | 0.979649 | 0.985274 | 0.990745 | 0.994939 |
Batch size 480. Using PartialFC 4, each split holds 90058 identities, similar to MS1MV3. This actually makes it possible to train with a larger batch_size; for RTX8000 it's 480. For MS1MV3 with batch_size=512, training speed is 867ms/step, 8781s/epoch, while for Glint360K using PartialFC 4, with 3 times the total images, training speed is 842ms/step, 23675s/epoch, also almost 3 times as long.
If you wanna give it a try, just pass partial_fc_split=4 to train.Train:

```python
tt = train.Train(..., partial_fc_split=4)
```
Actually it's a different implementation from the official one. For partial_fc_split=4, it splits all identities into 4 pieces, and generates training data in sequential order from each split, like batch_size * split_1, batch_size * split_2, batch_size * split_3, batch_size * split_4, batch_size * split_1, .... The model also switches its header accordingly. This makes it workable on a single GPU too, and for multi-GPU, data will still be distributed on the batch dimension.
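A minimal sketch of that batch ordering, using plain Python lists as a stand-in for the repo's actual dataset sharding (the function name and structure here are made up for illustration):

```python
import itertools

def split_batch_order(splits, batch_size):
    """Yield (split_id, batch), cycling split_1, split_2, ..., then wrapping."""
    # One repeating batch iterator per identity split:
    iterators = [
        itertools.cycle(
            [split[i:i + batch_size] for i in range(0, len(split), batch_size)]
        )
        for split in splits
    ]
    while True:
        for split_id, batches in enumerate(iterators):
            # A training loop would switch the classifier header
            # to `split_id` before consuming this batch.
            yield split_id, next(batches)
```

Each yielded split_id tells the training loop which sub-classifier the batch belongs to, matching the header switching described above.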
Oh, thank you, good job, I'm impressed. Could you release these pretrained models?
I put it here: TT_effv2_s_glint360k_mag_bs_256_test_random_0_E25_basic_model_latest.h5. Still trying some training, but I've hit some loss=NaN errors...
OK, it's been 1 month, and my r100 PReLU dropout 0.4 training using SGD + l2 regularizer + randaug + AdaFace on the Glint360K dataset with partial FC is finally finished! Now I can claim it reproduces the partialFC result. :)
Hi, do you have experience with large datasets in your framework (Keras_insightface)? For example, can I use your repository to train on glint360 or webface12M?