It's not working well. I've tried Glint360k; it can be trained, it just took longer, much longer. I also tried implementing the partial_fc strategy, but the training results are not satisfying. Will try it again once I've got some spare time.
OK, thank you! Do you suggest using their official implementation (https://github.com/deepinsight/insightface/tree/master/recognition/arcface_torch) to train on large datasets (glint360 or webface12M)? Did you try their PyTorch implementation (recognition/arcface_torch)?
Yes, this repo still hasn't replicated the partial_fc implementation. From some testing, the current partial_fc in this repo also works on a single GPU, but my training result is not good. I haven't tried their PyTorch implementation, just read some result comparisons. Currently, training without partial_fc on the Glint360k dataset takes almost 5 times as long as MS1MV3, while the total number of images is only 3 times larger. So, yeah, this is something that needs to be supported...
The partial_fc implementation is still hard for me, as it needs detailed control of how to shard / aggregate weights over multiple replicas. TensorFlow introduced DTensor in TF 2.9.0, and I think it's a key feature for implementing this, and also a key for training other large models / datasets. It's rather new to me and needs some learning and testing; I will try it and see if it's possible for partial_fc.
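For reference, a minimal DTensor sketch of the kind of sharding partial_fc needs: the classifier weight matrix is split along the class dimension, one shard per device. The mesh axis name `classes` and the shapes here are made-up placeholders, not anything from this repo:

```python
import tensorflow as tf
from tensorflow.experimental import dtensor  # TF >= 2.9

# Hypothetical mesh with 2 devices along an axis named "classes"
# (requires 2 visible devices; names / sizes are placeholders).
mesh = dtensor.create_mesh([("classes", 2)])

# A NormDense-style weight [embedding_dim, num_classes], sharded so
# each device holds only its own slice of the classes:
layout = dtensor.Layout([dtensor.UNSHARDED, "classes"], mesh)
w = dtensor.DVariable(
    dtensor.call_with_layout(tf.zeros, layout, shape=[512, 1000]))
```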
Thank you, but insightface implemented partial_fc in PyTorch; is that because PyTorch is more comfortable for the technique introduced in the partial_fc paper?
Though it's hard for me, I still believe TF has the same potential for implementing it, and it may be just as comfortable for someone more familiar with distribution strategies, writing custom gathering / training steps. As far as I can see, it's the output NormDense layer that may need a concatenate strategy when gathering weights across replicas, while currently only SUM / MEAN reductions are available.
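As a rough illustration of that gap, a minimal sketch assuming a MirroredStrategy and a toy per-replica weight shard (the shapes are placeholders, not this repo's NormDense): tf.distribute.ReduceOp only provides SUM and MEAN, while concatenating per-replica shards needs Strategy.gather:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Each replica holds a toy shard of a classifier weight matrix.
per_replica_w = strategy.run(lambda: tf.random.normal([512, 10]))

# Built-in reductions only sum or average across replicas:
summed = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_w, axis=None)

# Concatenating the shards into the full weight matrix instead
# needs Strategy.gather along the class dimension:
full_w = strategy.gather(per_replica_w, axis=1)  # [512, 10 * num_replicas]
```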
I'm not sure if you're still looking forward to this, but I just finished some basic training on Glint360k with EfficientNetV2S. Here are some results:
Environment: TF 2.6.3 + GPU RTX8000 with 45G memory.
Batch size 256. Without partialFC it's 680ms/step, 12.6hrs/epoch. With partialFC it's 595ms/step, 8.7hrs/epoch (fewer batches). Results with EfficientNetV2S + Glint360K + MagFace + 25 epochs:
Method | lfw | cfp_fp | agedb_30 | IJBB | IJBC |
---|---|---|---|---|---|
No PartialFC | 0.998500 | 0.992286 | 0.983667 | 0.958909 | 0.971212 |
PartialFC 4 | 0.998167 | 0.993000 | 0.983833 | 0.956378 | 0.969218 |
Method (TAR @ FAR) | 1e-06 | 1e-05 | 1e-04 | 1e-03 | 1e-02 | 1e-01 | AUC |
---|---|---|---|---|---|---|---|
IJBB, No PartialFC | 0.439435 | 0.923856 | 0.958909 | 0.969231 | 0.978286 | 0.985589 | 0.992529 |
IJBC, No PartialFC | 0.89528 | 0.956691 | 0.971212 | 0.978933 | 0.985172 | 0.99008 | 0.994988 |
IJBB, PartialFC 4 | 0.404284 | 0.92483 | 0.956378 | 0.970204 | 0.97887 | 0.987634 | 0.993442 |
IJBC, PartialFC 4 | 0.889042 | 0.955003 | 0.969218 | 0.979649 | 0.985274 | 0.990745 | 0.994939 |
Batch size 480. Using PartialFC 4, each split holds 90058 identities, similar to MS1MV3. This actually makes it possible to train with a larger batch_size; for RTX8000 it's 480. For MS1MV3 with batch_size=512, training speed is 867ms/step, 8781s/epoch, while for Glint360K using PartialFC 4, with 3 times the total images, training speed is 842ms/step, 23675s/epoch, also almost 3 times as long.
If you wanna give it a try, just pass partial_fc_split=4 to train.Train:

```python
tt = train.Train(..., partial_fc_split=4)
```
Actually it's a different implementation from the official one. For partial_fc_split=4, it splits all identities into 4 pieces, and generates training data in sequential order from each split, like batch_size * split_1, batch_size * split_2, batch_size * split_3, batch_size * split_4, batch_size * split_1, .... The model also switches its header accordingly. This makes it workable on a single GPU too, and for multi-GPU, data will still be distributed on the batch dimension.
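A minimal sketch of that batch ordering, using plain Python lists as a stand-in for the repo's actual dataset sharding (the function name and structure here are made up for illustration):

```python
import itertools

def split_batch_order(splits, batch_size):
    """Yield (split_id, batch), cycling split_1, split_2, ..., then wrapping."""
    # One repeating batch iterator per identity split:
    iterators = [
        itertools.cycle(
            [split[i:i + batch_size] for i in range(0, len(split), batch_size)]
        )
        for split in splits
    ]
    while True:
        for split_id, batches in enumerate(iterators):
            # A training loop would switch the classifier header
            # to `split_id` before consuming this batch.
            yield split_id, next(batches)
```

Each yielded split_id tells the training loop which sub-classifier the batch belongs to, matching the header switching described above.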
Oh, thank you, good job, I'm impressed. Could you release these pretrained models?
I put it here: TT_effv2_s_glint360k_mag_bs_256_test_random_0_E25_basic_model_latest.h5. Still trying some training, but I've hit some loss=NaN errors...
OK, it's been 1 month, and my r100 PReLU dropout 0.4 training using SGD + l2 regularizer + randaug + AdaFace on the Glint360K dataset with partial FC is finally finished! Now I can claim it reproduces the partialFC result. :)
Hi, do you have experience with large datasets in your framework (Keras_insightface)? For example, can I use your repository to train on glint360 or webface12M?