leondgarse / Keras_insightface

Insightface Keras implementation
MIT License

finetune pretrained model #112

Closed ak4ever87 closed 1 year ago

ak4ever87 commented 1 year ago

Hi, I am trying to fine-tune a pretrained MobileNet emb256 model on the MS1MV3 dataset using the following code:

tt = train.Train(data_path, save_path='check_mobile.h5', eval_paths=eval_paths,
    basic_model=None,
    model="pretrained_models/keras_mobilenet_emore_adamw_5e5_soft_baseline_before_arc_E80_BTO_E2_arc_sgdw_basic_agedb_30_epoch_119_0.959333.h5",
    lr_base=0.0001, lr_decay=0.1, lr_decay_steps=[90, 100], batch_size=256, random_status=0)
optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)

tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, initial_epoch=0)

After only one epoch the results of the model collapse. Is there any reason why this is happening?

leondgarse commented 1 year ago

What is the error info? Technically it should be basic_model=xxx_basic_xxx.h5, model=None.

tt = train.Train(data_path, save_path='check_mobile.h5', eval_paths=eval_paths,
    basic_model="pretrained_models/keras_mobilenet_emore_adamw_5e5_soft_baseline_before_arc_E80_BTO_E2_arc_sgdw_basic_agedb_30_epoch_119_0.959333.h5",
    model=None,
    lr_base=0.0001, lr_decay=0.1, lr_decay_steps=[90, 100], batch_size=256, random_status=0)
optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)

# May train 2 epochs of the header only by setting `bottleneckOnly=True`
# tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, bottleneckOnly=True, initial_epoch=0)
# Set a new optimizer after the `bottleneckOnly=True` stage
# optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)

tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, initial_epoch=0)
ak4ever87 commented 1 year ago

Hi, thanks for the answer. I tried executing your suggested code to fine-tune the model: it first runs two epochs to fine-tune only the bottleneck layer, then another two epochs to fine-tune the whole model.

tt = train.Train(data_path, save_path='check_mobile.h5', eval_paths=eval_paths,
    basic_model="pretrained_models/keras_mobilenet_emore_adamw_5e5_soft_baseline_before_arc_E80_BTO_E2_arc_sgdw_basic_agedb_30_epoch_119_0.959333.h5",
    model=None,
    lr_base=0.0001, lr_decay=0.1, lr_decay_steps=[90, 100], batch_size=256, random_status=0)
optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)
tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, bottleneckOnly=True, initial_epoch=0)
optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)
tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, initial_epoch=0)
Fine-tuning the bottleneck layer runs perfectly, but after that the full-model training crashes with an OOM error.
I can't understand the reason, since the model is already stored in the tt object.

This is the error message I received: "Allocator (GPU_0_bfc) ran out of memory trying to allocate 392.00MiB (rounded to 411041792) requested by op model/conv_dw_3_bn/FusedBatchNormV3"
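For reference, a minimal sketch of two common mitigations for this kind of GPU OOM, under the assumption that the extra memory comes from storing activations and gradients for the whole backbone once it is unfrozen (this is only a guess, not an answer from the thread): let TensorFlow allocate GPU memory on demand before the model is built, and/or pass a smaller batch_size to train.Train for the full fine-tuning stage.

import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all up front.
# Must be called before any model or session touches the GPU.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# Hypothetical smaller-batch run: halving batch_size roughly halves activation
# memory during full-model fine-tuning (other arguments as in the snippet above).
tt = train.Train(data_path, save_path='check_mobile.h5', eval_paths=eval_paths,
    basic_model="pretrained_models/keras_mobilenet_emore_adamw_5e5_soft_baseline_before_arc_E80_BTO_E2_arc_sgdw_basic_agedb_30_epoch_119_0.959333.h5",
    model=None,
    lr_base=0.0001, lr_decay=0.1, lr_decay_steps=[90, 100], batch_size=128, random_status=0)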

leondgarse commented 1 year ago