Closed: ak4ever87 closed this issue 1 year ago.
Hi, I am trying to fine-tune a pretrained MobileNet emb256 model using the MS1MV3 dataset; I use the following code. After only one epoch the results of the model collapse. Is there any reason why this is happening?
What is the error info? Technically it should be `basic_model=xxx_basic_xxx.h5, model=None`:
tt = train.Train(data_path, save_path='check_mobile.h5', eval_paths=eval_paths,
        basic_model="pretrained_models/keras_mobilenet_emore_adamw_5e5_soft_baseline_before_arc_E80_BTO_E2_arc_sgdw_basic_agedb_30_epoch_119_0.959333.h5",
        model=None,
        lr_base=0.0001, lr_decay=0.1, lr_decay_steps=[90, 100], batch_size=256, random_status=0)
optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)
# May train 2 epochs of the header only by setting `bottleneckOnly=True`
# tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, bottleneckOnly=True, initial_epoch=0)
# Set a new optimizer after `bottleneckOnly=True`
# optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)
tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, initial_epoch=0)
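In case it helps, the `bottleneckOnly=True` step is basically the usual "freeze the backbone, train only the new header" pattern. A minimal plain-Keras sketch of that idea (the layer names and the MobileNet head here are illustrative only, not this repo's actual model definition):

```py
from tensorflow import keras

# Illustrative backbone + new head; NOT the repo's actual structure.
backbone = keras.applications.MobileNet(input_shape=(112, 112, 3), include_top=False, weights=None, pooling="avg")
embedding = keras.layers.Dense(256, name="embedding")(backbone.output)
header = keras.layers.Dense(1000, name="new_header")(embedding)  # stand-in for the ArcFace header
model = keras.Model(backbone.input, header)

# "Bottleneck only" phase: freeze the backbone so only the new layers receive gradients.
backbone.trainable = False
model.compile(optimizer="sgd", loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit(...) for a couple of epochs here.

# Full fine-tuning phase: unfreeze everything and re-compile. This is the step
# that needs far more GPU memory, since gradients are now kept for every layer.
backbone.trainable = True
model.compile(optimizer="sgd", loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```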
Hi, thanks for the answer. I tried to execute your suggested code to fine-tune the model: it first runs two epochs to fine-tune the bottleneck layer only, then another two epochs to fine-tune the whole model:
tt = train.Train(data_path, save_path='check_mobile.h5', eval_paths=eval_paths,
        basic_model="pretrained_models/keras_mobilenet_emore_adamw_5e5_soft_baseline_before_arc_E80_BTO_E2_arc_sgdw_basic_agedb_30_epoch_119_0.959333.h5",
        model=None, lr_base=0.0001, lr_decay=0.1, lr_decay_steps=[90, 100], batch_size=256, random_status=0)
optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)
tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, bottleneckOnly=True, initial_epoch=0)
optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)
tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, initial_epoch=0)
The fine-tuning of the bottleneck layer runs perfectly; after that, the training collapses with an OOM error. I can't understand the reason, since the model is already stored in the `tt` object. This is the error message that I received:
"Allocator (GPU_0_bfc) ran out of memory trying to allocate 392.00MiB (rounded to 411041792) requested by op model/conv_dw_3_bn/FusedBatchNormV3"
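As a general aside, one quick check for allocator errors like this is enabling on-demand GPU memory growth before anything is placed on the GPU; it will not rescue a genuine memory shortage, but it rules out up-front pre-allocation as the cause. A sketch, assuming a single-GPU setup:

```py
import tensorflow as tf

# Must run before any tensors or models are created on the GPU.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```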
With `bottleneckOnly=True`, most model layers are set to `trainable=False`, so it runs rather fast. Then in the `bottleneckOnly=False` training, all layers are set to `trainable=True`, which needs much more GPU memory to store the gradients. I forgot whether it's an fp16 model or not; you may try creating the model in fp16 and then reloading the weights:
from tensorflow import keras

keras.mixed_precision.set_global_policy("mixed_float16")  # must be set before the model is built

import models
basic_model = models.buildin_models("mobilenet", dropout=0, emb_shape=256, output_layer='GDC', use_bias=True, scale=False)
basic_model.load_weights('pretrained_models/keras_mobilenet_emore_adamw_5e5_soft_baseline_before_arc_E80_BTO_E2_arc_sgdw_basic_agedb_30_epoch_119_0.959333.h5')
tt = train.Train(..., basic_model=basic_model, ...)
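A quick sanity check after switching the policy: under "mixed_float16" the layer computations run in float16 while the variables themselves stay float32, which is why reloading an fp32 checkpoint with `load_weights` still works. Continuing from the snippet above (exact printouts depend on your TF/Keras version):

```py
print(keras.mixed_precision.global_policy())  # mixed_float16
print(basic_model.dtype_policy)               # the model's dtype policy
print(basic_model.weights[0].dtype)           # variables are still float32
```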
In my basic test, `batch_size=256` requires 6065MiB of GPU memory.
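If the full fine-tuning stage still runs out of memory even with the fp16 model, the simplest fallback is a smaller batch size; for example, reusing the same arguments as above (the value 128 is just illustrative):

```py
# Hypothetical: same setup as above, only with a smaller batch size.
tt = train.Train(data_path, save_path='check_mobile.h5', eval_paths=eval_paths,
        basic_model=basic_model, model=None,
        lr_base=0.0001, lr_decay=0.1, lr_decay_steps=[90, 100],
        batch_size=128, random_status=0)
```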