Closed: ak4ever87 closed this issue 1 year ago.
Hi, I am trying to fine-tune a pretrained MobileNet emb256 model using the MS1MV3 dataset; I use the following code. After only one epoch the results of the model collapse. Is there any reason why this is happening?
What is the error info? Technically it should be `basic_model=xxx_basic_xxx.h5, model=None`:
tt = train.Train(data_path, save_path='check_mobile.h5', eval_paths=eval_paths,
        basic_model="pretrained_models/keras_mobilenet_emore_adamw_5e5_soft_baseline_before_arc_E80_BTO_E2_arc_sgdw_basic_agedb_30_epoch_119_0.959333.h5",
        model=None,
        lr_base=0.0001, lr_decay=0.1, lr_decay_steps=[90, 100], batch_size=256, random_status=0)
optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)
# May train 2 epochs of the header only by setting `bottleneckOnly=True`
# tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, bottleneckOnly=True, initial_epoch=0)
# Set a new optimizer after `bottleneckOnly=True`
# optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)
tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, initial_epoch=0)
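In case it helps, the `bottleneckOnly=True` step is basically the usual "freeze the backbone, train only the new header" pattern. A minimal plain-Keras sketch of that idea (the layer names and the MobileNet head here are illustrative only, not this repo's actual model definition):

```py
from tensorflow import keras

# Illustrative backbone + new head; NOT the repo's actual structure.
backbone = keras.applications.MobileNet(input_shape=(112, 112, 3), include_top=False, weights=None, pooling="avg")
embedding = keras.layers.Dense(256, name="embedding")(backbone.output)
header = keras.layers.Dense(1000, name="new_header")(embedding)  # stand-in for the ArcFace header
model = keras.Model(backbone.input, header)

# "Bottleneck only" phase: freeze the backbone so only the new layers receive gradients.
backbone.trainable = False
model.compile(optimizer="sgd", loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# model.fit(...) for a couple of epochs here.

# Full fine-tuning phase: unfreeze everything and re-compile. This is the step
# that needs far more GPU memory, since gradients are now kept for every layer.
backbone.trainable = True
model.compile(optimizer="sgd", loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```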
Hi, thanks for the answer. I tried to execute your suggested code to fine-tune the model: it first runs two epochs to fine-tune the bottleneck layer only, then another two epochs to fine-tune the whole model:
tt = train.Train(data_path, save_path='check_mobile.h5', eval_paths=eval_paths,
        basic_model="pretrained_models/keras_mobilenet_emore_adamw_5e5_soft_baseline_before_arc_E80_BTO_E2_arc_sgdw_basic_agedb_30_epoch_119_0.959333.h5",
        model=None, lr_base=0.0001, lr_decay=0.1, lr_decay_steps=[90, 100], batch_size=256, random_status=0)
optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)
tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, bottleneckOnly=True, initial_epoch=0)
optimizer = tfa.optimizers.SGDW(learning_rate=0.0001, weight_decay=5e-6, momentum=0.9)
tt.train_single_scheduler(loss=losses.ArcfaceLoss(scale=64), epoch=2, optimizer=optimizer, initial_epoch=0)
The fine-tuning of the bottleneck layer runs perfectly; after that, the training collapses with an OOM error. I can't understand the reason, since the model is already stored in the `tt` object. This is the error message that I received:
"Allocator (GPU_0_bfc) ran out of memory trying to allocate 392.00MiB (rounded to 411041792) requested by op model/conv_dw_3_bn/FusedBatchNormV3"
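As a general aside, one quick check for allocator errors like this is enabling on-demand GPU memory growth before anything is placed on the GPU; it will not rescue a genuine memory shortage, but it rules out up-front pre-allocation as the cause. A sketch, assuming a single-GPU setup:

```py
import tensorflow as tf

# Must run before any tensors or models are created on the GPU.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```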
With `bottleneckOnly=True`, most model layers are set to `trainable=False`, so it runs rather fast. Then in the `bottleneckOnly=False` training, all layers are set to `trainable=True`, which needs much more GPU memory to store the gradients. I forgot whether it's an fp16 model or not; you may try creating the model in fp16 and then reloading the weights:
from tensorflow import keras

keras.mixed_precision.set_global_policy("mixed_float16")  # must be set before the model is built

import models
basic_model = models.buildin_models("mobilenet", dropout=0, emb_shape=256, output_layer='GDC', use_bias=True, scale=False)
basic_model.load_weights('pretrained_models/keras_mobilenet_emore_adamw_5e5_soft_baseline_before_arc_E80_BTO_E2_arc_sgdw_basic_agedb_30_epoch_119_0.959333.h5')
tt = train.Train(..., basic_model=basic_model, ...)
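A quick sanity check after switching the policy: under "mixed_float16" the layer computations run in float16 while the variables themselves stay float32, which is why reloading an fp32 checkpoint with `load_weights` still works. Continuing from the snippet above (exact printouts depend on your TF/Keras version):

```py
print(keras.mixed_precision.global_policy())  # mixed_float16
print(basic_model.dtype_policy)               # the model's dtype policy
print(basic_model.weights[0].dtype)           # variables are still float32
```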
In my basic test, `batch_size=256` requires 6065MiB of GPU memory.
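If the full fine-tuning stage still runs out of memory even with the fp16 model, the simplest fallback is a smaller batch size; for example, reusing the same arguments as above (the value 128 is just illustrative):

```py
# Hypothetical: same setup as above, only with a smaller batch size.
tt = train.Train(data_path, save_path='check_mobile.h5', eval_paths=eval_paths,
        basic_model=basic_model, model=None,
        lr_base=0.0001, lr_decay=0.1, lr_decay_steps=[90, 100],
        batch_size=128, random_status=0)
```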