leondgarse / Keras_insightface

Insightface Keras implementation

Resume training #100

Closed HamDan1999 closed 1 year ago

HamDan1999 commented 1 year ago

Hey, I hope you are doing well.

My code crashed on the 22nd epoch while training one of your models, with the script shown below:

import os
import losses, train, models
from tensorflow import keras  # needed for keras.optimizers.SGD below

data_basic_path = '/datasets/ms1m-retinaface-t1'
data_path = data_basic_path + '_112x112_folders'
eval_paths = [os.path.join(data_basic_path, ii) for ii in ['lfw.bin', 'cfp_fp.bin', 'agedb_30.bin']]

basic_model = models.buildin_models("ghostnet", dropout=0, emb_shape=512, output_layer='GDC', bn_momentum=0.9, bn_epsilon=1e-5)
basic_model = models.add_l2_regularizer_2_model(basic_model, weight_decay=5e-4, apply_to_batch_normal=False)
basic_model = models.replace_ReLU_with_PReLU(basic_model)

tt = train.Train(data_path, eval_paths=eval_paths,
    save_path='TT_ghostnet_prelu_GDC_arc_emb512_dr0_sgd_l2_5e4_bs1024_ms1m_bnm09_bne1e5_cos16_batch_fixed.h5',
    basic_model=basic_model, model=None, lr_base=0.1, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-5,
    batch_size=1024, random_status=0, eval_freq=2000, output_weight_decay=1
)

optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
sch = [
    {"loss": losses.ArcfaceLoss(scale=32), "epoch": 1, "optimizer": optimizer},
    {"loss": losses.ArcfaceLoss(scale=64), "epoch": 48},
]
tt.train(sch, 0)

How can I resume training from the previously saved model? Could you show me or write an example of resuming training? (Note: I tried the "Restore training from break point" section, but it did not work for me; it seems I messed something up.)

HamDan1999 commented 1 year ago

I currently have three ".h5" files and one ".json" file.
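For reference, a minimal sketch for listing those saved artifacts, assuming train.Train's default checkpoints/ save directory (the directory name is taken from the reload behavior described below):

import glob

# Print everything saved under the default checkpoints/ directory;
# with the default settings these should be the saved model .h5 files
# and a training-history .json (an assumption about the save layout).
for ff in sorted(glob.glob("checkpoints/*")):
    print(ff)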

leondgarse commented 1 year ago

Technically, you can get it back by setting basic_model=None, passing the initial_epoch to tt.train, and setting the remaining epochs in sch.

import os
import losses, train, models
data_basic_path = '/datasets/ms1m-retinaface-t1'
data_path = data_basic_path + '_112x112_folders'
eval_paths = [os.path.join(data_basic_path, ii) for ii in ['lfw.bin', 'cfp_fp.bin', 'agedb_30.bin']]

# basic_model = models.buildin_models("ghostnet", dropout=0, emb_shape=512, output_layer='GDC', bn_momentum=0.9, bn_epsilon=1e-5)
# basic_model = models.add_l2_regularizer_2_model(basic_model, weight_decay=5e-4, apply_to_batch_normal=False)
# basic_model = models.replace_ReLU_with_PReLU(basic_model)
basic_model = None  # >>>> 1st: set basic_model to None

tt = train.Train(data_path, eval_paths=eval_paths,
    save_path='TT_ghostnet_prelu_GDC_arc_emb512_dr0_sgd_l2_5e4_bs1024_ms1m_bnm09_bne1e5_cos16_batch_fixed.h5',
    basic_model=basic_model, model=None, lr_base=0.1, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-5,
    batch_size=1024, random_status=0, eval_freq=2000, output_weight_decay=1
)

# optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
sch = [
    # {"loss": losses.ArcfaceLoss(scale=32), "epoch": 1, "optimizer": optimizer},
    {"loss": losses.ArcfaceLoss(scale=64), "epoch": 28},  # >>>> 2nd: set remaining epochs 50 - 22
]
tt.train(sch, 22)  # >>>> 3rd: set initial_epoch 22

When basic_model=None and model=None, the script will try to reload the model from checkpoints/{save_path}. Alternatively, you can specify the model explicitly as model="checkpoints/TT_ghostnet_prelu_GDC_arc_emb512_dr0_sgd_l2_5e4_bs1024_ms1m_bnm09_bne1e5_cos16_batch_fixed.h5".
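A minimal sketch of that explicit-path variant, using the same checkpoint file name as the run above with all other arguments unchanged:

# Reload explicitly from the saved .h5 instead of relying on the
# checkpoints/{save_path} lookup; the file name assumes the run above.
tt = train.Train(data_path, eval_paths=eval_paths,
    save_path='TT_ghostnet_prelu_GDC_arc_emb512_dr0_sgd_l2_5e4_bs1024_ms1m_bnm09_bne1e5_cos16_batch_fixed.h5',
    basic_model=None,
    model="checkpoints/TT_ghostnet_prelu_GDC_arc_emb512_dr0_sgd_l2_5e4_bs1024_ms1m_bnm09_bne1e5_cos16_batch_fixed.h5",
    lr_base=0.1, lr_decay=0.5, lr_decay_steps=16, lr_min=1e-5,
    batch_size=1024, random_status=0, eval_freq=2000, output_weight_decay=1
)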

HamDan1999 commented 1 year ago

Hi. I just tried the above script and it is training; I will report back once I get the validation results.

Thanks a lot for your help.

HamDan1999 commented 1 year ago

It is working, thanks for your help.