keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

How to continue monitor loss with ModelCheckPoint of a saved model #11420

Closed Golbstein closed 3 years ago

Golbstein commented 6 years ago

Hey there,

I'm training my model with the following callback:

checkpoint = ModelCheckpoint('weights/icnet_768_ft.h5', 
                      mode = 'max', save_best_only=True, save_weights_only=True, 
                      monitor='val_reshape_1_Mean_IOU', verbose=1)

I've trained the model for 50 epochs and saved the weights of the best "IOU" score.

Now I want to load the model with these weights and keep monitoring the checkpoint using my best IOU score.

Example: in epoch 47 I got the best IOU score of 0.6. When I initialize the model with the weights from this epoch and start training again, it saves the weights after the first epoch even though the IOU has dropped to 0.58, because the callback treats 0.58 as an improvement over its initial best value of -inf.
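The behaviour described above can be reproduced without Keras. The sketch below is a simplified stand-in for the "best so far" bookkeeping of a checkpoint callback, not Keras's actual implementation; the class name BestTracker and its methods are hypothetical and exist only for illustration.

```python
import math


class BestTracker:
    """Simplified stand-in for a checkpoint callback's 'best so far' state."""

    def __init__(self, mode="max", best=None):
        self.mode = mode
        if best is None:
            # A fresh tracker has seen no score yet, so any first value "improves".
            best = -math.inf if mode == "max" else math.inf
        self.best = best

    def update(self, current):
        """Return True (i.e. 'save the weights') when `current` beats the best value."""
        if self.mode == "max":
            improved = current > self.best
        else:
            improved = current < self.best
        if improved:
            self.best = current
        return improved


# Resuming with a fresh tracker: 0.58 > -inf, so the 0.6 weights get overwritten.
fresh = BestTracker(mode="max")
fresh_saved = fresh.update(0.58)

# Resuming with the previous best restored: 0.58 does not beat 0.6, nothing is saved.
seeded = BestTracker(mode="max", best=0.6)
seeded_saved = seeded.update(0.58)

print(fresh_saved, seeded_saved)
```

The second case is what the question asks for: the best score from the earlier run has to be handed back to the callback before training resumes.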

brge17 commented 6 years ago

This is actually a larger issue: the state of any stateful callback is lost when resuming training.
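One generic way around losing callback state is to write it to a small file next to the weights and read it back before resuming. The sketch below assumes the state fits in a JSON-serializable dict; the helper names save_state and load_state and the file name checkpoint_state.json are hypothetical, and nothing here is part of the Keras API.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write a small dict of callback state (e.g. {'best': 0.6}) to a JSON file."""
    with open(path, "w") as f:
        json.dump(state, f)


def load_state(path, default=None):
    """Return the saved state, or `default` when no state file exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)


# Demo in a temporary directory so that no real files are touched.
state_path = os.path.join(tempfile.mkdtemp(), "checkpoint_state.json")

first_run = load_state(state_path, default={})       # nothing saved yet
save_state(state_path, {"best": 0.6, "epoch": 47})   # values from the example above
resumed = load_state(state_path, default={})

print(first_run, resumed)
```

On resume, a value such as resumed["best"] could then be used to seed the callback's best score.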

cloudseasail commented 5 years ago

I tried the following workaround, and it looks good so far. Save the best loss from your last training run, then use it to initialize the best value in the ModelCheckpoint callback:

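from keras.callbacks import ModelCheckpoint

class ModelCheckpointWrapper(ModelCheckpoint):
    # best_init is keyword-only so a positional filepath is not mistaken for it
    def __init__(self, *args, best_init=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Seed the "best so far" value with the result of the previous run
        if best_init is not None:
            self.best = best_init

checkpointer = ModelCheckpointWrapper(best_init=BEST_LOSS, filepath=saved_model, verbose=1, save_best_only=True)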

And I use csvlogger to save my training results like this

import csv
import os

if os.path.exists(saved_log):
    with open(saved_log) as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        # Resume from the epoch after the last one logged
        EPOCH_INIT = int(rows[-1]['epoch'])+1
        BEST_LOSS = float(rows[-1]['val_loss'])
        print('Resume training from EPOCH_INIT {0} ,BEST_LOSS {1}, BEST_ACC {2}'.format(EPOCH_INIT, BEST_LOSS, rows[-1]['val_acc']))

csvlogger = callbacks.CSVLogger(saved_log, separator=',', append=True)

model.fit_generator(
      train_generator,
      ......
      callbacks = [checkpointer, csvlogger],
      initial_epoch=EPOCH_INIT)

Jingnan-Jia commented 4 years ago

@cloudseasail Thanks for your code. However, I think BEST_LOSS in your code may not be the actual best loss: float(rows[-1]['val_loss']) is just the last loss value saved in the CSV log file, not the lowest one. We can select the best loss with the following code:

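import os

import pandas as pd

if os.path.exists(saved_log):
    df = pd.read_csv(saved_log)
    # .iloc[-1] selects the last row by position; df.epoch[-1] would look up the label -1
    EPOCH_INIT = int(df.epoch.iloc[-1]) + 1
    # Lowest validation loss over all logged epochs
    BEST_LOSS = float(df.val_loss.min())
    print('Retraining from EPOCH_INIT {0} ,BEST_LOSS {1}'.format(EPOCH_INIT, BEST_LOSS))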
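The original question monitors an IOU metric with mode='max', so the best value there is the maximum rather than the minimum. The sketch below generalizes the log-reading step using only the standard library. It assumes a CSVLogger-style file with an epoch column and one column per metric; the function name resume_info, the column name val_iou and the demo values are made up for illustration.

```python
import csv
import os
import tempfile


def resume_info(log_path, monitor="val_loss", mode="min"):
    """Return (initial_epoch, best_value) read from a CSVLogger-style file.

    Returns (0, None) when the log is missing or contains no rows.
    """
    if not os.path.exists(log_path):
        return 0, None
    with open(log_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 0, None
    values = [float(row[monitor]) for row in rows]
    best = max(values) if mode == "max" else min(values)
    # Resume from the epoch after the last one logged.
    return int(rows[-1]["epoch"]) + 1, best


# Demo with a tiny made-up log written to a temporary directory.
demo_log = os.path.join(tempfile.mkdtemp(), "training.csv")
with open(demo_log, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["epoch", "val_loss", "val_iou"])
    writer.writerows([[0, 0.9, 0.40], [1, 0.5, 0.60], [2, 0.7, 0.58]])

epoch_init, best_loss = resume_info(demo_log)
_, best_iou = resume_info(demo_log, monitor="val_iou", mode="max")

print(epoch_init, best_loss, best_iou)
```

The returned best value can be passed as best_init to the ModelCheckpointWrapper shown earlier in the thread, and the returned epoch as initial_epoch.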