keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Not able to resume training after loading model + weights #2378

Closed trane293 closed 6 years ago

trane293 commented 8 years ago

I work at an institute where we are not allowed to run a workstation overnight, so I had to split the training process across multiple days. I trained a model for 10 epochs, which took approximately 1 day, and saved the model + weights using the methods described in the Keras documentation, like this:

 modelPath = './SegmentationModels/'
 modelName = 'Arch_1_10'
 sys.setrecursionlimit(10000)
 json_string = model.to_json()
 open(str(modelPath + modelName + '.json'), 'w').write(json_string)
 model.save_weights(str(modelPath + modelName + '.h5'))
 import cPickle as pickle
 with open(str(modelPath + modelName + '_hist.pckl'), 'wb') as f:
     pickle.dump(history.history, f, -1)

and load the model the next day like this:

 modelPath = './SegmentationModels/'
 modelName = 'Arch_1_10'
 model = model_from_json(open(str(modelPath + modelName + '.json')).read())
 model.compile(loss='categorical_crossentropy', optimizer=optim_sgd)
 model.load_weights(str(modelPath + modelName + '.h5'))
 #     import cPickle as pickle
 #     with open(str(modelPath + modelName + '_hist.pckl'), 'r') as f:
 #         history = pickle.load(f)
 model.summary()

but when I restarted the training process, it initialized to the same training and validation loss that I had gotten on the 1st epoch the previous day! It should have started with an accuracy of 60%, which was the last best accuracy from the previous day, but it didn't.

I have also tried to call model.compile() before and after load_weights, as well as leaving it out altogether, but that doesn't work either.

Please help me in this regard. Thanks in advance.

NasenSpray commented 8 years ago

Does it work when you construct the model with the original code instead of loading it from json?

trane293 commented 8 years ago

Nope. It doesn't. Still starts with 20% accuracy as it did on the 1st epoch.

NasenSpray commented 8 years ago

Did the weights file already exist before you tried to save them?

trane293 commented 8 years ago

It did, but now I have tried using the ModelCheckpoint callback, which saves a weights file after each epoch. In my case the last weights file, for epoch 70, was created (it did not exist before). I tried loading that into the model built i) from JSON and ii) from the original code, but still no luck.

NasenSpray commented 8 years ago

It did

That's it, save_weights() doesn't overwrite existing files unless you also pass overwrite=True. It should have asked for user input, though.
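
In code, that would be something like this (a sketch reusing the variable names from the earlier snippet):

model.save_weights(str(modelPath + modelName + '.h5'), overwrite=True)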

trane293 commented 8 years ago

Actually, sorry for my last comment: all the architectures I save and all the weights I save have unique names, and yes, I know save_weights() asks for user input when overwriting a file, but in my case it doesn't, since the files do not exist. So we can safely rule out the possibility that the file was not overwritten.

trane293 commented 8 years ago

(screenshot of the saved weight files) You can see the weights saved after every epoch. When I try to load these weights, the training still restarts from where it started initially.

Here's my full loadModel() function:

# optimizers
optim_sgd = keras.optimizers.SGD(lr=0.01, momentum=0.9, decay=0.002, nesterov=True)
optim_adadelta = keras.optimizers.Adadelta()
optim_adagrad = keras.optimizers.Adagrad()
optim_adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)

imageSize = (19, 19)
img_rows, img_cols = imageSize[0], imageSize[1]
batch_size = 200
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
nb_pool = 2
# convolution kernel size
nb_conv = 3

nb_epoch = 1000

# callbacks
def scheduler(epoch):
    if epoch % 10 == 0 and epoch != 0:
        x = float(input("Enter a learning rate (Current: {}): ".format(model.optimizer.lr.get_value())))
        model.optimizer.lr.set_value(x)
        print("Changed learning rate to: {}".format(model.optimizer.lr.get_value()))
    return model.optimizer.lr.get_value()

change_lr = oc.LearningRateScheduler(scheduler)
early_stop = oc.StopEarly(10)
plot_history = oc.PlotHistory()

# # Load the model
modelPath = './SegmentationModels/'
modelName = 'Arch_1_40'
model = model_from_json(open(str(modelPath + modelName + '.json')).read())
model.compile(loss='categorical_crossentropy', optimizer=optim_sgd)
model.load_weights(str(modelPath + 'weights.70-0.74.hdf5'))
#     import cPickle as pickle
#     with open(str(modelPath + modelName + '_hist.pckl'), 'r') as f:
#         history = pickle.load(f)
model.summary()

NasenSpray commented 8 years ago

That's strange...

Replace this line with if 1: and try to load the weights again. nope, don't

trane293 commented 8 years ago

I found out that I was using an older version of Keras. I upgraded it and found that model_summary() is no longer there. Delving deeper, I found that it has now been changed to print_summary().

Anyway, I tried changing the line of code you asked about, but that didn't work either.

trane293 commented 8 years ago

UPDATE: Came to the institute this morning, built the model using the original code and loaded the model weights saved by the ModelCheckpoint callback. Started training and it still restarts from the beginning, with no memory of past metrics. The performance is actually even worse than it was when the network first started training. Normally the network starts at 20% accuracy and reaches around 70% in 60 epochs, but when I restart the training process using the loaded weights, the network starts at 20% at epoch 1 and keeps going lower and lower, down to 16% at epoch 5. I have no idea what's happening here.

UPDATE 2: When I evaluate the loaded model + weights on the same validation data, I get 60% accuracy, as intended. But if I call model.fit(), training starts from 20% and oscillates around it. So I can confirm that the weights are being loaded correctly, since the model can make predictions, but the model is not able to resume training.

Please help! @NasenSpray

carlthome commented 8 years ago

So what model do you have, precisely? Perhaps some weights aren't actually saved or loaded at all (like the states in an LSTM or something)? Or perhaps they are accidentally shuffled (flipped dimensions or whatever) somehow.

EDIT: Check https://github.com/fchollet/keras/issues/2378#issuecomment-211910392

carlthome commented 8 years ago

Grasping at straws here, but some optimizers are stateful, right? Are you just using SGD? I'm not familiar with this part of Keras, but perhaps the optimizer state should be saved as well; otherwise, when you resume learning and start a new epoch with pretrained weights instead of your original weight initialization, training might diverge due to high learning rates.
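
For what it's worth, the optimizer's get_config() only returns hyperparameters, not any accumulated per-weight state, so a quick check of what actually gets captured is (sketch):

cfg = model.optimizer.get_config()
print(cfg)  # e.g. lr / rho / epsilon for RMSprop -- no moving averages or accumulators here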

NasenSpray commented 8 years ago

Run this plz

import numpy as np

model = make_model()
w1 = model.get_weights()
model.load_weights('your_saved_weights.h5')
w2 = model.get_weights()

for a, b in zip(w1, w2):
    if np.all(a == b):
        print("wtf is happening")

Does it print?

trane293 commented 8 years ago

It doesn't print. The weights are loaded successfully, I suppose. It's the training procedure that's problematic. After running this script (it didn't print anything), I ran model.fit() and it started with a loss 10x higher than it originally was at epoch 1, and with 20% accuracy again sigh

carlthome commented 8 years ago

Obviously something must be different as you're seeing different results. Perhaps get_weights() doesn't actually return everything it could.

I'm curious whether you have the same problem just by restarting training in the same session, never mind loading a model and its weights with Keras' builtins. If not, consider saving states with something like this instead.

trane293 commented 8 years ago

Thanks. When I restart an interrupted training process, the training continues from where it left off successfully. The problem is when I load the model and weights.

My main aim is to save "snapshots" or "states" of the model that can be loaded back and used as a starting point when training the next day. I'll have a look at the shelve module too, thanks! But I think the problem with Keras should be debugged as well.

Please guide me on how I can help you reproduce this issue so you can fix it sometime in the future. I would love to help.

NasenSpray commented 8 years ago

In your loadModel(), hardcode the learning rate to 0. Does it make the loaded model better?

- and -

Instead of training, just evaluate the loaded model on the training set. Still worse than before?
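
In code, the two checks could look something like this (a sketch using the Theano-style lr handle from loadModel() above; X_train/Y_train stand in for the real training arrays):

model.optimizer.lr.set_value(0.0)                          # 1) hardcode the learning rate to 0
model.fit(X_train, Y_train, batch_size=200, nb_epoch=1)    #    metrics should hold steady if loading works

score = model.evaluate(X_train, Y_train, batch_size=200)   # 2) just evaluate on the training set
print(score)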

trane293 commented 8 years ago

I'll try your suggestion as soon as I get to the institute tomorrow.

trane293 commented 8 years ago

I could not try validating on the training set for some reason, but I solved my problem by pickling the model after training it for the day. I restarted my IPython Notebook kernel, loaded the pickled model, and restarted the training process. Fortunately it started from where it left off.

I will also try your suggestion and report back with what I get.
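
For reference, the pickling workaround could look roughly like this (a sketch with made-up file names; the recursion-limit bump mirrors the earlier snippet):

import sys
import cPickle as pickle

sys.setrecursionlimit(10000)

# end of the day: pickle the whole fitted model object
with open('model_snapshot.pckl', 'wb') as f:
    pickle.dump(model, f, -1)

# next day, in a fresh kernel: load it back and keep training
with open('model_snapshot.pckl', 'rb') as f:
    model = pickle.load(f)
model.fit(X_train, Y_train, batch_size=200, nb_epoch=10)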

carlthome commented 8 years ago

Dang! That clearly means some states are not saved properly, whether they are weights or something else.

I assume the intended use case for having load and save functions in Keras has more to do with being able to share pretrained models, like people do with Caffe, rather than with pausing your own training, in which case pickling is probably safer.

I do wonder, though, whether it wouldn't be easier to scrap the manual parsing of state, which is bug-prone, and simply have everything rely on Python's built-in object serialization with pickle, shelve or similar. Keras' builtins are pretty meaty, though, so I'm probably missing something important about why they're needed.

I could do a PR with shelve for save_model(...), load_model(...), save_weights(...), load_weights(...) if it is of interest @fchollet.
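
For what it's worth, the shelve route would just be a thin layer over pickle, roughly like this (a sketch with a hypothetical file name, not an existing Keras API; the same recursion-limit caveat from earlier in the thread applies):

import shelve

db = shelve.open('keras_checkpoint')   # hypothetical file name
db['model'] = model                    # shelve pickles the object under the hood
db.close()

# later, in a new session
db = shelve.open('keras_checkpoint')
model = db['model']
db.close()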

tboquet commented 8 years ago

From what @carlthome said here, you could try to take a snapshot of the optimizer too. I have 2 working functions that serialize the model together with the optimizer, as in the pre-1.0 release. Note that I return a dictionary instead of a JSON dump. It's basically very similar to the old functionality. You could try them and let me know if they work (I didn't have the time to test them extensively):


# imports assumed for this snippet (Keras 1.x-era APIs)
import six
from keras import optimizers
from keras.utils.layer_utils import layer_from_config


def get_function_name(o):
    """Utility function to return a function's name (or pass a string through).
    """
    if isinstance(o, six.string_types):
        return o
    else:
        return o.__name__

def to_dict_w_opt(model):
    """Serialize a model and add the config of the optimizer and the loss.
    """
    config = dict()
    config_m = model.get_config()
    config['config'] = {
        'class_name': model.__class__.__name__,
        'config': config_m,
    }
    if hasattr(model, 'optimizer'):
        config['optimizer'] = model.optimizer.get_config()
    if hasattr(model, 'loss'):
        if isinstance(model.loss, dict):
            config['loss'] = dict([(k, get_function_name(v))
                                   for k, v in model.loss.items()])
        else:
            config['loss'] = get_function_name(model.loss)

    return config

def model_from_dict_w_opt(model_dict, custom_objects=None):
    """Builds a model from a serialized model using `to_dict_w_opt`
    """
    if custom_objects is None:
        custom_objects = {}

    model = layer_from_config(model_dict['config'],
                              custom_objects=custom_objects)

    if 'optimizer' in model_dict:
        model_name = model_dict['config'].get('class_name')
        # if it has an optimizer, the model is assumed to be compiled
        loss = model_dict.get('loss')

        # if a custom loss function is passed replace it in loss
        if model_name == "Graph":
            for l in loss:
                for c in custom_objects:
                    if loss[l] == c:
                        loss[l] = custom_objects[c]
        elif model_name == "Sequential" and loss in custom_objects:
            loss = custom_objects[loss]

        optimizer_params = dict([(
            k, v) for k, v in model_dict.get('optimizer').items()])
        optimizer_name = optimizer_params.pop('name')
        optimizer = optimizers.get(optimizer_name, optimizer_params)

        if model_name == "Sequential":
            sample_weight_mode = model_dict.get('sample_weight_mode')
            model.compile(loss=loss,
                          optimizer=optimizer,
                          sample_weight_mode=sample_weight_mode)
        elif model_name == "Graph":
            sample_weight_modes = model_dict.get('sample_weight_modes', None)
            loss_weights = model_dict.get('loss_weights', None)
            model.compile(loss=loss,
                          optimizer=optimizer,
                          sample_weight_modes=sample_weight_modes,
                          loss_weights=loss_weights)
    return model

@carlthome, if this solution is OK, could we work on a PR that includes these functionalities and the other relevant elements (weights, states, ...)? It should be possible to include all of this in an HDF5 file.

carlthome commented 8 years ago

@tboquet, cool! Sounds good to me! I'm no authority on Keras but I would probably have based loading/saving around object serialization of Model() and Sequential() just to be safe. In the future, new things will probably be stateful which will screw up things again. The slight additional overhead of saving too much is worth the extra stability and reduced code complexity, in my mind.

rtatishvili commented 8 years ago

This is what I am using (taken from the Keras docs) and it works without a problem on Keras 1.0:

def load_model():
    model = model_from_json(open('model.json').read())
    model.load_weights('weights.h5')
    model.compile(optimizer=rmsprop, loss='mse')
    return model

def save_model(model):    
    json_string = model.to_json()
    open('model.json', 'w').write(json_string)
    model.save_weights('weights.h5', overwrite=True)

I had one example with, say, 10 epochs and another example with save and load in a loop of 10 iterations of 1 epoch each, and the loss for both was similarly decreasing. Additionally, both resulting models were predicting fine.
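
A sketch of that save/load-in-a-loop comparison, using the helpers defined above (X_train/y_train and build_model() are placeholders):

model = build_model()
for i in range(10):
    if i > 0:
        model = load_model()   # rebuild from model.json + weights.h5
    model.fit(X_train, y_train, nb_epoch=1, batch_size=32)
    save_model(model)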

Have you tried to call model.load_weights before model.compile?

trane293 commented 8 years ago

Thank you for your suggestions, everyone. I will try them again and report back with what I find. If the method described in the official Keras documentation works for everyone, it should work for me too. I will dig a little deeper and find out whether it's something I am doing wrong.

carlthome commented 8 years ago

I ran into a similar problem today. It really seems like it could be the optimizer that needs to be saved/loaded too, aside from the weights.

Basically anything like this seems to go bonkers (in my case loss='mse' and optimizer='rmsprop'):

# Starting fresh, training for a while and saving the weights to file.
model = create_model()
model.compile(...)
model.fit(...)
model.save_weights(...)

# Creating the model again, but loading the previous weights and resuming training.
model = create_model()
model.compile(...)
model.load_weights(...)
model.fit(...) # Diverges!

The data is the same in both fit calls.

trane293 commented 8 years ago

@carlthome Had the same problem. I haven't checked the current status recently, but now I use vanilla cPickle to pickle my trained model. Loading the pickled model and resuming training seems to work just as expected. However, I'm not sure about the JSON + h5 weight saving/loading functionality. If you are having the same problem then there must be something wrong.

NasenSpray commented 8 years ago

@carlthome: RMSprop makes really shitty updates during the first couple of steps which easily wreck pre-trained models. Could you retry with plain SGD?
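
i.e., something like this after load_weights(), before resuming fit() (a sketch; the low learning rate is just a guess):

from keras.optimizers import SGD

model.compile(loss='mse', optimizer=SGD(lr=1e-3, momentum=0.9))
model.fit(X_train, y_train, nb_epoch=1)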

Rocketknight1 commented 8 years ago

I also encountered this problem training a 2-layer LSTM with one dense layer at the end. Testing showed the following:

- Compiling two identical models in the same script, training the first model and then loading the weights into the second model via save_weights and load_weights worked as it should, even if the two models had separate optimizer instances. If I did this and then started training with the second model, its training loss was the same as the first model's training loss when the weights were saved, as expected.

- However, once Python was closed and reopened, loading weights saved in the previous instance resulted in, if anything, a -worse- loss at the start of training than the untrained model, though it quickly learned again.

- I'm not sure the optimizer is at fault, because I've tried saving the weights from a model, reloading them and then testing predictions without any further training. If the two models were compiled in the same session it works fine, but if I close the session, start a new session, compile a new model and load the previous session's weights, then its predictions are garbage.

Rocketknight1 commented 8 years ago

Also, I'm using the Theano backend and training on Windows with CUDA, which is probably a weird use-case. Not sure what backend/OS the other people with this problem are using.

Rocketknight1 commented 8 years ago

Wait, I've been -extremely- stupid, please ignore.

For the interested, I was making a character-prediction RNN with a one-hot character encoding, but instead of pickling the map of characters to one-hot indices I was generating it in the code each time from a set of allowed characters using enumerate(). This of course meant that the mapping generated by enumerate() was different every time, because sets have no guaranteed order, which explains why everything worked fine until I restarted the script (and so regenerated the mapping).

This is embarrassingly obvious in retrospect.
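
For anyone hitting the same thing, a minimal sketch of a deterministic mapping (variable names are made up):

chars = sorted(set(text))   # sets have no guaranteed order, so sort before enumerating
char_to_idx = dict((c, i) for i, c in enumerate(chars))
idx_to_char = dict((i, c) for i, c in enumerate(chars))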

IanLondon commented 8 years ago

I'm having this same issue using adagrad. After hours of training, when I load weights and resume, my MSE goes back up to where it started on the first epoch (the first epoch ever, where it was using the initial random weights).

What's the disadvantage of using vanilla cPickle instead of the save_weights and to_json (which don't seem to work unless you're using SGD)?

thomasgolda commented 8 years ago

Heyho. I'm new to Deep Learning and Keras and ran into the same / a similar issue. I trained my model with SGD for some time and saved the weights after each epoch using the save_weights() function. When I load weights from a particular epoch and I use SGD again, everything is fine (evaluation metrics are still good).

Additionally, I tried to reuse my already-learned weights but with a different optimizer for further training. When choosing Adam, Adagrad or RMSprop, the evaluation metrics dropped and it looked as if the learning had started from scratch.

How can this happen? Why is everything fine when I use SGD again - even with a changed learning rate - but not when using a different optimizer?

Thanks for your help!

EDIT:

@carlthome: RMSprop makes really shitty updates during the first couple of steps which easily wreck pre-trained models. Could you retry with plain SGD?

@NasenSpray Hmm. Could this be my problem? As far as I know all my chosen optimizers are related to RMSprop. Could they all 'destroy' my already learned weights and affect the performance in such negative way?

trane293 commented 8 years ago

It is generally not advisable to retrain a pretrained model with an altogether different optimizer from the one it was trained with. That just doesn't make much sense. My question is: do you have a valid reason for this setting, where you want to train a pre-trained network using a different optimizer like RMSprop or Adam?

thomasgolda commented 8 years ago

Does this also apply to partially pretrained models? For example, if you have a network with 5 convolutional layers and you take the weights for the first 3 layers from a pretrained network (transfer learning) and set trainable=False for those layers?
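
For illustration, the partially-pretrained setup described above might look like this in Keras 1.x-style code (layer sizes, file name, and the assumption that the saved weights match this architecture are all made up):

from keras.models import Sequential
from keras.layers import Convolution2D, Flatten, Dense

model = Sequential()
model.add(Convolution2D(32, 3, 3, activation='relu', input_shape=(3, 64, 64)))
model.add(Convolution2D(32, 3, 3, activation='relu'))
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(Flatten())
model.add(Dense(10, activation='softmax'))

model.load_weights('pretrained.h5')   # weights saved from the source network
for layer in model.layers[:3]:        # freeze the first 3 conv layers
    layer.trainable = False
model.compile(loss='categorical_crossentropy', optimizer='sgd')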

Concerning your question: As I wrote, I'm new to Keras and Deep Learning. I'm trying to get a feel for different techniques, so I'm playing around a bit and observing the resulting effects and trying to understand the behaviour.

shriphani commented 7 years ago

Sorry if I'm bumping an old thread - is this resolved for you folk?

moof2k commented 7 years ago

Just hit this myself. I think the confusion is that model.load_weights() only loads the model weights; it does not load any of the intermediate state of the optimizer. If you want to completely reload a model, the only option appears to be load_model().

I would propose closing this issue as works-as-designed and updating the FAQ to make this a little bit more clear: if you want to resume training, your best option is load_model().
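
The resume workflow would then be roughly (a sketch, assuming a Keras version that has model.save()/load_model(); X_train/y_train are placeholders):

from keras.models import load_model

# session 1
model.fit(X_train, y_train, nb_epoch=10)
model.save('snapshot.h5')                  # architecture + weights + optimizer state in one file

# session 2
model = load_model('snapshot.h5')          # no need to rebuild or re-compile
model.fit(X_train, y_train, nb_epoch=10)   # continues with the restored optimizer state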

shriphani commented 7 years ago

load_model isn't documented: https://keras.io/models/about-keras-models/

shriphani commented 7 years ago

But it is documented here: https://keras.io/getting-started/faq/

liuaifu commented 7 years ago

I have the same problem on Keras 1.2.0. It was fixed in 1.2.1.

farahanams commented 7 years ago

Is this fixed?

shriphani commented 7 years ago

I've been using the latest version of keras. I can confirm this problem is not fixed even with model.save.

Ethiral commented 7 years ago

Any update?

shalabhsingh commented 7 years ago

model.save_weights() saves only the weights of the model. Instead, try using model.save() and load_model() to save and reload the model respectively, which preserves the entire model state.

model.save("model.h5", overwrite=True)
# ...
model = load_model("model.h5")  # when reloading the model

unnir commented 7 years ago

@shalabhsingh does not help

jorgecarleitao commented 7 years ago

Can someone with this issue please provide a complete and minimal example that reproduces the issue? There are tests in place to check that this does not happen, so we need to understand what is different from those tests to nail it down. Try to use a dataset from Keras, so we can all easily reproduce it.

Thanks!
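
A skeleton of the kind of self-contained repro being asked for might look like this (a sketch only; whoever can reproduce the problem would swap in the model/optimizer that actually fails):

from keras.datasets import mnist
from keras.models import Sequential, load_model
from keras.layers import Dense
from keras.utils import np_utils

(X_train, y_train), _ = mnist.load_data()
X_train = X_train.reshape(-1, 784).astype('float32') / 255.0
Y_train = np_utils.to_categorical(y_train, 10)

model = Sequential()
model.add(Dense(128, input_dim=784, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, nb_epoch=3, batch_size=128)
print(model.evaluate(X_train, Y_train))                   # metrics before saving

model.save('repro.h5')
model = load_model('repro.h5')
print(model.evaluate(X_train, Y_train))                   # should match the line above
model.fit(X_train, Y_train, nb_epoch=1, batch_size=128)   # should not restart from scratch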

phugen commented 7 years ago

@Rocketknight1 Thanks, your posts made me aware I was doing the same thing. A lot of people might have this issue because the code referenced in

https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/

gets exactly this wrong. This code section in RNN_utils.py

data = open(DATA_DIR, 'r').read()
chars = list(set(data))
VOCAB_SIZE = len(chars)

should be something like

data = open(data_dir, 'r').read()
chars = list(set(data))
chars.sort() # SORT THE CHARS so mapping is the same even when restarting the script!
VOCAB_SIZE = len(chars)

instead so that the char mapping is always the same when reading the same file in a new session.

Abhijit-2592 commented 6 years ago

Ran into the same issue. Is it sorted folks?

lionlai1989 commented 6 years ago

I saved a model with model.save(mdl), and load it with the following code, which works great.

    if mdl is None:
        model = Sequential()
        model.add(Dense(256, input_dim=n_input, activation='relu'))
        model.add(Dropout(0.5))
        model.add(Dense(256, activation='relu'))
        model.add(Dropout(0.3))
        model.add(Dense(128, activation='relu'))
        model.add(Dropout(0.1))
        model.add(Dense(n_classes, activation='softmax'))

        model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    else:
        model = load_model(mdl)
        # Then train it with a usual way
        model.fit_generator(...)

FMFluke commented 6 years ago

@lionlai1989 Can you verify saving it in one session and loading it in a different session? Or even loading it and making predictions on an entirely different machine? model.save and load_model do not work for me. If I load the model, the accuracy goes back to as if it had never been trained.

hypnopump commented 6 years ago

Got the same issue. Fuck it, it's not solved. I spent 18 hours training a DenseNet on AWS to get to 89% accuracy on CIFAR-10; the connection was interrupted, but I thought I was safe because I had my model saved every 30 epochs. The truth is that it works for model.test(), but when I try model.fit(), it breaks and reverts to 10% accuracy when it was at 89%. I've lost a day of work due to this shitty issue.