Why reset model.history? / comprehensive model state saving

i3v commented 7 years ago

history

I wonder why each fit(...) call resets the history property (the same happens for fit_generator, of course). On top of that, callbacks.History also resets itself in on_train_begin. It looks as if the lifespan of the History object is intentionally limited to a single fit(...) call, but I don't see why. Perhaps just because it is also the return value?

This behavior looks reasonable for "transfer learning", but it is inconvenient if the user wishes to:

- Save the state of the model to disk, then load it and continue training as if there had been no save-load.
- Train one epoch at a time, without similar side effects.

Personally, I'd like to do both, and I found that it doesn't "just work" in 2.0.4. Currently, AFAIU, there's no convenient out-of-the-box way to do either.

The "history" behavior could be fixed by replacing this line with:

    if initial_epoch==0:
        self.history = cbks.History()

Also, History.on_train_begin should be changed somehow (renamed to __init__?).

So... am I missing something? Would such changes break anything? Or would they break "the overall way things ought to work here"? If so, what is a nice example of initial_epoch use? Is there a chance that a PR implementing this would be accepted?

callbacks

The history property is not the only thing that performs a "reset" in on_train_begin. Most callbacks do that. Even in the docs, on_train_begin is used like __init__ and resets the state to its initial value. This would have to be modified in order to fully support the two use cases above, though these modifications look like a separate (and much larger) piece of work.

Approach 1:
- Callbacks like ReduceLROnPlateau could potentially be patched to automatically restore the correct state in on_train_begin based on model.history (and thus behave as before if the history is empty), instead of blindly resetting their state. This would allow the user to manually save, load and adjust their state, and would also make it possible to add a built-in save_callbacks_state method.
- The user might intentionally want to reset the ReduceLROnPlateau state. This could probably be achieved by simply making the _reset method "public".
Approach 2:
- Most callbacks (e.g. ReduceLROnPlateau) treat the on_train_begin event as "reset, start from scratch". But, say, CSVLogger makes an attempt to "continue": it re-opens the file that was used before. On the other hand, it actually needs this call in order to "continue", because that is where it opens the output file. Thus, we cannot just make calling the on_train_begin methods conditional (as proposed for history above). Nor can we extract the "inner part" of the fit_generator method into a separate method. So it looks like the essence of on_train_begin is not fully clear, and it might be a good idea to split it into on_train_begin + on_train_continue, or into on_train_reset + on_train_continue, to support the use cases discussed above. After that, the "inner part" of fit_generator could be "extracted".
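
To make the "continue instead of reset" idea concrete, here is a minimal sketch in plain Python. It is not a Keras class and does not reproduce ReduceLROnPlateau: the name `ResumablePlateauTracker` and its `get_state`/`set_state`/`reset` methods are hypothetical, and only the hook names `on_train_begin` and `on_epoch_end` mirror the callback interface discussed above.

```python
class ResumablePlateauTracker:
    """Hypothetical plateau tracker whose state survives repeated training runs.

    Not part of Keras; it only borrows the on_train_begin / on_epoch_end
    hook names to show the reset-vs-continue distinction.
    """

    def __init__(self, lr, factor=0.5, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.reset()

    def reset(self):
        # Public reset: the user decides when to start from scratch.
        self.best = float("inf")
        self.wait = 0

    def get_state(self):
        # Everything needed to continue as if training was never interrupted.
        return {"lr": self.lr, "best": self.best, "wait": self.wait}

    def set_state(self, state):
        self.lr = state["lr"]
        self.best = state["best"]
        self.wait = state["wait"]

    def on_train_begin(self):
        # Deliberately does NOT reset, so state carries over between runs.
        pass

    def on_epoch_end(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                # Plateau detected: shrink the learning rate.
                self.lr *= self.factor
                self.wait = 0
```

With this split, a state dict saved after one run can be passed to `set_state` on a fresh tracker, and the patience counter continues where it stopped; calling `reset()` gives the old "start from scratch" behavior on demand.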

The 3 issues I've mentioned in this text make me think that this functionality might be interesting for some users. But a significant amount of changes seems to be required. Or maybe I'm missing something, and there's some easier way?

existing workarounds

- "Individual chunks" of history could easily be concatenated outside keras, if needed.
- Built-in callbacks could be ignored, and only used as an example for creating customized versions that support "comprehensive state save-load" and "one epoch at a time training" or whatever.

Thus, keras is already flexible enough.
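
As a rough illustration of the first workaround, the sketch below concatenates history chunks, assuming each chunk is the `history` dict of the object returned by one fit(...) call (metric name mapped to a list of per-epoch values). The helper name `merge_histories` is made up for this example.

```python
def merge_histories(chunks):
    """Concatenate per-call history dicts into one dict of per-epoch lists."""
    merged = {}
    for chunk in chunks:
        for key, values in chunk.items():
            # Metrics that first appear in a later chunk simply start there.
            merged.setdefault(key, []).extend(values)
    return merged


# Two chunks, as they might come from two consecutive fit(...) calls.
first = {"loss": [0.9, 0.7], "acc": [0.60, 0.70]}
second = {"loss": [0.6], "acc": [0.75]}
full = merge_histories([first, second])
```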

jpeg729 commented 7 years ago

I would simply like to mention that save_model() saves the optimizer state, and it does seem illogical not to save the training history too.

SebastianB12 commented 7 years ago

Does that mean that the optimizer state is currently reset every time I call .fit()? For example: for epoch in range(1000): model.fit(); prediction = model.predict(...);

I'm often doing this to extensively analyze the model's predictions after every epoch. But if I understand that issue correctly, this approach would reset the optimizer after every epoch and therefore would not make use of decay and so on? Or is initial_epoch solving this?

i3v commented 7 years ago

@jpeg729, I do agree with you; however, when I look at the callbacks, I have a feeling that the devs never considered "comprehensive model state saving". So it would be nice to hear from @fchollet (or someone else who's "able to see the big picture"): do they consider this "not yet implemented" or "out of scope of this project"? Is there some roadmap/milestone for this? Or maybe such PRs (first for history, next for callbacks) would be OK for, say, keras-contrib? I have a feeling that this is more an architectural choice than straightforward coding.

@SebastianB12 ,

- The decay effect only depends on optimizer.iterations. You can check that after you call model.fit twice, model.optimizer.iterations.get_value() is doubled. So you're safe, at least if you do not save your model to disk and reload it.
- If you do save/load: the save_model docstring says that the optimizer's state is saved. But it looks like this is not tested in "test_model_saving.py"; as far as I can see, they only test that the model weights are loaded OK and that we're still able to predict(...).
- The thing I'm talking about is that states are not saved (and are even reset on each fit call) for "callback" objects. Thus, if you use ReduceLROnPlateau or something similar, the behavior might be different from what you expect.
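
To see why the decay only cares about the iteration counter, here is a small sketch. It assumes the time-based decay rule used by the Keras 2.x optimizers, lr / (1 + decay * iterations); the function name `decayed_lr` and the numbers are purely illustrative.

```python
def decayed_lr(base_lr, decay, iterations):
    """Effective learning rate under time-based decay (assumed Keras 2.x rule)."""
    return base_lr / (1.0 + decay * iterations)


base_lr, decay = 0.01, 0.01
steps_per_fit = 100  # batches processed by one fit(...) call (made-up number)

# The counter keeps growing across fit(...) calls, so the second call
# starts from the already decayed rate rather than from base_lr.
lr_after_first_fit = decayed_lr(base_lr, decay, steps_per_fit)
lr_after_second_fit = decayed_lr(base_lr, decay, 2 * steps_per_fit)
```

As long as the counter is not set back to zero, splitting training into several fit(...) calls gives the same learning rate schedule as one long call.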

janluke commented 7 years ago

Keras really needs a solution for this issue. It's really surprising that one cannot pause/resume a training loop safely, considering that the training of a model can take days or weeks. OK, one can, if one basically rewrites the training loop and gives up using the callbacks mechanism. But, honestly, I don't see a reason to prefer keras to other high-level frameworks if I cannot use the callbacks.

I think that the root problem is that there's not an abstraction of the training loop, e.g. the Trainer class in Chainer. If you think about it, it's also really ugly to save the optimizer state (and maybe in future the callbacks state and the history) together with the model parameters using a method called "save_model". On the other hand, maybe introducing a "Trainer class" at this point of the development would be a change too big.

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

PythEsc commented 6 years ago

Do you guys know of any news regarding this problem?

I am currently trying to develop a binary classifier with keras that will be trained on some initial data. Then, during runtime, more and more training data becomes available and I want to continue the training with this new data. If I understand correctly, a simple call of .fit() will reset the optimizer's attributes (e.g. learning rate)?

Did you guys find any working solution without rewriting big parts of keras? I really need to find a solution for this issue, otherwise I won't be able to use keras as a framework (even though I am not sure whether another framework would work).

colllin commented 6 years ago

@i3v, did you end up finding a suitable solution? Or have your ideas evolved about what would be useful here?

i3v commented 6 years ago

@colllin, nothing changed, AFAIU. For now I simply:

- avoid built-in callbacks (I copy-pasted those I actually use into my own code, and manage their state/save/load myself)
- train one epoch at a time and manually collect the "history" output

Not too many lines of code to write, but it still feels a bit weird.
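
A minimal sketch of the "one epoch at a time" loop could look like this. It assumes only that `model.fit` accepts `epochs` and `initial_epoch` and returns an object with a `history` dict, as Keras 2 does; the function name and the `after_epoch` hook are invented for this example, and this is not the actual code from the comment above.

```python
def fit_epoch_by_epoch(model, x, y, total_epochs, start_epoch=0, after_epoch=None):
    """Run fit(...) one epoch per call and keep every returned history chunk."""
    chunks = []
    for epoch in range(start_epoch, total_epochs):
        # initial_epoch=epoch together with epochs=epoch + 1 runs exactly one
        # epoch while keeping the epoch numbering continuous across calls.
        result = model.fit(x, y, epochs=epoch + 1, initial_epoch=epoch, verbose=0)
        chunks.append(result.history)
        if after_epoch is not None:
            # Hypothetical hook: predict, save, or update custom callback state.
            after_epoch(epoch, result.history)
    return chunks
```

Passing a non-zero `start_epoch` (for instance, one restored from disk) lets the loop pick up the numbering where an earlier run stopped.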

ViaFerrata commented 6 years ago

I think the question by @SebastianB12 also hasn't been answered yet, right?

Does Keras reset the optimizer if you call any of the fit functions in a loop instead of using the epochs argument? If yes, I'd be really surprised that training my nets worked fine so far..

SebastianB12 commented 6 years ago

@ViaFerrata : The question was already answered by @i3v. I also double checked it, and Keras does indeed not reset the optimizer as long as you do not save and reload the model. But thanks for double checking on that as well!

pGit1 commented 6 years ago

I hate Keras. I love Keras.

Should be a simple parameter in the .fit command, i.e. (weight_state=OneOf['reset after successive calls', 'dont reset'], optimizer_state=OneOf['reset', 'dont reset']). This way people could re-initialize the weights after a training loop or not, and do the same for the optimizer state, without having to do hacky stuff like creating variables in memory and re-instantiating after each loop iteration, saving to disk and reloading, etc.

Still an awesome if not the best DL lib out there in my opinion.

guifereis commented 6 years ago

This is still an issue for me as well. Obviously it is not particularly difficult to work around the history issues (I just write a function to wrap around .fit or .fit_generator), but it does seem somewhat essential functionality and so a bit boggling Keras does not choose to adopt it, especially (imo) in keeping things pythonic and non-hacky -- and this must be a part of any successful DL library, as Keras has already shown in other areas.

r8drascal commented 6 years ago

@SebastianB12 So it does reset when you save and load. How can I continue training after loading a model if I'm using an Adam optimizer?

SebastianB12 commented 6 years ago

@r8drascal : I did not try that. But from the other comments above I assumed that Keras really resets the optimizer when the model is saved/loaded. Unfortunately, I currently do not have the time to test it. However, if I understand the newest FAQ and the save_model method correctly, it kind of saves the optimizer state (https://github.com/keras-team/keras/blob/master/keras/models.py): `

    if include_optimizer and hasattr(model, 'optimizer'):
        if isinstance(model.optimizer, optimizers.TFOptimizer):
            warnings.warn(
                'TensorFlow optimizers do not '
                'make it possible to access '
                'optimizer attributes or optimizer state '
                'after instantiation. '
                'As a result, we cannot save the optimizer '
                'as part of the model save file.'
                'You will have to compile your model again '
                'after loading it. '
                'Prefer using a Keras optimizer instead '
                '(see keras.io/optimizers).')
        else:
            f.attrs['training_config'] = json.dumps({
                'optimizer_config': {
                    'class_name': model.optimizer.__class__.__name__,
                    'config': model.optimizer.get_config()
                },
                'loss': model.loss,
                'metrics': model.metrics,
                'sample_weight_mode': model.sample_weight_mode,
                'loss_weights': model.loss_weights,
            }, default=get_json_type).encode('utf8')

            # Save optimizer weights.
            symbolic_weights = getattr(model.optimizer, 'weights')
            if symbolic_weights:
                optimizer_weights_group = f.create_group('optimizer_weights')
                weight_values = K.batch_get_value(symbolic_weights)
                weight_names = []
                for i, (w, val) in enumerate(zip(symbolic_weights,
                                                 weight_values)):
                    # Default values of symbolic_weights is /variable
                    # for Theano and CNTK
                    if K.backend() == 'theano' or K.backend() == 'cntk':
                        if hasattr(w, 'name'):
                            if w.name.split('/')[-1] == 'variable':
                                name = str(w.name) + '_' + str(i)
                            else:
                                name = str(w.name)
                        else:
                            name = 'param_' + str(i)
                    else:
                        if hasattr(w, 'name') and w.name:
                            name = str(w.name)
                        else:
                            name = 'param_' + str(i)
                    weight_names.append(name.encode('utf8'))
                optimizer_weights_group.attrs['weight_names'] = weight_names
                for name, val in zip(weight_names, weight_values):
                    param_dset = optimizer_weights_group.create_dataset(
                        name,
                        val.shape,
                        dtype=val.dtype)
                    if not val.shape:
                        # scalar
                        param_dset[()] = val
                    else:
                        param_dset[:] = val

`

Can anyone more knowledgeable than me confirm that?

r8drascal commented 6 years ago

@SebastianB12 That's strange.. my model does not seem to work even though I have the latest versions of keras (2.1.5) and tensorflow (1.6.0).

    from keras.models import load_model
    from keras.optimizers import Adam

    opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
    rnn_model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
    rnn_model.fit(X, Y, batch_size=5, epochs=20)
    rnn_model.save('./models/my_model.h5')

    # This predicts correctly
    model = load_model('./models/my_model.h5')
    model.predict(x)

    # This does NOT predict correctly
    model = load_model('./models/my_model.h5')
    model.fit(X, Y, batch_size=5, epochs=1)
    model.predict(x)

pGit1 commented 6 years ago

The second model does not predict correctly because the weights are getting updated after the one epoch of training...

Are you expecting the weights to be fixed despite training?

r8drascal commented 6 years ago

@pGit1 By not predicting correctly I mean the predictions are completely off as if it's an untrained model. I understand how gradient descent works.

r8drascal commented 6 years ago

Update

I haven't figured out the root of the problem. But it seems that the model that I was loading was saved on Keras 2.0.6 and I am loading it on to Keras 2.1.5. Something with the "save_weights" and "load_weights" functions was not working, so I had to load the weights layer by layer on an architecture I built from scratch manually (loading the architecture from the saved model using json worked as well):

    # Copy the weights layer by layer from the loaded model to the rebuilt one.
    for layer_loaded, layer_built in zip(loaded_model.layers, built_model.layers):
        layer_built.set_weights(layer_loaded.get_weights())

plaffitte commented 6 years ago

@r8drascal Wait so the example you gave in your previous comment above was using the model saved on Keras 2.0.6? Did you get a chance to try again with a model compiled with Keras 2.1.5 ?

r8drascal commented 6 years ago

@plaffitte Sorry, I wasn't clear. Basically, I loaded the old model and saved it in Keras 2.1.5 and reloaded the new one, which wasn't working. This would've been the full code structure--I missed the loading of the old model in the first line.

    from keras.models import load_model
    from keras.optimizers import Adam

    # Load the Keras 2.0.6 model
    rnn_model = load_model('old_model.h5')
    opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
    rnn_model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
    rnn_model.fit(X, Y, batch_size=5, epochs=20)
    rnn_model.predict(x)  # The predictions look decent

    rnn_model.save('./models/my_model.h5')  # Saved with Keras 2.1.5

    # These predictions are of the same quality as before
    model = load_model('./models/my_model.h5')  # Load the Keras 2.1.5 model
    model.predict(x)

    # These predictions are way off, as if it's a completely untrained model
    model = load_model('./models/my_model.h5')  # Load the Keras 2.1.5 model
    model.fit(X, Y, batch_size=5, epochs=1)
    model.predict(x)

plaffitte commented 6 years ago

@r8drascal Have you tried printing out the values of the learning rate (with the get_value() method or smth I guess) ? I've had a model running for the past couple of days so am not able to test it myself sorry...

r8drascal commented 6 years ago

@plaffitte The original model had a learning rate of 0.01. However, the optimizer I'm compiling with is 0.0001 as indicated in my code. I tried compiling without defining the optimizer (i.e. by using the loaded model optimizer) and the results were worse. What's strange is that when I run the program on my course server (deeplearning.ai Coursera), which is using Keras 2.0.7, everything runs perfectly with the above code.

pGit1 commented 6 years ago

Super weird. Maybe a bug?

jasoriya commented 6 years ago

@i3v Can you share a snippet of your custom callback?

CMCDragonkai commented 6 years ago

I accumulate the history on each fit call into a histories variable.

It sort of makes sense that a single "fit", which has a number of epochs, has its own history, because the next time you call fit you may have changed characteristics of the model, like unfreezing layers, etc.

But I would like a comprehensive checkpoint-restore style in Keras, where you can stop training with a SIGINT and then restart the training again exactly where it left off, preserving the epoch index, the training weights, the optimiser state, the hyperparameters, etc.
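
One piece of such a checkpoint that Keras does not store is the epoch index and the accumulated history. A rough sketch of keeping them in a small JSON file next to the model file follows; the function names and the file layout are assumptions for illustration, and the weights and optimizer state would still have to go through the usual model save/load.

```python
import json
import os
import tempfile


def save_training_state(path, epoch, history):
    """Write the epoch index and accumulated history to a JSON sidecar file."""
    # History values may be NumPy scalars, so convert them to plain floats.
    plain = {key: [float(v) for v in values] for key, values in history.items()}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first and rename it afterwards, so that an
    # interrupt in the middle of the write cannot corrupt the previous state.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"epoch": epoch, "history": plain}, f)
    os.replace(tmp_path, path)


def load_training_state(path):
    """Return (epoch, history), or (0, {}) when no state file exists yet."""
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        state = json.load(f)
    return state["epoch"], state["history"]
```

The epoch returned by `load_training_state` could then be passed as `initial_epoch` to the next fit(...) call, with the loaded history extended by whatever that call returns.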

cseas commented 5 years ago

Can someone please explain in detail how to resume training on a partially trained model when using the ReduceLROnPlateau callback?

Say, I trained the model to 50 epochs, then saved the model using the ModelCheckpoint callback as an hdf5 file.

Now I want to resume training from epoch 51 by using load_model() and then model.fit(). The problem is I'm getting unexpected accuracies on resuming training, which is probably because the callbacks are reset on a model save/load.

I've found references as to how a histories variable can be used to restore the callbacks and the last learning rate used, but I'm not sure how to do it. Can someone give an example?

CognitiveClouds-Prasad commented 5 years ago

Is it even possible to do online learning in Keras? I am not very sure (even though it is 2019).

dorukkarinca commented 4 years ago

I know the issue is kind of stale, but I wrote a wrapper that periodically saves and auto-concatenates the model history dict from a pickle file, as well as the last epoch number and model weights. I'd love to hear your thoughts.

    pip install keras-buoy

https://github.com/dorukkarinca/keras-buoy

Sreerag-ibtl commented 3 years ago

Is CSVLogger callback an alternative for this? Or do we still need to manually pickle and append the history object for smooth plotting?

Sreerag-ibtl commented 3 years ago

> Can someone please explain in detail how to resume training on a partially trained model when using the ReduceLROnPlateau callback?
>
> Say, I trained the model to 50 epochs, then saved the model using the ModelCheckpoint callback as an hdf5 file.
>
> Now I want to resume training from epoch 51 by using load_model() and then model.fit(). The problem is I'm getting unexpected accuracies on resuming training, which is probably because the callbacks are reset on a model save/load.
>
> I've found references as to how a histories variable can be used to restore the callbacks and the last learning rate used, but I'm not sure how to do it. Can someone give an example?

I had the same doubt. From my understanding, Keras can resume training from the latest learning rate. Here's the StackOverflow thread discussing the issue.
