juglab / n2v

This is the implementation of Noise2Void training.

[Feature request] Saving model checkpoints during training #103

Open · nicesetup opened this issue 3 years ago

nicesetup commented 3 years ago

Hello everyone,

I am currently trying to use n2v 3D within the ZeroCostDL4Mic framework to denoise fluorescence images of calcium signals, in order to make them usable for quantitative analysis. To better understand the training process, I tried to create a loop that pauses training every couple of epochs and creates an intermediate prediction. Unfortunately, storing the weights and reloading them degrades the model's performance, and training that way is not equivalent to training in one go (see this issue for details: https://github.com/HenriquesLab/ZeroCostDL4Mic/issues/50).

I would therefore like to realize the suggestion Romain Laine made in that issue: saving model checkpoints with a Keras callback. That way, I could create "intermediate results" from the saved checkpoints later on, without having to interrupt the training process. The N2V class already seems to offer a way of saving checkpoints (code lines 280-284). Would it be sufficient to configure the callback so that not just weights_best is saved, but also the current weights at the end of each epoch? Or would I have to adapt the code further to achieve such functionality? Furthermore, would it be possible to name those checkpoints adequately, for example by passing a format string like weights.{epoch:02d}-{val_loss:.2f}.h5 to the config object? Any help implementing this would be greatly appreciated, since I am not really familiar with Keras.
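For illustration, this is roughly what I have in mind: a minimal sketch using a plain Keras ModelCheckpoint. How it would be hooked into the N2V training loop (model.prepare_for_training() / model.callbacks) is only an assumption based on the CSBDeep-style API that n2v builds on, so please correct me if the wiring looks different.

```python
# Minimal sketch, assuming a TensorFlow-backed Keras; older setups may need
# "from keras.callbacks import ModelCheckpoint" instead.
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint_cb = ModelCheckpoint(
    filepath='checkpoints/weights.{epoch:02d}-{val_loss:.2f}.h5',
    save_weights_only=True,   # current weights only, no optimizer state
    save_best_only=False,     # one file per epoch, not just the best weights
    verbose=1,
)

# How the callback reaches the training loop is an assumption (CSBDeep-style
# API: prepare_for_training() builds model.callbacks); the actual code may
# need a small change in train() to accept extra callbacks.
# model.prepare_for_training()
# model.callbacks.append(checkpoint_cb)
# history = model.train(X, X_val)
```

As far as I understand, the key difference to the existing weights_best checkpoint would simply be save_best_only=False plus the epoch/val_loss placeholders in the file name.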

Thank you very much and best regards!

fjug commented 3 years ago

Hi @nicesetup! @alex-krull is currently very busy moving the center of his life to the UK, where he is starting his own group, so I'm sorry that our support is taking a bit longer at the moment. Maybe @tibuch will find a few minutes to read through the ZeroCost issue you posted above and has an additional idea about what might be causing this.

I have one random thought and a question for now:

Best, Florian

nicesetup commented 3 years ago

Hi Florian,

thank you very much for your response!

I did some further research and think your random thought is most likely correct: the optimizer is what causes this behaviour. To pause training and resume from that exact state later on, it would therefore be necessary to checkpoint not just the weights, but the entire model, including the state of the optimizer. Do you think this is feasible to implement in your N2V code? Alternatively, saving weights-only checkpoints during training would already be helpful.

The code I used is the one by the ZeroCostDL4Mic people and can be found here - I was training for a certain number of epochs, made a prediction after training completed, and then reloaded the model using the "Using weights from a pre-trained model as initial weights" section. However, we are currently setting up to use your code from this repository on local hardware, independent of ZeroCostDL4Mic. So a hint on how to correctly store checkpoints with your implementation would be highly appreciated!
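To make the distinction concrete, here is a hedged sketch of weights-only versus full-model checkpoints in plain Keras. model.keras_model is an assumption about how n2v exposes the underlying Keras model (following the CSBDeep convention), and the file names are just placeholders.

```python
# Sketch only; "model" is assumed to be an n2v N2V instance whose underlying
# Keras model is exposed as model.keras_model (CSBDeep convention).

# Weights-only checkpoint: enough for prediction, but the optimizer state
# (e.g. Adam's moment estimates) is lost, so resumed training behaves
# differently from uninterrupted training.
model.keras_model.save_weights('weights_epoch_050.h5')

# Full-model checkpoint: architecture + weights + optimizer state, so training
# could in principle be resumed as if it had never been interrupted.
model.keras_model.save('model_epoch_050.h5')

# Note: n2v compiles with a custom masked loss, so reloading the full model
# with keras.models.load_model would need custom_objects (or compile=False
# followed by re-compiling).
```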

Thank you very much and best regards.

Update: When training on a remote machine, whether on a cluster or via Google Colab, one is usually restricted by a certain allowance of computation time (e.g. a maximum of 12 hours in Colab). To alleviate this restriction, it would be very helpful if there were an option to checkpoint not just the weights, but also the entire model, including the state of the optimizer, as mentioned above. Keras features a callback that, as far as I understand, does exactly this, and it might be implemented similarly to this example. Would it be possible for you to add such a feature, perhaps including an argument for the training script that allows callbacks to be configured (choosing to save model checkpoints, weight checkpoints, or both) in a simple manner? Your input would be highly appreciated!
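For reference, the kind of callback I mean could look roughly like the sketch below. The ModelCheckpoint itself is standard Keras; how it would be passed into model.train(), and the custom_objects needed when reloading, are assumptions on my side.

```python
# Hypothetical sketch: write a full-model checkpoint (including optimizer
# state) at the end of every epoch, then resume in a later session.
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

full_ckpt = ModelCheckpoint(
    filepath='checkpoints/model.{epoch:03d}.h5',
    save_weights_only=False,   # keep the optimizer state as well
    save_freq='epoch',         # older Keras versions use period=1 instead
)

# First session (until the time limit is reached):
# model.callbacks.append(full_ckpt)   # assumes the callback list is accessible
# model.train(X, X_val)

# Later session: restore and continue where training stopped.
# keras_model = load_model('checkpoints/model.080.h5',
#                          custom_objects={...})  # n2v's custom loss/metrics
# keras_model.fit(..., initial_epoch=80, epochs=200)
```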