ufal / neuralmonkey

An open-source tool for sequence learning in NLP built on TensorFlow.
BSD 3-Clause "New" or "Revised" License

Continue training #731

Closed kocmitom closed 6 years ago

kocmitom commented 6 years ago

Hi, I need to continue a training run for the purpose of adaptation. What is the easiest way to do so, considering that I need to keep all training parameters (especially the global step and the Adam state)?

jindrahelcl commented 6 years ago

It seems that right now the Adam variables get stored, as well as the global step. You can check this by running a training, then continuing it, and printing out the variables before and after using tf.Print.
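A minimal TF 1.x sketch of that check (assuming the standard global step created via tf.train; tf.Print is an identity op whose side effect is logging the listed tensors to stderr):

import tensorflow as tf

# Fetch (or create) the standard global step variable.
global_step = tf.train.get_or_create_global_step()

# Evaluating step_with_print behaves like evaluating global_step, but
# also logs its current value.
step_with_print = tf.Print(global_step, [global_step],
                           message="current global step: ")

Run the training, note the printed value, stop, continue, and compare: if the checkpoint restores the global step, the value picks up where it left off instead of starting from zero.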

jindrahelcl commented 6 years ago

Alternatively, you can print out the variables that are being stored during the initialization of the saver, here: https://github.com/ufal/neuralmonkey/blob/master/neuralmonkey/tf_manager.py#L90
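By default a tf.train.Saver covers all global variables, which include the Adam slot variables and the global step, so a sketch of that printout could look like this:

import tensorflow as tf

saver = tf.train.Saver()  # with no arguments, saves tf.global_variables()
for var in tf.global_variables():
    print(var.name, var.shape)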

kocmitom commented 6 years ago

So I must be doing something wrong. I only added one line to the [main] section of my configuration:

initial_variables="baseline_model/variables.data.index"

This gives me warnings that some variables are not in the checkpoint, and based on the log, the global step is not set. I also tried providing the other variables.data.* files.

varisd commented 6 years ago

Drop the ".index" suffix.
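That is, the configuration line should point at the checkpoint prefix, without the .index suffix (or any of the other variables.data.* suffixes):

initial_variables="baseline_model/variables.data"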

jindrahelcl commented 6 years ago

What do you mean by "based on the log"? The global step is not logged; you must use tf.Print to get its value. The list of variables and their shapes in the log is a list of the trainable variables, not of all global variables. All global variables get stored.
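The distinction can be checked directly; a short TF 1.x sketch:

import tensorflow as tf

# Trainable variables are what the training log lists; global variables
# are what the saver stores. The Adam slots and the global step are
# global but not trainable.
trainable = {v.name for v in tf.trainable_variables()}
global_vars = {v.name for v in tf.global_variables()}
print("stored but not listed in the log:", sorted(global_vars - trainable))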

kocmitom commented 6 years ago

Perfect, thank you. Now it doesn't warn about missing variables. By "log" I meant the TensorBoard output and the fact that performance had dropped, either due to a too-high learning rate or due to the variables not being loaded correctly.

Now it works.

jindrahelcl commented 6 years ago

What works? Do the global step and the Adam variables get loaded?

kocmitom commented 6 years ago

The global step looks like it got loaded, and I suppose the Adam variables did too. The only (cosmetic) problem is that TensorBoard does not start at the correct step but from zero. Couldn't it be augmented so that whenever the global step is loaded, the step variable in learning_utils also starts from its value? But this is only for better visualization.
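A hypothetical sketch of that augmentation (the session and counter names are assumptions, not the actual learning_utils code):

# After restoring the checkpoint, seed the training loop's Python-side
# counter from the restored TensorFlow global step instead of from zero,
# so TensorBoard summaries continue at the right x-axis position.
step = session.run(tf.train.get_global_step())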

kocmitom commented 6 years ago

I have checked it, and I can confirm that the global step, as well as the whole Adam state, is stored in the checkpoint.
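One way to verify this without tf.Print is to list the checkpoint contents directly; a sketch, using the checkpoint prefix from the example above:

import tensorflow as tf

# Expect a "global_step" entry plus "<variable>/Adam" and
# "<variable>/Adam_1" slot entries next to each trainable variable.
for name, shape in tf.train.list_variables("baseline_model/variables.data"):
    print(name, shape)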