If training fails to increase F1 above 0.0 exception thrown

antonyscerri commented 6 years ago

Hi

It appears that when using train_reader (at least with the FastQA reader), if training fails to improve upon the initial best-metric F1 of 0.0, it throws the following exception. It's due to the model module never being saved, and the code then trying to restore it to run the eval on the dev set.

Tony

Traceback (most recent call last):
  File "build_models.py", line 135, in <module>
    train_reader.train(fastqa_reader,train,None,test,trainer_config)
  File ".../jack/jack/train_reader.py", line 21, in train
    train_tensorflow(reader, train_data, test_data, dev_data, configuration, debug)
  File ".../jack/jack/train_reader.py", line 112, in train_tensorflow
    reader.load(save_dir)
  File ".../jack/jack/core/reader.py", line 185, in load
    self.model_module.load(os.path.join(path, "model_module"))
  File "...jack/jack/core/tensorflow.py", line 169, in load
    self._saver.restore(self.tf_session, path)
  File ".../lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1717, in restore
    compat.as_text(save_path))
ValueError: The passed save_path is not a valid checkpoint: ./models/model1/model_module

dirkweissenborn commented 6 years ago

If this happens, training has already stopped. However, it can only happen once max_epochs (a parameter of the training configuration dict when using train_tensorflow.py directly) is reached, which means the model never improved over 0.0 F1 in any of those epochs. So I would guess that max_epochs is set to something very low, like 1.

antonyscerri commented 6 years ago

Right now it's set to 10 epochs. In this case we probably have insufficient or poor data for it to do anything with. However, if it breaks with such an exception, it makes it hard to see what's happening. I patched my local copy to check whether the final model directory exists before trying to load it and evaluate against it, and to output a different message if it does not.
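For anyone hitting the same thing, the guard can be sketched roughly as below. This is a minimal illustration, not the actual patch: the helper name load_best_or_warn and the message text are hypothetical, and the only thing assumed about the reader is that it has a load(path) method, as the traceback shows.

```python
import os


def load_best_or_warn(reader, save_dir):
    """Restore the best checkpoint if one was written, otherwise explain why not.

    Hypothetical helper, not part of jack. `reader` is any object that
    exposes a `load(path)` method.
    Returns True if a model was loaded, False if nothing had been saved.
    """
    # A checkpoint is only written when the dev metric improves, so the
    # directory may be missing or empty if training never beat the baseline.
    if os.path.isdir(save_dir) and os.listdir(save_dir):
        reader.load(save_dir)
        return True

    print("No model was saved to %s: the dev metric never improved on its "
          "initial value, so the final evaluation is skipped." % save_dir)
    return False
```

The caller would then run the final dev-set evaluation only when this returns True.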

dirkweissenborn commented 6 years ago

Perfect. The best solution to avoid this issue is to save the initial model before training. This will be fixed soon. Thanks for the catch.
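The idea of saving the initial model first can be sketched as a generic loop. This is only an illustration under stated assumptions, not the code in train_reader.py: the function name and the evaluate, train_epoch and save callbacks are hypothetical stand-ins for the real reader and training hooks.

```python
def train_with_initial_checkpoint(evaluate, train_epoch, save, max_epochs):
    """Train for max_epochs, keeping a checkpoint of the best model so far.

    Hypothetical sketch: `evaluate()` returns the dev metric (e.g. F1),
    `train_epoch()` runs one epoch, and `save()` writes a checkpoint.
    """
    best = evaluate()
    # Save the untrained model up front, so a valid checkpoint exists to
    # restore even if the metric never improves during training.
    save()

    for _ in range(max_epochs):
        train_epoch()
        score = evaluate()
        if score > best:
            best = score
            save()  # overwrite the checkpoint with the improved model

    return best
```

With this ordering, restoring from the save directory after training succeeds even when F1 stays at 0.0 for every epoch.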

uclnlp / jack

If training fails to increase F1 above 0.0 exception thrown #391