Closed by antonyscerri 6 years ago
If this happens, training has already stopped. However, it can only happen after max_epochs (a parameter of the training configuration dict when using train_tensorflow.py directly) is reached. This means that after max_epochs epochs the model has never improved over 0.0 F1, so I would guess that max_epochs is set to something very low, like 1.
Right now it's set to 10 epochs. In this case we probably have insufficient or poor data for it to do anything with. However, if it breaks with such an exception, it is hard to see what's happening. I patched my local copy to check whether the final model directory exists before trying to load it and evaluate against it, and to output a different message otherwise.
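For anyone hitting the same thing, the check described above might look roughly like the sketch below. This is not the actual patch: the helper name load_best_model_if_saved and the message text are made up for illustration, and the only thing taken from the traceback is that the reader exposes load(path).

```python
import os


def load_best_model_if_saved(reader, save_dir):
    """Load the saved model only if its directory exists.

    Hypothetical helper: `reader` is any object with a load(path) method
    (the traceback shows reader.load(save_dir) being called).
    Returns True if the model was loaded, False if nothing was saved.
    """
    if not os.path.isdir(save_dir):
        # No checkpoint was ever written, e.g. because the dev metric
        # never improved over its initial value during training.
        print("No saved model found in %s; skipping final evaluation." % save_dir)
        return False
    reader.load(save_dir)
    return True
```

The caller can then skip the final dev-set evaluation when this returns False instead of failing inside the TensorFlow saver.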
Perfect. The best solution to avoid this issue is to save the initial model before training. This will be fixed soon. Thanks for the catch.
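A minimal sketch of the idea of saving the initial model before training, so that a checkpoint always exists when it is loaded afterwards. This is not the fix that went into jack: the function and its callback parameters (save_fn, train_epoch_fn, eval_fn) are hypothetical stand-ins for whatever the training loop actually uses.

```python
def train_with_initial_checkpoint(save_fn, train_epoch_fn, eval_fn,
                                  max_epochs, initial_best=0.0):
    """Train for max_epochs, keeping the best model on disk.

    All three callbacks are assumptions made for this sketch:
      save_fn()         writes the current model to the save directory
      train_epoch_fn(i) runs one training epoch
      eval_fn()         returns the dev metric (e.g. F1) as a float
    """
    # Save the untrained model first, so a later load cannot fail even
    # if the metric never improves over initial_best.
    save_fn()
    best = initial_best
    for epoch in range(max_epochs):
        train_epoch_fn(epoch)
        metric = eval_fn()
        if metric > best:
            # Only overwrite the checkpoint on a strict improvement.
            best = metric
            save_fn()
    return best
```

With this ordering, a run whose F1 stays at 0.0 still leaves the initial model on disk, so the final load and evaluation step has something to restore.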
Hi
It appears that when using train_reader (at least with the FastQA reader), if training fails to improve upon the initial best-metric F1 of 0.0, it throws the following exception. It's due to the model module never being saved, followed by an attempt to restore it to run the eval on the dev set.
Tony
Traceback (most recent call last):
  File "build_models.py", line 135, in <module>
    train_reader.train(fastqa_reader,train,None,test,trainer_config)
  File ".../jack/jack/train_reader.py", line 21, in train
    train_tensorflow(reader, train_data, test_data, dev_data, configuration, debug)
  File ".../jack/jack/train_reader.py", line 112, in train_tensorflow
    reader.load(save_dir)
  File ".../jack/jack/core/reader.py", line 185, in load
    self.model_module.load(os.path.join(path, "model_module"))
  File "...jack/jack/core/tensorflow.py", line 169, in load
    self._saver.restore(self.tf_session, path)
  File ".../lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1717, in restore