sogou / SogouMRCToolkit

This toolkit was designed for the fast and efficient development of modern machine comprehension models, including both published models and original prototypes.
Apache License 2.0

FailedPreconditionError: Error Loading Models #30

Closed henryfriedlander closed 5 years ago

henryfriedlander commented 5 years ago

Hi,

Thank you very much for your code. I have been able to replicate your results on many datasets using the model.train_and_evaluate() method. However, when I try to save and then load a model, I get an error. I first hit it when saving and evaluating the BertCoQA model, but I also get it when running the code from the model_save_load.md tutorial unchanged.

Below is the error thrown (here is a pastebin with the full error if that would be helpful).

FailedPreconditionError (see above for traceback): Attempting to use uninitialized value eval_metrics/mean/count [[node eval_metrics/mean/AssignAdd_1 (defined at /juicier/scr126/scr/hnf035/fresh/SMRCToolkit/sogou_mrc/model/bert_coqa.py:199) = AssignAdd[T=DT_FLOAT, use_locking=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](eval_metrics/mean/count, eval_metrics/mean/ToFloat, ^add_8)]]

Thank you very much for the help!

henryfriedlander commented 5 years ago

I was able to solve the problem. It comes down to how TensorFlow handles variables in the LOCAL_VARIABLES collection. By default, tf.train.Saver only saves and restores global variables, so the local variables that back the evaluation metrics (such as eval_metrics/mean/count) are still uninitialized after a model is restored (here and here are relevant SO posts). A quick fix is to call model.session.run(tf.local_variables_initializer()) before the line model.evaluate(test_batch_generator, evaluator). I have submitted a pull request at #31 with the change.

However, this change is less than ideal, because every user who loads a model has to remember the extra line. I have noticed that the loss-related local variables are exactly the same in every model. I would propose moving the loss code into base_model.py's _build_graph function. You could then initialize the local variables directly inside base_model.py's load function, which hides that perhaps unintuitive line from the user.
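
A rough sketch of that proposal, under stated assumptions: `SketchBaseModel` and its members are hypothetical and are not the toolkit's actual `base_model.py`; the `scale` variable exists only so the saver has something to save; and `tensorflow.compat.v1` is assumed to be available. The point is only that `load` can restore the checkpoint and initialize the local variables in one place.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

tf.disable_eager_execution()


class SketchBaseModel(object):
    """Hypothetical stand-in for a base model, for illustration only."""

    def __init__(self):
        self.graph = tf.Graph()
        with self.graph.as_default():
            self._build_graph()
            self.saver = tf.train.Saver()
            self._global_init = tf.global_variables_initializer()
            self._local_init = tf.local_variables_initializer()
        self.session = tf.Session(graph=self.graph)

    def _build_graph(self):
        # Shared loss bookkeeping lives in the base class.
        self.loss = tf.placeholder(tf.float32, shape=[], name="loss")
        self.scale = tf.Variable(1.0, name="scale")  # placeholder "weights"
        with tf.variable_scope("eval_metrics"):
            self.mean_loss, self.mean_loss_update = tf.metrics.mean(
                self.loss * self.scale)

    def init(self):
        self.session.run([self._global_init, self._local_init])

    def save(self, path):
        self.saver.save(self.session, path)

    def load(self, path):
        # Restore saved (global) variables, then reset the metric state
        # so callers never have to run the local initializer themselves.
        self.saver.restore(self.session, path)
        self.session.run(self._local_init)


ckpt_path = os.path.join(tempfile.mkdtemp(), "sketch.ckpt")

trained = SketchBaseModel()
trained.init()
trained.save(ckpt_path)
trained.session.close()

loaded = SketchBaseModel()
loaded.load(ckpt_path)
mean_after_load = float(
    loaded.session.run(loaded.mean_loss_update, {loaded.loss: 2.0}))
loaded.session.close()
print("mean loss after load:", mean_after_load)
```

With this layout, evaluation code can call `load` and go straight to the metric updates.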