kubeflow / code-intelligence

ML-Powered Developer Tools, using Kubeflow
https://medium.com/kubeflow/reducing-maintainer-toil-on-kubeflow-with-github-actions-and-machine-learning-f8568374daa1?source=friends_link&sk=ac77444f00c230e7d787edbfb0081918

FailedPreconditionError op not initialized #89

Open jlewi opened 4 years ago

jlewi commented 4 years ago

From #70; I'm observing the following errors when running the inference model in pubsub workers.

The first couple of predictions succeed but then it starts failing.

This looks like a threading issue. The first successful predictions happen in one thread and the failed predictions happen in another thread. I logged the thread number to confirm this.
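
For reference, the thread logging was roughly the following; this is a sketch rather than the exact change, and the logger/callback placement is an assumption:

    import logging
    import threading

    # Inside the pubsub worker callback, just before invoking the predictor,
    # record which thread is handling the message so successful and failing
    # predictions can be correlated with the thread that served them.
    logging.info("Running prediction in thread %s (id=%s)",
                 threading.current_thread().name, threading.get_ident())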

Not sure why we didn't observe this in the original code, or what's different about my code: https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/utils.py

   Traceback (most recent call last):
    File "/py/label_microservice/worker.py", line 145, in callback
      predictions = self._predictor.predict(data)
    File "/py/label_microservice/issue_label_predictor.py", line 152, in predict
      model_name=data.get("model_name"))
    File "/py/label_microservice/issue_label_predictor.py", line 114, in predict_labels_for_issue
      model_name, data.get("title"), data.get("body"))
    File "/py/label_microservice/issue_label_predictor.py", line 74, in predict_labels_for_data
      predictions = model.predict_issue_labels(title, body)
    File "/py/label_microservice/combined_model.py", line 34, in predict_issue_labels
      latest = m.predict_issue_labels(title, text)
    File "/py/label_microservice/universal_kind_label_model.py", line 84, in predict_issue_labels
      probs = self.model.predict(x=[vec_body, vec_title]).tolist()[0]
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 908, in predict
      use_multiprocessing=use_multiprocessing)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 723, in predict
      callbacks=callbacks)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 394, in model_iteration
      batch_outs = f(ins_batch)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
      run_metadata=self.run_metadata)
    File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
      run_metadata_ptr)
    tensorflow.python.framework.errors_impl.FailedPreconditionError: Error while reading resource variable dense_5/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/dense_5/bias/N10tensorflow3VarE does not exist.
issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.89. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

jlewi commented 4 years ago

Ref: keras-team/keras#5640

jlewi commented 4 years ago

It looks like doing the following might fix it

    with self._graph.as_default() as graph:
        with tf.Session(graph=graph) as sess:
            init = tf.global_variables_initializer()
            sess.run(init)
            probs = self.model.predict(x=[vec_body, vec_title]).tolist()[0]
kf-label-bot-dev[bot] commented 4 years ago

Issue Label Bot is not confident enough to auto-label this issue. See dashboard for more details.

jlewi commented 4 years ago

I'm not convinced that actually worked; my suspicion is that the trained weights are no longer loaded and we are predicting with random weights.
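
Here's a standalone sketch of what I think is happening (toy model and path, not the bot's code): loading a Keras model and then running global_variables_initializer() in the same graph/session overwrites the trained weights with fresh random ones.

    import numpy as np
    import tensorflow.compat.v1 as tf

    tf.disable_eager_execution()

    graph = tf.Graph()
    with graph.as_default():
        sess = tf.Session(graph=graph)
        tf.keras.backend.set_session(sess)

        # "model.h5" is a placeholder for any trained, single-input Keras
        # model with a fixed input shape.
        model = tf.keras.models.load_model("model.h5")
        x = np.random.rand(1, *model.input_shape[1:]).astype("float32")

        before = model.predict(x)
        sess.run(tf.global_variables_initializer())  # re-initializes every variable
        after = model.predict(x)

        # The outputs differ because the initializer wiped the loaded weights.
        print("outputs unchanged:", np.allclose(before, after))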

kf-label-bot-dev[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

| Label | Probability |
| ----- | ----------- |
| kind/bug | 0.89 |

Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback! Links: app homepage, dashboard and code for this bot.

jlewi commented 4 years ago

Yeah, it looks like that wasn't loading the actual weights. As soon as I changed it to load the model on each predict call, I started getting much better results.

As a hack, just reload the model on every predict call.
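
Roughly, the hack looks like this (a sketch only; the attribute and preprocessor names are illustrative, not necessarily what's in the repo):

    from tensorflow.keras import models as keras_models

    def predict_issue_labels(self, title, body):
        # Reload the trained model on every call. Slow, but it guarantees the
        # weights exist in whatever session/thread is serving this prediction,
        # which avoids the FailedPreconditionError above.
        self.model = keras_models.load_model(self.model_file)

        vec_title = self.title_pp.transform([title])  # assumed text preprocessors
        vec_body = self.body_pp.transform([body])
        probs = self.model.predict(x=[vec_body, vec_title]).tolist()[0]
        return dict(zip(self.class_names, probs))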

hamelsmu commented 4 years ago

I encountered this genre of issue when building Issue Label Bot for the first time. Feel free to take a look at https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/app.py in case there is a recipe there that might help.

jlewi commented 4 years ago

Thanks @hamelsmu. I had looked at https://github.com/machine-learning-apps/Issue-Label-Bot/blob/master/flask_app/app.py and couldn't figure out what it was doing differently such that multi-threading doesn't seem to be an issue.

hamelsmu commented 4 years ago

@jlewi I think I'm lost with some of the code changes. Can you point me to the flask app code that is serving the Label Microservice? I can't seem to find it anywhere in master.

jlewi commented 4 years ago

Here is an Architecture Diagram

There are basically two pieces

hamelsmu commented 4 years ago

@jlewi I have an idea how to fix this (I would test it myself, but not sure how to test the microservice):

    # Run Keras in TF1 graph/session mode so we control which session owns the variables.
    import tensorflow.compat.v1 as tf

    set_session = tf.keras.backend.set_session  # compat.v1 Keras backend helper
    keras_models = tf.keras.models

    # When you initialize the model: create a dedicated graph and session,
    # register that session with Keras, and load the model inside it.
    self.session = tf.Session(graph=tf.Graph())
    with self.session.graph.as_default():
        set_session(self.session)
        self.model = keras_models.load_model(model_path)

    # When you make the prediction: re-enter the same graph and re-register
    # the same session, so the loaded variables are found no matter which
    # thread is handling the request.
    with self.session.graph.as_default():
        set_session(self.session)
        self.model.predict(...)
hamelsmu commented 4 years ago

Oh, and sorry for making you repeat the documentation; I should have just looked there instead 🤦‍♂. My apologies.

jlewi commented 4 years ago

Thanks @hamelsmu. If you wanted to try this out, my suggestion would be to follow the developer guide: https://github.com/kubeflow/code-intelligence/blob/master/Label_Microservice/developer_guide.md

That should explain how to

hamelsmu commented 4 years ago

ok I will put this on my backlog