tensorflow / transform

Table not initialized when serving model #237

Open · rcrowe-google opened this issue 3 years ago

rcrowe-google commented 3 years ago

Posting for @awadalaa

We are blocked on experimenting with a new TensorFlow model in production because inference fails with this error:

tensorflow.python.framework.errors_impl.FailedPreconditionError: Table not initialized.

We have narrowed the issue down to a part of our code that applies a BM25 transformation in a TensorFlow Transform job. As part of applying that transformation, it learns and applies a vocabulary; however, when we run inference on the model, it fails to initialize the table from that vocabulary file. Here is the BM25 code we are using and the line where it fails: https://gist.github.com/awadalaa/e9290cf6674884d8e197fe315ed7d832#file-gistfile1-txt-L176-L177

More background: we run a TensorFlow Transform Beam/Dataflow job that executes this transformation and saves the transform graph. Later, when we train our model, we save it with a signature that applies the TFT layer: transformed_features = model.tft_layer(parsed_features). We noticed that the exported model/assets directory does not include the intermediate vocabulary used by the above BM25 transformation, although it does include every other vocabulary file learned in the TFT job. Any ideas why the above transformation would fail to export the vocabulary assets for a saved model?
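
For context, the export follows the usual TFT + Keras signature pattern, roughly as sketched below; this is illustrative rather than our exact code, and model, WORKING_DIR, and EXPORT_DIR are placeholders:

import tensorflow as tf
import tensorflow_transform as tft

# WORKING_DIR is wherever the TFT job wrote its output (placeholder).
tft_transform_output = tft.TFTransformOutput(WORKING_DIR)
model.tft_layer = tft_transform_output.transform_features_layer()

@tf.function(input_signature=[tf.TensorSpec([None], tf.string, name="examples")])
def serve_tf_examples_fn(serialized_examples):
    # Parse serialized tf.Examples, then apply the TFT transform graph.
    parsed_features = tf.io.parse_example(
        serialized_examples, tft_transform_output.raw_feature_spec())
    transformed_features = model.tft_layer(parsed_features)
    return model(transformed_features)

model.save(EXPORT_DIR, save_format="tf",
           signatures={"serving_default": serve_tf_examples_fn})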

Stack trace here:

Traceback (most recent call last):
  File "/Users/aawad/Desktop/keras_predict.py", line 174, in <module>
    print("prediction_output", predict(inference_data))
  File "/usr/local/opt/python@3.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1655, in __call__
    return self._call_impl(args, kwargs)
  File "/usr/local/opt/python@3.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1673, in _call_impl
    return self._call_with_flat_signature(args, kwargs, cancellation_manager)
  File "/usr/local/opt/python@3.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1722, in _call_with_flat_signature
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/usr/local/opt/python@3.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 106, in _call_flat
    cancellation_manager)
  File "/usr/local/opt/python@3.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/opt/python@3.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/usr/local/opt/python@3.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Table not initialized.
  [[{{node StatefulPartitionedCall/StatefulPartitionedCall/transform_features_layer/StatefulPartitionedCall/transform/apply_haystack_vocabulary_query_ngram_substrings_tags_ngram_substrings/hash_table_Lookup/LookupTableFindV2}}]] [Op:__inference_signature_wrapper_23443]

Function call stack: signature_wrapper

varshaan commented 3 years ago

What is the version of TFT being used?

abhijeetrao1988 commented 3 years ago

apache-beam[gcp]==2.28.0
tensorflow-transform==0.28.0
tensorflow==2.4.1

varshaan commented 3 years ago

Re: "We noticed that the exported model/assets directory does not include the intermediate vocabulary used by the above BM25 transformation" --> is this the model exported post training or the output of TFT? If the former, could you clarify if the file exists in the transform output?

awadalaa commented 3 years ago

Hi @varshaan! Thank you for looking into this. The TFT Dataflow job does export the assets; I see the vocab file under transform_fn/assets/needle_vocabulary.

However, these vocab files do not appear in the trained model's model/assets/ directory. Both the TFT job and the training job completed successfully; we only noticed the error when attempting to reload the model and run inference.

I also managed to reproduce the issue using this transformation:

# Assumed module-level imports:
from typing import Dict

import tensorflow as tf
import tensorflow_transform as tft

# This is a method on our transform class; `self` is otherwise unused here.
def get_tfidf(self, feature_dict: Dict[str, tf.Tensor]) -> Dict[str, tf.Tensor]:
    outputs = dict()
    VOCAB_SIZE = 100000
    DELIMITERS = ".,!?() "
    for key, feature in feature_dict.items():
        # Tokenize, then map each token to an index in a learned vocabulary.
        word_tokens = tf.compat.v1.string_split(feature, DELIMITERS)
        word_indices = tft.compute_and_apply_vocabulary(
            word_tokens, top_k=VOCAB_SIZE
        )
        # Compute bag-of-words indices and TF-IDF weights per token.
        bow_indices, tfidf_weight = tft.tfidf(word_indices, VOCAB_SIZE + 1)
        tfidf_score = tf.math.reduce_mean(tf.sparse.to_dense(tfidf_weight), axis=-1)
        # Replace NaNs (features with no tokens) with zeros.
        outputs[f"{key}_tfidf_score"] = tf.where(
            tf.math.is_nan(tfidf_score), tf.zeros_like(tfidf_score), tfidf_score
        )
    return outputs
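
For completeness, this method is invoked from the preprocessing_fn handed to TFT, roughly as below; MyTransform and the "query" feature are made up for illustration:

def preprocessing_fn(inputs):
    # Hypothetical wrapper: MyTransform owns get_tfidf, and "query" stands
    # in for our real text features.
    outputs = dict(inputs)
    outputs.update(MyTransform().get_tfidf({"query": inputs["query"]}))
    return outputs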

In both cases (BM25 and TF-IDF), it seems to fail at prediction time on the apply_vocabulary step. For example, the above transformation failed with:

tensorflow.python.framework.errors_impl.FailedPreconditionError:  Table not initialized.
     [[{{node StatefulPartitionedCall/StatefulPartitionedCall/transform_features_layer/StatefulPartitionedCall/transform/compute_and_apply_vocabulary_1/apply_vocab/hash_table_Lookup/LookupTableFindV2}}]] [Op:__inference_signature_wrapper_2787]

varshaan commented 3 years ago

Since the table does exist in the Transform output, do you mind sharing the code snippet for how the trained model is being exported? In particular, is the tft_layer assigned to an attribute of the exported model [1]? I am assuming this is a Keras model from the stacktrace.

[1] https://github.com/tensorflow/transform/blob/master/examples/census_example_v2.py#L120

awadalaa commented 3 years ago

Yep, it's a Keras model. The TFT layer is attached as an attribute of the Keras model:

model.tft_layer = self.tft_transform_output.transform_features_layer()

This is the bit of code where we export the model https://gist.github.com/awadalaa/bcafb5da46ced7d9373f0d51ce389aa3#file-gistfile1-txt-L24

awadalaa commented 3 years ago

Hi @varshaan, I put together a small example repository that consistently reproduces the issue, based on the census example you linked: https://github.com/awadalaa/TFTReproduceIssue

You can clone the repo and run the following to reproduce the problem:

pip install -r requirements.txt
python -m data.task
python -m trainer.task
python -m inference.task

varshaan commented 3 years ago

Hi, that repro has two Keras models. The "full_model" [1] does not track the TFT layer. Adding full_model.tft_layer = self.tft_transform_output.transform_features_layer() after line 69 in [1] fixes the repro (see the sketch after the footnote below). Normally, no asset files would have been exported with the trainer model at all. However, since you define categorical feature columns for all of the vocabularies except the ones used to evaluate TF-IDF, those feature columns ended up tracking their asset files in the full_model, and hence they were exported fine. The missing asset files are used to evaluate features defined as numeric columns, so that tracking through feature columns didn't exist for them.

[1] https://github.com/awadalaa/TFTReproduceIssue/blob/main/trainer/model.py#L69
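
Concretely, the fix is one line in the trainer, sketched here with the repro repo's names:

# trainer/model.py, after full_model is constructed (sketch):
full_model = tf.keras.Model(inputs=inputs, outputs=outputs)
# Attach the TFT layer to the object that is actually saved, so Keras tracks
# the vocabulary table resources and exports their asset files.
full_model.tft_layer = self.tft_transform_output.transform_features_layer()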

rcrowe-google commented 3 years ago

@awadalaa Does that fix the problem? If so, we should close this issue.

awadalaa commented 3 years ago

Thank you @rcrowe-google and @varshaan! Attaching the tft_layer to the full_model does unblock us!

I'm not sure the issue should be closed, though. The behavior was unexpected because the tft_layer was attached through the prediction signature, and the predictions failed precisely when invoked through that signature. I would have expected that failure mode if I had made predictions by calling model.predict or model.__call__ directly, but not when using the prediction signature. Any reason why the full_model needs to track the tft_layer here, rather than relying on the prediction signature's tft_layer?
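
For reference, the failing call path goes through the signature of the reloaded SavedModel, roughly as below; the path and inputs are illustrative:

import tensorflow as tf

loaded = tf.saved_model.load("/tmp/exported_model")  # illustrative path
predict = loaded.signatures["serving_default"]
# This raised FailedPreconditionError: Table not initialized, even though
# the signature itself applies the tft_layer.
outputs = predict(examples=serialized_examples)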

varshaan commented 3 years ago

My understanding is that Keras expects all resources that need to be tracked to be tracked by the main object being saved (in this case the full_model). I suspect it isn't common for the signatures to be on a model different from the one being saved. I will try to verify this and get back to you.
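
A minimal illustration of that tracking rule, as I understand it (everything here is illustrative):

import tensorflow as tf

# Resources are exported only if they are reachable from the object passed
# to save(); assigning the table as an attribute makes it tracked.
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(["a", "b"]),
        values=tf.constant([0, 1], dtype=tf.int64)),
    default_value=-1)

class Exported(tf.Module):
    def __init__(self, table):
        self.table = table  # tracked: attribute of the saved object

    @tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
    def lookup(self, keys):
        return self.table.lookup(keys)

tf.saved_model.save(Exported(table), "/tmp/table_model")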