Open rcrowe-google opened 3 years ago
What is the version of TFT being used?
apache-beam[gcp]==2.28.0 tensorflow-transform==0.28.0 tensorflow==2.4.1
Re: "We noticed that the exported model/assets directory does not include the intermediate vocabulary used by the above BM25 transformation" --> is this the model exported post training or the output of TFT? If the former, could you clarify if the file exists in the transform output?
hi @varshaan! Thank you for looking into this. The TFT dataflow job does export the assets. I see the vocab file under transform_fn/assets/needle_vocabulary
However, these vocab files do not appear in the trained model's model/assets/ directory. Both the TFT job and the training job were successful. We only noticed the error when attempting to reload the model and run inference with it.
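For anyone wanting to run the same check, here is a minimal sketch that compares the two directories. It assumes the layout described in this thread (transform_fn/assets under the TFT output, and assets directly under the exported model directory); the function name missing_assets is hypothetical and not part of TFT or TensorFlow.

```python
from pathlib import Path
from typing import List


def missing_assets(transform_output_dir: str, saved_model_dir: str) -> List[str]:
    """Return asset file names found in the TFT output but not in the exported model.

    Assumed layout (taken from this thread, not verified against every TFT version):
      <transform_output_dir>/transform_fn/assets/<vocab files>
      <saved_model_dir>/assets/<vocab files>
    """
    tft_assets = Path(transform_output_dir) / "transform_fn" / "assets"
    model_assets = Path(saved_model_dir) / "assets"

    # Vocabulary files written by the TFT job.
    expected = {p.name for p in tft_assets.iterdir() if p.is_file()}

    # Files that actually made it into the exported model (may be none at all).
    exported = set()
    if model_assets.is_dir():
        exported = {p.name for p in model_assets.iterdir() if p.is_file()}

    return sorted(expected - exported)
```

Calling missing_assets("path/to/tft_output", "path/to/model") after training lists the vocabularies that would be unavailable at load time. Both paths are placeholders.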
I also managed to reproduce the issue using this transformation:
def get_tfidf(self, feature_dict: Dict[str, tf.Tensor]) -> Dict[str, tf.Tensor]:
    outputs = dict()
    VOCAB_SIZE = 100000
    DELIMITERS = ".,!?() "
    for key, feature in feature_dict.items():
        word_tokens = tf.compat.v1.string_split(feature, DELIMITERS)
        word_indices = tft.compute_and_apply_vocabulary(
            word_tokens, top_k=VOCAB_SIZE
        )
        bow_indices, tfidf_weight = tft.tfidf(word_indices, VOCAB_SIZE + 1)
        tfidf_score = tf.math.reduce_mean(tf.sparse.to_dense(tfidf_weight), axis=-1)
        outputs[f"{key}_tfidf_score"] = tf.where(
            tf.math.is_nan(tfidf_score), tf.zeros_like(tfidf_score), tfidf_score
        )
    return outputs
In both cases (bm25 and tfidf), it seems to fail at prediction time on the apply_vocabulary step. For example, the above transformation failed with:
tensorflow.python.framework.errors_impl.FailedPreconditionError: Table not initialized.
[[{{node StatefulPartitionedCall/StatefulPartitionedCall/transform_features_layer/StatefulPartitionedCall/transform/compute_and_apply_vocabulary_1/apply_vocab/hash_table_Lookup/LookupTableFindV2}}]] [Op:__inference_signature_wrapper_2787]
Since the table does exist in the Transform output, do you mind sharing the code snippet for how the trained model is being exported? In particular, is the tft_layer assigned to an attribute of the exported model [1]? I am assuming this is a Keras model from the stacktrace.
[1] https://github.com/tensorflow/transform/blob/master/examples/census_example_v2.py#L120
yep, it's a Keras model. The TFT layer is attached as an attribute of the keras model:
model.tft_layer = self.tft_transform_output.transform_features_layer()
This is the bit of code where we export the model https://gist.github.com/awadalaa/bcafb5da46ced7d9373f0d51ce389aa3#file-gistfile1-txt-L24
hi @varshaan I put together a small example repository that consistently reproduces the issue based on the census example you linked: https://github.com/awadalaa/TFTReproduceIssue
you can clone the repo and run this to reproduce the problem:
pip install -r requirements.txt
python -m data.task
python -m trainer.task
python -m inference.task
Hi, that repro has two Keras models. The "full_model" [1] does not track the tft layer. Adding full_model.tft_layer = self.tft_transform_output.transform_features_layer() after line 69 in [1] fixes the repro. Normally no asset files would have been exported to the trainer model. However, since you define categorical feature columns for all the vocabularies other than the ones used to compute tfidf, the feature columns ended up tracking those asset files in the full_model, and hence they were exported fine. The missing asset files are used to compute features defined as numeric columns, so this tracking through the feature columns did not exist for them.
[1] https://github.com/awadalaa/TFTReproduceIssue/blob/main/trainer/model.py#L69
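A minimal sketch of that fix as a reusable helper, for setups with more than one Keras model. The helper name attach_tft_layer is hypothetical. The only TFT call it relies on is transform_features_layer(), which this thread already uses. Sharing a single layer instance across models is an assumption of this sketch; the fix described above creates a separate layer on full_model instead.

```python
from typing import Any, Iterable


def attach_tft_layer(tf_transform_output: Any, models: Iterable[Any]) -> Any:
    """Build the TFT layer once and attach it as an attribute of each model.

    `tf_transform_output` is expected to be a tft.TFTransformOutput (or anything
    exposing transform_features_layer()); `models` are the Keras models that may
    be passed to save(). Both are duck-typed here so the helper has no hard
    dependency on TensorFlow.
    """
    tft_layer = tf_transform_output.transform_features_layer()
    for model in models:
        # Per the explanation above, the object being saved has to hold the
        # layer as an attribute so that its vocabulary assets are tracked.
        model.tft_layer = tft_layer
    return tft_layer
```

In the repro this would be called as attach_tft_layer(self.tft_transform_output, [model, full_model]) before building the serving signature and saving full_model.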
@awadalaa Does that fix the problem? If so then we should close this issue.
thank you @rcrowe-google and @varshaan! Attaching the tft_layer to the full_model does unblock us!
I'm not sure if the issue should be closed though. The behavior was unexpected because the tft_layer was attached through the prediction signature, and the predictions failed when using the signature. I would have expected that failure mode if I had made the predictions using model.predict or model.__call__ explicitly, but not when using the prediction signature. Is there any reason why the full_model needs to track the tft_layer here rather than rely on the prediction signature's tft_layer?
My understanding is that Keras expects that all resources that need to be tracked are tracked by the main object that is being saved (in this case the full_model). I suspect it isn't common that the signatures are on a model different from the one being saved. I will try and verify this and get back to you.
Posting for @awadalaa
We are blocked on experimenting with a new TensorFlow model in production because it fails at inference time with this error:
tensorflow.python.framework.errors_impl.FailedPreconditionError: Table not initialized.
We have narrowed down the issue to a bit of our code that applies a bm25 transformation in a Tensorflow-Transform job. As part of applying that transformation, it learns and applies a vocabulary. However, when we run inference with the model, it fails to initialize the table from that vocabulary file. Here is the BM25 code we are using and the line where it fails: https://gist.github.com/awadalaa/e9290cf6674884d8e197fe315ed7d832#file-gistfile1-txt-L176-L177
More background: We run a Tensorflow-Transform Beam/Dataflow job that executes this transformation and saves the transform graph. Later, when we train our model, we save it with a signature that applies the TFT layer: transformed_features = model.tft_layer(parsed_features). We noticed that the exported model/assets directory does not include the intermediate vocabulary used by the above BM25 transformation, although it does include every other vocabulary file learned in the TFT job. Any ideas why the above transformation would fail to export the vocabulary assets for a saved model?
Stack trace here:
Function call stack: signature_wrapper