tensorflow / recommenders

TensorFlow Recommenders is a library for building recommender system models using TensorFlow.
Apache License 2.0

Unable to save multi-task recommender model #136

Open znaeem opened 3 years ago

znaeem commented 3 years ago

I was following the guide for the multi-task recommender found here, but when I tried to save using model.save(), I was unable to do so, getting the following error:

WARNING:tensorflow:Skipping full serialization of Keras layer <tensorflow_recommenders.metrics.factorized_top_k.FactorizedTopK object at 0x7fa43148ae80>, because it is not built.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:2309: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
---------------------------------------------------------------------------
FailedPreconditionError                   Traceback (most recent call last)
<ipython-input-10-d6cd16d67b34> in <module>()
----> 1 model.save('./')

23 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

FailedPreconditionError: Failed to serialize the input pipeline graph: ResourceGather is stateful. [Op:DatasetToGraphV2]

I also cannot save it in HDF5 format, but I believe that's because the model in question is a custom subclassing of the model class. What is the appropriate way to save the model?

maciejkula commented 3 years ago

Thanks for the question! I have some workarounds below, but I agree the ergonomics here aren't great. I'll investigate further and see if this can be made better.

The reason saving isn't working here is that the train/test step uses a retrieval metric defined over a dataset of movies. This isn't really saveable to SavedModel, since the test dataset is not included in the resulting export file.

Here are a couple of suggestions on how to deal with this:

  1. When saving your model during training, use the ModelCheckpoint callback, or .save_weights/.load_weights. Because this saves only the weights of the model, it doesn't run afoul of the problem.
  2. When exporting the retrieval model for serving our suggestion is to export a top-K layer - see the guide here. That submodel does not include the evaluation metrics, and so will be fine.
  3. If you'd like to continue using model.save for training, you need to unset the metrics from self.retrieval_task before saving. For example, train the model as before, but call the following before calling model.save:
model.retrieval_task = tfrs.tasks.Retrieval()  # Removes the metrics.
model.compile()
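
Suggestion 1 can be sketched as a weights-only save/restore round trip. This is a generic Keras sketch, not the multi-task model from the guide: the toy Sequential model and file names are placeholders.

```python
import numpy as np
import tensorflow as tf

def make_model():
    # Stand-in for the subclassed multi-task model. save_weights/load_weights
    # only touch variables, so the unserializable FactorizedTopK metric and
    # its candidate dataset are never part of what gets written to disk.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(4),
    ])

model = make_model()

# Option A: checkpoint weights during fit() via a callback.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath="model.weights.h5", save_weights_only=True)

# Option B: save and restore weights manually.
model.save_weights("model.weights.h5")
restored = make_model()  # must have the same architecture
restored.load_weights("model.weights.h5")
```

Because only variables are saved, the restoring side has to rebuild the same architecture in code before calling load_weights.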

cfregly commented 3 years ago

@maciejkula can we serve this saved model with TFServing similar to the BruteForce model? Same signature? Not at my keyboard now, but I can verify when I’m back.

znaeem commented 3 years ago

@maciejkula Thank you for the quick response, I will try the steps you have outlined.

cfregly commented 3 years ago

After saving, I'm seeing the signature below using saved_model_cli (along with an error, which I'm ignoring for now).

Note: I'm looking for a signature similar to the BruteForce model's, where I can pass in a user_id and receive a list of movie_titles using TF Serving. Is this possible with anything beyond the BruteForce approach?

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is: 

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['movie_title'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: serving_default_movie_title:0
    inputs['user_id'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: serving_default_user_id:0
    inputs['user_rating'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: serving_default_user_rating:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['output_1'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 32)
        name: StatefulPartitionedCall:0
    outputs['output_2'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 32)
        name: StatefulPartitionedCall:1
    outputs['output_3'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1)
        name: StatefulPartitionedCall:2
  Method name is: tensorflow/serving/predict
Traceback (most recent call last):
  File "/opt/conda/bin/saved_model_cli", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/tools/saved_model_cli.py", line 1185, in main
    args.func(args)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/tools/saved_model_cli.py", line 715, in show
    _show_all(args.dir)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/tools/saved_model_cli.py", line 307, in _show_all
    _show_defined_functions(saved_model_dir)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/tools/saved_model_cli.py", line 187, in _show_defined_functions
    trackable_object = load.load(saved_model_dir)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 603, in load
    return load_internal(export_dir, tags, options)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 633, in load_internal
    ckpt_options)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 131, in __init__
    self._restore_checkpoint()
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 330, in _restore_checkpoint
    load_status = saver.restore(variables_path, self._checkpoint_options)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 1320, in restore
    checkpoint=checkpoint, proto_id=0).restore(self._graph_view.root)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 209, in restore
    restore_ops = trackable._restore_from_checkpoint_position(self)  # pylint: disable=protected-access
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/base.py", line 914, in _restore_from_checkpoint_position
    tensor_saveables, python_saveables))
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/tracking/util.py", line 290, in restore_saveables
    tensor_saveables)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 361, in validate_and_slice_inputs
    _add_saveable(saveables, seen_ops, converted_saveable_object)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/training/saving/saveable_object_util.py", line 331, in _add_saveable
    saveable.name)
ValueError: The same saveable will be restored with two names: user_model/layer_with_weights-0/_table/.ATTRIBUTES/table

maciejkula commented 3 years ago

@cfregly if you'd like to export a brute-force based model, you'll need to pick out the retrieval model subcomponents.

# Create a model that takes in raw query features, and recommends
# movies out of the entire movies dataset.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index(movies.batch(100).map(model.movie_model), movies)

# Get recommendations.
_, titles = index(tf.constant(["42"]))

# Save as before.
index.save(...)

In doing so you'll have a model that's trained jointly but served individually.

Nothing beyond brute force yet, but this will change soon.
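
For serving, the exported index round-trips through tf.saved_model like any other SavedModel. Here is a minimal sketch with a stand-in tf.Module in place of the BruteForce layer; the ToyIndex class, the titles, and the export path are made up for illustration.

```python
import tensorflow as tf

class ToyIndex(tf.Module):
    """Stand-in for the exported BruteForce index: string ids in, titles out."""

    @tf.function(input_signature=[tf.TensorSpec([None], tf.string)])
    def __call__(self, ids):
        # A real index would score the query embedding against all candidates;
        # here we just return the same two titles for every id in the batch.
        titles = tf.constant([["Movie A", "Movie B"]])
        return tf.tile(titles, [tf.shape(ids)[0], 1])

tf.saved_model.save(ToyIndex(), "/tmp/toy_index")

# Reload outside the training process (e.g. in a serving container)
# and query it exactly like the in-memory index.
loaded = tf.saved_model.load("/tmp/toy_index")
titles = loaded(tf.constant(["42"]))
```

The explicit input_signature on __call__ is what gives the SavedModel a clean serving signature with a single named string input.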

znaeem commented 3 years ago

@maciejkula I have followed the steps above to export the brute-force-based model, but I get the following issue:

ValueError: Got non-flat/non-unique argument names for SavedModel signature 'serving_default': more than one argument to '__inference_signature_wrapper_5077' was named 'customer id'. Signatures have one Tensor per named input, so to have predictable names Python functions used to generate these signatures should avoid *args and Tensors in nested structures unless unique names are specified for each. Use tf.TensorSpec(..., name=...) to provide a name for a Tensor input.

I am not sure what the error means, but I think it has something to do with input names clashing. My model uses preprocessing layers that call .adapt() on the columns of the training dataset; these are used in the functional models for the query and candidate towers before being wrapped in their own model classes (like QueryModel and CandidateModel in this guide).

maciejkula commented 3 years ago

I suspect you're passing a nested dict of tensors into your function, and that the customer_id key appears more than once. When the model is saved, these structures are flattened, so having the same key twice causes the saving to fail.
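
The flattening behavior can be illustrated in plain Python. This toy flatten is only an analogy for how SavedModel signatures name their inputs, not TensorFlow's actual implementation:

```python
def flatten(structure, out=None):
    # Collapse a nested dict of "tensors" into a flat name -> value mapping,
    # the way a signature has one named tensor per input. A key that appears
    # at two nesting levels collides after flattening.
    if out is None:
        out = {}
    for key, value in structure.items():
        if isinstance(value, dict):
            flatten(value, out)
        elif key in out:
            raise ValueError(f"more than one argument was named {key!r}")
        else:
            out[key] = value
    return out

flatten({"customer_id": 1, "context": {"timestamp": 2}})   # fine
# flatten({"customer_id": 1, "ctx": {"customer_id": 2}})   # raises ValueError
```

Giving each input a unique name via tf.TensorSpec(..., name=...), as the error message suggests, is how you avoid the collision on the TensorFlow side.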

Chm-vinicius commented 3 months ago

I have the same issue when trying to save a BruteForce-based model. Does anyone have a workaround?