tensorflow / ranking

Learning to Rank in TensorFlow
Apache License 2.0
2.74k stars 476 forks

TensorFlow model cannot be parsed within the memory limit #256

Open tillwf opened 3 years ago

tillwf commented 3 years ago

Hello,

I'm trying to upload a model generated with TFRanking (32 MB) to BigQuery, which I saved like this:

signatures = {
    'serving_default':
        make_keras_tft_serving_fn(
            ranker,
            tf_transform_output,
            context_cols,
            example_cols
        ).get_concrete_function(
            tf.TensorSpec(
                shape=[None],
                dtype=tf.string,
                name='examples'
            )
        ),
}
ranker.load_weights(checkpoint)
ranker.save(model_dir, save_format='tf', signatures=signatures)

but I got this error:

Error while reading data, error message: TensorFlow model cannot be parsed within the memory limit; try reducing the model size


Previously I managed to upload a bigger model (>200 MB) created with regular TF 1.13 code, so I don't understand this message.

Has anyone already encountered this?

Thanks

On Ubuntu 18.04, Python 3.7.3

tensorflow==2.4.1
tensorflow-addons==0.12.1
tensorflow-datasets==4.2.0
tensorflow-estimator==2.4.0
tensorflow-hub==0.11.0
tensorflow-metadata==0.29.0
tensorflow-model-optimization==0.5.0
tensorflow-ranking==0.3.3
tensorflow-serving-api==2.4.1
tensorflow-transform==0.29.0
tillwf commented 3 years ago

I tried with the latest version of tensorflow-ranking (0.4.0) and it is still not working. Could someone help me? Thank you.

tillwf commented 3 years ago

Here is an `ncdu` listing of the model folder:

  335.0 MiB [##########] /train
   31.5 MiB [          ]  saved_model.pb
    3.1 MiB [          ] /validation
    2.4 MiB [          ] /variables
  792.0 KiB [          ]  keras_metadata.pb
   84.0 KiB [          ] /assets

I tried without the train folder, but it didn't change the message.

Any clue?

tillwf commented 3 years ago

I tracked the memory consumption of a script doing:

import tensorflow as tf
model = tf.saved_model.load("model_path")

(the model path does not contain the train folder)

and we see this:

[plot: process memory usage while loading the model]

Is there a way to reduce this memory usage? The model weighs only 30 MiB on disk but grows to 2 GiB in memory.
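For reference, this is roughly how the memory growth can be tracked; a minimal sketch using the stdlib `resource` module (on Linux, `ru_maxrss` is reported in KiB). The model load is the same call as above, commented out here so the helper stands alone:

```python
import resource

def peak_rss_mib():
    # Peak resident set size of this process; Linux reports ru_maxrss in KiB
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

before = peak_rss_mib()
# import tensorflow as tf
# model = tf.saved_model.load("model_path")  # the ~30 MiB SavedModel from above
after = peak_rss_mib()
print(f"peak RSS grew by ~{after - before:.0f} MiB during the load")
```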

tillwf commented 3 years ago

I tried to reduce the size of the model by doing:

converter = tf.lite.TFLiteConverter.from_keras_model(ranker)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_model = converter.convert()  # returns the serialized flatbuffer bytes
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

but I got this error:

2021-06-28 14:55:07.469989: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
W0628 14:55:07.577553 140086528587584 signature_serialization.py:151] Function `_wrapped_model` contains input name(s) args_0 with unsupported characters which will be renamed to args_0_275 in the SavedModel.
W0628 14:55:34.146763 140086528587584 save.py:243] Found untraced functions such as listwise_dense_features_layer_call_and_return_conditional_losses, listwise_dense_features_layer_call_fn, dense_3_layer_call_and_return_conditional_losses, dense_3_layer_call_fn, listwise_dense_features_layer_call_and_return_conditional_losses while saving (showing 5 of 65). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: /tmp/tmpte46l_mb/assets
I0628 14:55:41.276194 140086528587584 builder_impl.py:775] Assets written to: /tmp/tmpte46l_mb/assets
2021-06-28 14:55:48.804822: I tensorflow/core/grappler/devices.cc:69] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2021-06-28 14:55:48.804951: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session
2021-06-28 14:55:48.949690: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:1144] Optimization results for grappler item: graph_to_optimize
  function_optimizer: function_optimizer did nothing. time = 0.042ms.
  function_optimizer: function_optimizer did nothing. time = 0ms.
*** tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot convert a Tensor of dtype resource to a NumPy array.

Could it be related to this answer https://github.com/tensorflow/tensorflow/issues/37441#issuecomment-775747315 ?
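For anyone trying the same thing: a pattern that sometimes avoids the resource-tensor error is converting from the SavedModel on disk instead of the in-memory Keras object, with SELECT_TF_OPS as a fallback for unsupported ops. This is a hedged sketch with a tiny stand-in model; I have not verified it against a TF-Ranking model:

```python
import tempfile

import tensorflow as tf

# Stand-in for `ranker`: a tiny model, exported to disk first
export_dir = tempfile.mkdtemp()
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)),
                             tf.keras.layers.Dense(1)])
tf.saved_model.save(model, export_dir)

# Convert from the SavedModel on disk rather than the live Keras object;
# SELECT_TF_OPS lets the converter fall back to full TF kernels for ops
# the TFLite builtins don't cover.
converter = tf.lite.TFLiteConverter.from_saved_model(export_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_bytes = converter.convert()
print(f"{len(tflite_bytes)} bytes")
```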

Does anyone have any idea how I can reduce the model's memory size?

tanguycdls commented 3 years ago

We suffer from the same issue; it seems to come from the structure of the model (the number of ops). For now we have tried reducing the number of layers, which seems to lessen the problem. I'm not sure TF Lite would work: you won't be able to export it back to the SavedModel format.

Did you find anything else on your side?

tillwf commented 3 years ago

Hello @tanguycdls. We have not found any solution yet, but this is critical for us: we won't be able to use TFRanking without it. We only have 3 layers at the moment, which does not seem like a big number. We will try with one layer just to see, but that is not a viable solution either.

tanguycdls commented 3 years ago

We think we found a workaround, but we're still not sure it's viable: we tried transforming our Keras models to the old (TF1-format) frozen graph, which we then re-attach to a SavedModel. It seems to reduce the RAM usage.

Take a look at this: https://leimao.github.io/blog/Save-Load-Inference-From-TF2-Frozen-Graph/

and then, once you have your concrete function, reattach it to a tf.Module:

from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

full_model = tf.function(lambda x: model(x))
full_model = full_model.get_concrete_function(
    tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))  # fix this to your actual input shapes

# Get frozen ConcreteFunction
frozen_func = convert_variables_to_constants_v2(full_model)
frozen_func.graph.as_graph_def()

module = tf.Module()
module.func = frozen_func
tf.saved_model.save(module, export_dir, signatures=frozen_func)  # must specify the signature

Again, we're still working on the topic, so we don't have any long-term view of this solution: there might be an issue somewhere...

If you find an issue or have a better idea, please tell us!

and some links: https://github.com/search?q=convert_variables_to_constants_v2&type=code

tillwf commented 2 years ago

Hello @tanguycdls, thank you again for your help. Did you find any proper solution? Yours does not work for us, as it raises another exception.

tanguycdls commented 2 years ago

> Hello @tanguycdls, thank you again for your help. Did you find any proper solution? Yours does not work for us, as it raises another exception.

We still use the solution above; this plus some Grappler optimizations fixed the issue for most domains. Are you using a model that cannot be frozen?
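For reference, Grappler passes can be toggled globally via `tf.config.optimizer.set_experimental_options`; the exact optimizations we enabled aren't spelled out above, so the passes below are purely illustrative:

```python
import tensorflow as tf

# Illustrative choice of Grappler passes (not necessarily the ones we used)
tf.config.optimizer.set_experimental_options({
    "constant_folding": True,
    "arithmetic_optimization": True,
    "function_optimization": True,
})
print(tf.config.optimizer.get_experimental_options())
```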