stekiri commented 4 years ago

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: NA
TensorFlow installed from (source or binary): Binary (pip)
TensorFlow version (use command below): v2.0.0-rc2-26-g64c3d382ca 2.0.0
Python version: Python 3.6.8
Bazel version (if compiling from source): NA
GCC/Compiler version (if compiling from source): NA
CUDA/cuDNN version: CUDA 10.0.130_411.31; cuDNN 10.0 v7.6.5.32
GPU model and memory: NVIDIA Quadro P2000, 4 GB

Describe the current behavior

When the model is saved in the default tf format, warnings are logged when serving the model.

Example warning log:

WARNING:tensorflow:5 out of the last 5 calls to <function recreate_function.<locals>.restored_function_body at 0x000001ED79058730> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings is likely due to passing python objects instead of tensors. Also, tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. Please refer to https://www.tensorflow.org/beta/tutorials/eager/tf_function#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.

When the model is saved in the hdf5 format, the warnings do not occur.
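For context on the warning quoted above, here is a minimal sketch of what tf.function retracing looks like in isolation. It is not taken from the issue: the function `square` and the list `trace_log` are hypothetical names, and the example only illustrates the "python objects instead of tensors" cause that the warning text mentions, not the restored-model behavior reported here.

```python
import tensorflow as tf

trace_log = []  # hypothetical helper: records one entry per trace


@tf.function
def square(x):
    # Python side effects run only while the function is being traced,
    # so the length of trace_log counts how many traces happened.
    trace_log.append(x)
    return x * x


# Python ints are treated as constants: each distinct value is traced anew.
for i in range(3):
    square(i)
python_traces = len(trace_log)

# Tensors with the same dtype and shape share a single trace.
del trace_log[:]
for i in range(3):
    square(tf.constant(i))
tensor_traces = len(trace_log)

print(python_traces, tensor_traces)
```

If the counts differ as expected, passing tensors rather than Python values is what keeps the trace count low; whether that applies to a function restored from a SavedModel is a separate question.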

Describe the expected behavior

The save formats should be equivalent and behave in the same way.

Code to reproduce the issue

Execute the following scripts to create and serve the model.

Run the first script with format_ext = '', which saves the model in tf format. Restart the Python console, then serve the model with the second script, which produces the aforementioned warnings.
When the scripts are run with format_ext = '.h5', the model is saved in hdf5 format and no warnings appear.
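The scripts below rely on the file suffix alone to select the save format. As a rough sketch of that selection, the hypothetical helper `infer_save_format` below maps a path to a format name. The suffix list is an assumption about how Keras in TF 2.x chooses HDF5 and is not confirmed anywhere in this thread.

```python
import os

# Assumption: suffixes that Keras in TF 2.x treats as HDF5; not confirmed here.
HDF5_EXTENSIONS = ('.h5', '.hdf5', '.keras')


def infer_save_format(path):
    """Return 'h5' for an HDF5-style suffix, otherwise 'tf' (SavedModel)."""
    ext = os.path.splitext(path)[1]
    return 'h5' if ext in HDF5_EXTENSIONS else 'tf'


for format_ext in ('', '.h5'):
    model_path = os.path.join('out', 'mnist-classifier{}'.format(format_ext))
    print(model_path, '->', infer_save_format(model_path))
```

Passing save_format explicitly to model.save would make the choice independent of the path, at the cost of one more argument in the reproduction script.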

Model creation:

import os

import tensorflow as tf

format_ext = ''  # '.h5' or empty for tf format
os.makedirs('out', exist_ok=True)  # the HDF5 writer does not create missing parent directories
model_path = os.path.join('out', 'mnist-classifier{}'.format(format_ext))

gpus = tf.config.experimental.list_physical_devices('GPU')

tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512),
     tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512),
     tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512)]
)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inputs = tf.keras.Input(shape=(784,), name='digits')
    x = tf.keras.layers.Dense(64, activation='relu', name='dense_1')(inputs)
    x = tf.keras.layers.Dense(64, activation='relu', name='dense_2')(x)
    outputs = tf.keras.layers.Dense(10, activation='softmax', name='predictions')(x)

    model = tf.keras.Model(inputs=inputs, outputs=outputs)

    model.compile(optimizer=tf.keras.optimizers.RMSprop(),  # Optimizer
                  # Loss function to minimize
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  # List of metrics to monitor
                  metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

model.save(model_path)

Model serving:

import os

import tensorflow as tf

format_ext = ''  # '.h5' or empty for tf format
model_path = os.path.join('out', 'mnist-classifier{}'.format(format_ext))

gpus = tf.config.experimental.list_physical_devices('GPU')

tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512),
     tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512),
     tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512)]
)

(_, _), (x_test, _) = tf.keras.datasets.mnist.load_data()
x_test = x_test.reshape(10000, 784).astype('float32') / 255

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    loaded_model = tf.keras.models.load_model(model_path)
    predictions = loaded_model.predict(x_test, batch_size=64)

Other info / logs

The warnings occur only when more than two virtual GPUs (vGPUs) are used.

gadagashwini-zz commented 4 years ago

I could replicate the issue with TF 2.0 on Colab. Please find the gist here. Thanks!

jvishnuvardhan commented 4 years ago

@stekiri Can you please try with TF 2.1 and tf-nightly? When I ran it in Colab, I saw a different warning, as follows. Thanks!

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
WARNING:tensorflow:NCCL is not supported when using virtual GPUs, fallingback to reduction to one device
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2')
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).

stekiri commented 4 years ago

With TF 2.1, the warning I mentioned no longer appears for the above script. However, if I use four virtual GPUs (in the serving script) instead of three, the warning is displayed again.

You can reproduce it with the following virtual device config:

tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512),
     tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512),
     tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512),
     tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512)]
)

jvishnuvardhan commented 4 years ago

@stekiri I cannot reproduce that warning. When I modify the script as you described above, I get a RuntimeError, as follows. Please check the gist here. Thanks!

RuntimeError                              Traceback (most recent call last)
in ()
      9      tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512),
     10      tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512),
---> 11      tf.config.experimental.VirtualDeviceConfiguration(memory_limit=512)]
     12 )
     13

1 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/context.py in set_logical_device_configuration(self, dev, virtual_devices)
   1300     if self._context_handle is not None:
   1301       raise RuntimeError(
-> 1302           "Virtual devices cannot be modified after being initialized")
   1303
   1304     self._virtual_device_map[dev] = virtual_devices

RuntimeError: Virtual devices cannot be modified after being initialized

stekiri commented 4 years ago

You need to restart the runtime between running the save script and the load script, because virtual devices cannot be changed once they have been initialized.
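One way to get the same effect as a manual restart is to run each script in its own interpreter process, so that each starts with an uninitialized device context. The sketch below is only an illustration: `run_in_fresh_interpreter` is a hypothetical helper, and the two one-line programs are stand-ins for the real save and load scripts.

```python
import subprocess
import sys


def run_in_fresh_interpreter(source):
    """Run Python source in a new process and return what it printed."""
    result = subprocess.run(
        [sys.executable, '-c', source],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise CalledProcessError if the child process fails
    )
    return result.stdout


# Stand-ins for the save and load scripts; each gets a clean process.
save_output = run_in_fresh_interpreter("print('saved')")
load_output = run_in_fresh_interpreter("print('loaded')")
print(save_output.strip(), load_output.strip())
```

In a notebook environment such as Colab, restarting the runtime remains the simpler option; the subprocess approach is mainly useful when the two steps need to be scripted together.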

jvishnuvardhan commented 4 years ago

@stekiri I ran the first part of your code, restarted the runtime, and then ran the second part. I cannot reproduce the issue with a recent tf-nightly. Please check the gist here.

Can you please check once more and close the issue if it is resolved for you? Thanks!

stekiri commented 4 years ago

It seems to be resolved. Thanks!

GF-Huang commented 3 years ago

jvishnuvardhan commented 3 years ago

@GF-Huang Can you please open a new issue with a standalone code to reproduce the issue? thanks!

GF-Huang commented 3 years ago

> @GF-Huang Can you please open a new issue with a standalone code to reproduce the issue? thanks!

#47554

TheMoMatthias commented 2 years ago

I am experiencing the same error in TensorFlow 2.9.1. I kept receiving the error message after removing the arguments activation='relu' and dropout=0.2. Does this affect the model somehow, or can the error message be ignored?

tensorflow / tensorflow

Missing information when saving model in tf format #35146