tf keras model.save_weights not working with MirroredStrategy in 1.12

BenPoutine commented 5 years ago

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes, duplicated the minimal code from documentation snippet from here
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): "18.04.1 LTS (Bionic Beaver)"
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: None
TensorFlow installed from (source or binary): source and binary
TensorFlow version (use command below): v1.12.0-rc2-0-g748435b8ef
Python version: 3.6.6
Bazel version (if compiling from source): 0.15.2
GCC/Compiler version (if compiling from source): 6.4.0
CUDA/cuDNN version: 9.0 / 7.3
GPU model and memory: 2 x 1080-ti

Describe the current behavior: Cannot save a tf.keras model trained with MirroredStrategy, either by calling save_weights or through a tf.keras.callbacks.ModelCheckpoint callback. Saving does work if MirroredStrategy is not used.

Code to reproduce the issue

import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(1,))
predictions = tf.keras.layers.Dense(1)(inputs)
model = tf.keras.models.Model(inputs=inputs, outputs=predictions)

features = tf.data.Dataset.from_tensors([1.]).repeat(10000).batch(10)
labels = tf.data.Dataset.from_tensors([1.]).repeat(10000).batch(10)
train_dataset = tf.data.Dataset.zip((features, labels))

distribution = tf.contrib.distribute.MirroredStrategy()

model.compile(loss='categorical_crossentropy',
              optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.2),
              distribute=distribution)
model.fit(train_dataset, epochs=5, steps_per_epoch=10)

And adding:

model.save_weights('my_weight')

Error looks like:

Traceback (most recent call last):
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/values.py", line 72, in get
    return self._index[device]
KeyError: '/replica:0/task:0/device:CPU:0'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "minimal_example.py", line 17, in <module>
    model.save_weights('my_weight')
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py", line 1449, in save_weights
    session = backend.get_session()
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 469, in get_session
    _initialize_variables(session)
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 722, in _initialize_variables
    variables = _get_variables(ops.get_default_graph())
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 716, in _get_variables
    variables.update(opt.optimizer.variables())
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 787, in variables
    optimizer_variables = [v for v in self._non_slot_variables()
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 788, in <listcomp>
    if _from_current_graph(v)]
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/training/optimizer.py", line 781, in _from_current_graph
    return variable.op.graph is current_graph
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/values.py", line 308, in op
    return self.get().op
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/contrib/distribute/python/values.py", line 76, in get
    (device, self._index.keys(), device_util.current())), e)
  File "<string>", line 3, in raise_from
ValueError: Device /replica:0/task:0/device:CPU:0 not found in dict_keys(['/replica:0/task:0/device:GPU:0', '/replica:0/task:0/device:GPU:1']) (current device )
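Reading the chained traceback, the optimizer's variable appears to be a mirrored value that holds one component per GPU, and the lookup fails because it is asked for a component on CPU:0, which is not among its keys. The toy class below is only a sketch of that lookup pattern under this reading, not TensorFlow's implementation; ToyMirroredValue and its constructor are hypothetical names made up for illustration.

```python
class ToyMirroredValue:
    """Toy stand-in for a per-device value; NOT the real TensorFlow class."""

    def __init__(self, index):
        # Maps a device string to that device's component of the value.
        self._index = dict(index)

    def get(self, device):
        try:
            return self._index[device]
        except KeyError as e:
            # Re-raise with more context, chaining the original KeyError.
            raise ValueError(
                "Device %s not found in %s" % (device, list(self._index))
            ) from e


value = ToyMirroredValue({
    "/replica:0/task:0/device:GPU:0": "component on GPU 0",
    "/replica:0/task:0/device:GPU:1": "component on GPU 1",
})

try:
    # Nothing was registered for the CPU, so this lookup fails.
    value.get("/replica:0/task:0/device:CPU:0")
except ValueError as err:
    lookup_error = err
    print(type(err).__name__, "caused by", type(err.__cause__).__name__)
```

The GPU lookups succeed, while the CPU lookup produces a ValueError chained from a KeyError, which is the same shape as the two-part traceback above.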

Changing to:

checkpoint_path = "my_weight"
cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1,
                                                 period=1)
model.fit(train_dataset,
          epochs=5,
          steps_per_epoch=10,
          callbacks=[cp_callback])

This also does not work and ends up with:

Epoch 1/5
 1/10 [==>...........................] - ETA: 4s - loss: 1.1921e-07
Epoch 00001: saving model to model_dir/my_weight
WARNING:tensorflow:You are accessing attribute _replicated_modelof the DistributedCallbackModel that may not have been set correctly.
Traceback (most recent call last):
  File "minimal_example.py", line 35, in <module>
    callbacks=[cp_callback])
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1624, in fit
    validation_steps=validation_steps)
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_distributed.py", line 198, in fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py", line 214, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py", line 599, in on_epoch_end
    self.model.save_weights(filepath, overwrite=True)
  File "/home/BP/anaconda3/envs/tf12_gpu/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 2336, in save_weights
    self._replicated_model.save_weights(filepath, overwrite=overwrite,
AttributeError: 'NoneType' object has no attribute 'save_weights'

ymodak commented 5 years ago

You need to save the weights of the model as an HDF5 file. Try model.save_weights('my_weight.h5') instead.

BenPoutine commented 5 years ago

@ymodak model.save_weights('my_weight.h5'), model.save_weights('my_model.h5', save_format='h5'), and model.save_weights('my_weight') (which should save in the TensorFlow checkpoint file format) all fail when MirroredStrategy is used. All of them work if MirroredStrategy is not used.
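To make that comparison easy to repeat, the three calls can be tried in one place and the outcome of each recorded. This is only a sketch: try_save_variants is a hypothetical helper, it assumes model exposes the Keras save_weights(filepath, save_format=...) method, and it catches every exception purely so it can report them.

```python
import os


def try_save_variants(model, directory):
    """Call save_weights in the three ways discussed and report each outcome.

    `model` is assumed to be a tf.keras model (any object with a compatible
    save_weights method works). Returns {label: "ok" or "ExcType: message"}.
    """
    variants = [
        ("h5 by extension", "my_weight.h5", {}),
        ("h5 by save_format", "my_model.h5", {"save_format": "h5"}),
        ("TensorFlow checkpoint", "my_weight", {}),
    ]
    results = {}
    for label, filename, kwargs in variants:
        path = os.path.join(directory, filename)
        try:
            model.save_weights(path, **kwargs)
            results[label] = "ok"
        except Exception as e:  # broad on purpose: we only want a report
            results[label] = "%s: %s" % (type(e).__name__, e)
    return results
```

Calling try_save_variants(model, '.') after model.fit(...), once with and once without distribute=distribution, gives two reports that can be compared side by side.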

BenPoutine commented 5 years ago

sorry about that

rmrao commented 5 years ago

You need to save inside distribution.scope(), so this should work:

with distribution.scope():
    model.save_weights('my_weight.h5')

Also if you're then trying to load weights under the MirroredStrategy, I think it will only load onto the first tower (although maybe this is fixed?). Anyway you can look here for an example of how to do it.

ymodak commented 5 years ago

> You need to save inside distribution.scope(), so this should work:
>
> with distribution.scope():
>     model.save_weights('my_weight.h5')
>
> Also if you're then trying to load weights under the MirroredStrategy, I think it will only load onto the first tower (although maybe this is fixed?). Anyway you can look here for an example of how to do it.

Did you get a chance to try this?

kevinyen-oath commented 5 years ago

I'm at version 1.12.0, and model.save_weights('my_weight.h5') works fine for a training model with MirroredStrategy.

I did run into the callback issue as well. The following seems to work for me:

ModelCheckpoint(filepath='...', save_weights_only=False)
# which is internally doing self.model.save(filepath, overwrite=True)

However, the following doesn't work and raises AttributeError: 'NoneType' object has no attribute 'save_weights', like the OP's issue:

ModelCheckpoint(filepath='...', save_weights_only=True)
# which is internally doing self.model.save_weights(filepath, overwrite=True)

guptapriya commented 5 years ago

This should be working at master, as we now have unit tests for saving and loading weights: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/distribute/python/keras_test.py#L1158

It might be working with the 1.13 release candidate as well, but I'm not 100% sure. Please try it out and let us know if it is still broken.
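Since the fix is only expected from 1.13 onward (and even that is unconfirmed here), a script that has to run on several installs could check the version before relying on save_weights under MirroredStrategy. A minimal sketch, assuming version strings shaped like 1.12.0, 1.13.0-rc0 or v1.12.0-rc2-0-g748435b8ef; release_tuple and at_least are hypothetical helper names, and release-candidate suffixes are deliberately ignored.

```python
import re


def release_tuple(version):
    """Return the first (major, minor) pair found in a version string."""
    match = re.search(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % (version,))
    return int(match.group(1)), int(match.group(2))


def at_least(version, minimum):
    """True if `version` is at or above `minimum`, comparing major.minor only."""
    return release_tuple(version) >= release_tuple(minimum)


print(at_least("1.12.0", "1.13"))      # False
print(at_least("1.13.0-rc0", "1.13"))  # True
```

In a real script the first argument would be tf.__version__. Passing the check only means the install is new enough to be worth trying; it does not confirm that saving works.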

tensorflow / tensorflow

tf keras model.save_weights not working with MirroredStrategy in 1.12 #23431