keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.

tf.keras.mixed_precision.LossScaleOptimizer causes Graph execution error when using tfa.optimizers.MultiOptimizer and mixed_precision #63

Open Farbdose opened 1 year ago

Farbdose commented 1 year ago

System information.

Describe the problem. I want to use mixed precision and tfa.optimizers.MultiOptimizer at the same time.

Describe the current behavior. TensorFlow crashes with a Graph execution error when mixed precision is used together with MultiOptimizer.

Describe the expected behavior. No crash.

Contributing.

Standalone code to reproduce the issue.

https://colab.research.google.com/drive/1dk9SXd88aVwWHs7mshnX8sR8FVoEJOt-?usp=sharing


import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_addons as tfa
from tensorflow.keras import mixed_precision

policy = mixed_precision.Policy('mixed_float16')

# setting the global policy to mixed_float16 triggers the bug; comment this out and training succeeds
mixed_precision.set_global_policy(policy)

ds_train, = tfds.load('mnist', split=['train'], as_supervised=True)

ds_train = ds_train.cache()
ds_train = ds_train.batch(32)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

model = tf.keras.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10)
])
optimizers = [
    tf.keras.optimizers.Adam(learning_rate=0.001),
    tf.keras.optimizers.Adam(learning_rate=0.002)
]

optimizers_and_layers = [
    (optimizers[0], model.layers[:2]), 
    (optimizers[1], model.layers[2:])
]
optimizer = tfa.optimizers.MultiOptimizer(optimizers_and_layers)

model.compile(optimizer=optimizer, loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(ds_train, epochs=1)
Running this crashes with:

InvalidArgumentError: Graph execution error:

Detected at node 'cond_1/AssignAddVariableOp' defined at (most recent call last):
    File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "/usr/local/lib/python3.8/dist-packages/ipykernel_launcher.py", line 16, in <module>
      app.launch_new_instance()
    File "/usr/local/lib/python3.8/dist-packages/traitlets/config/application.py", line 992, in launch_instance
      app.start()
    File "/usr/local/lib/python3.8/dist-packages/ipykernel/kernelapp.py", line 612, in start
      self.io_loop.start()
    File "/usr/local/lib/python3.8/dist-packages/tornado/platform/asyncio.py", line 149, in start
      self.asyncio_loop.run_forever()
    File "/usr/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
      self._run_once()
    File "/usr/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
      handle._run()
    File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
      self._context.run(self._callback, *self._args)
    File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 690, in <lambda>
      lambda f: self._run_callback(functools.partial(callback, future))
    File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 743, in _run_callback
      ret = callback()
    File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 787, in inner
      self.run()
    File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 748, in run
      yielded = self.gen.send(value)
    File "/usr/local/lib/python3.8/dist-packages/ipykernel/kernelbase.py", line 365, in process_one
      yield gen.maybe_future(dispatch(*args))
    File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 209, in wrapper
      yielded = next(result)
    File "/usr/local/lib/python3.8/dist-packages/ipykernel/kernelbase.py", line 268, in dispatch_shell
      yield gen.maybe_future(handler(stream, idents, msg))
    File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 209, in wrapper
      yielded = next(result)
    File "/usr/local/lib/python3.8/dist-packages/ipykernel/kernelbase.py", line 543, in execute_request
      self.do_execute(
    File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 209, in wrapper
      yielded = next(result)
    File "/usr/local/lib/python3.8/dist-packages/ipykernel/ipkernel.py", line 306, in do_execute
      res = shell.run_cell(code, store_history=store_history, silent=silent)
    File "/usr/local/lib/python3.8/dist-packages/ipykernel/zmqshell.py", line 536, in run_cell
      return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 2854, in run_cell
      result = self._run_cell(
    File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 2881, in _run_cell
      return runner(coro)
    File "/usr/local/lib/python3.8/dist-packages/IPython/core/async_helpers.py", line 68, in _pseudo_sync_runner
      coro.send(None)
    File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 3057, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 3249, in run_ast_nodes
      if (await self.run_code(code, result,  async_=asy)):
    File "/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py", line 3326, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "<ipython-input-6-287dce801bee>", line 18, in <module>
      model.fit(ds_train, epochs=1)
    File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1650, in fit
      tmp_logs = self.train_function(iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1249, in train_function
      return step_function(self, iterator)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1233, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1222, in run_step
      outputs = model.train_step(data)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1027, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/usr/local/lib/python3.8/dist-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 588, in minimize
      return self.apply_gradients(grads_and_vars, name=name)
    File "/usr/local/lib/python3.8/dist-packages/keras/mixed_precision/loss_scale_optimizer.py", line 837, in apply_gradients
      maybe_apply_op = tf.__internal__.smart_cond.smart_cond(
    File "/usr/local/lib/python3.8/dist-packages/keras/mixed_precision/loss_scale_optimizer.py", line 821, in do_not_apply_fn
      return self._optimizer.iterations.assign_add(1, read_value=False)
Node: 'cond_1/AssignAddVariableOp'
Cannot update variable with shape [0] using a Tensor with shape [], shapes must be equal.
     [[{{node cond_1/AssignAddVariableOp}}]] [Op:__inference_fn_with_cond_1266]
sushreebarsa commented 1 year ago

@Farbdose Thank you for reporting this issue! Could you please provide access to the standalone code? You may share the Colab gist as well if possible. Thank you!

Farbdose commented 1 year ago

@sushreebarsa Oh sorry, that slipped my mind. Done.

sushreebarsa commented 1 year ago

@SuryanarayanaY I was able to replicate the issue on Colab; please find the gist here. Thank you!

SuryanarayanaY commented 1 year ago

Hi @Farbdose, this might be due to the note below from the tfa.optimizers.MultiOptimizer API:

Note: Currently, tfa.optimizers.MultiOptimizer does not support callbacks that modify optimizers.

Setting the policy to mixed_float16 automatically applies loss scaling, and that is what causes the error.

If you choose a policy that does not apply loss scaling, there is no error. For example, I tried setting the policy to 'float64' and 'float32', and in both cases no error occurred, as per the attached gist.

Thank you!
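
As a quick check that the policy itself introduces the loss-scale wrapper, one can inspect the optimizer after compile. A minimal standalone sketch (under a mixed_float16 policy, compile wraps whatever optimizer it is given in a LossScaleOptimizer):

import tensorflow as tf
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mse')

# compile wrapped the plain Adam because of the global policy
print(type(model.optimizer))  # expected: a LossScaleOptimizer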

Farbdose commented 1 year ago

@SuryanarayanaY Unfortunately I need mixed_float16 due to RAM constraints. I did some digging, and it looks like the init code for the iterations variable is never triggered. I found a hacky solution for TensorFlow 2.10.0 (which is what I actually need; for some reason the error persists in 2.12 even though the code in question was changed).

In 2.10, iterations is initialized via its getter, so I added a

trigger_iterations_init_to_bypass_issue17414 = self._optimizer.iterations

here https://github.com/keras-team/keras/blob/v2.10.0/keras/mixed_precision/loss_scale_optimizer.py#L669

That works for now. I think the main problem is this: https://github.com/keras-team/keras/blob/master/keras/mixed_precision/loss_scale_optimizer.py#L645
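
For anyone who cannot edit the installed package, the same one-line fix could also be applied as a monkey patch from user code. A rough, untested sketch for Keras 2.10 (it just forces the lazy getter to run before the delegated apply):

from keras.mixed_precision import loss_scale_optimizer as lso

_orig_apply_gradients = lso.LossScaleOptimizer.apply_gradients

def _apply_gradients_with_init(self, grads_and_vars, name=None, **kwargs):
    _ = self._optimizer.iterations  # trigger lazy creation, as in the source edit above
    return _orig_apply_gradients(self, grads_and_vars, name=name, **kwargs)

lso.LossScaleOptimizer.apply_gradients = _apply_gradients_with_init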

gergely-soti commented 1 year ago

I ran into the same problem today. Thanks to your hacky solution @Farbdose, I was able to fix it by triggering the initialization of iterations right after initializing the LossScaleOptimizer:

    import tensorflow as tf
    from tensorflow_addons.optimizers import MultiOptimizer

    # (model is the user's model, defined elsewhere)
    optimizer_nerf = tf.keras.optimizers.Adam()
    optimizer_feature = tf.keras.optimizers.Adam()
    optimizers_and_layers = [
        (optimizer_nerf, model.layers[:2]),
        (optimizer_feature, model.layers[2:])
    ]
    optimizer = MultiOptimizer(optimizers_and_layers)
    optimizer = tf.keras.mixed_precision.LossScaleOptimizer(optimizer)
    # touching the property eagerly creates the inner optimizer's iterations
    # variable before model.fit traces the train step
    optimizer._optimizer.iterations

SuryanarayanaY commented 1 year ago

@Farbdose, thanks for your hacky solution. Let's see how it might help us.

ianstenbit commented 1 year ago

@qlzh727 could you take a look?

Looking through the optimizer code, I don't see why calling the iterations property updates the underlying tf variable. I see that you authored https://github.com/keras-team/keras/blob/master/keras/optimizers/optimizer.py#L96 -- do you have some context about how this initialization might be failing in this case?

Farbdose commented 1 year ago

@ianstenbit @qlzh727 Calling the iterations property is a 2.10 (and, I think, 2.11) only solution, because in 2.10 the iterations initialization is handled in this getter: https://github.com/keras-team/keras/blob/b80dd12da9c0bc3f569eca3455e77762cf2ee8ef/keras/optimizers/optimizer_v2/optimizer_v2.py#L1136. I haven't figured out why it's not working in 2.12 yet. In 2.12, iterations is initialized here: https://github.com/keras-team/keras/blob/541177c71887172d11514cda24067f7ab8d8440e/keras/optimizers/optimizer.py#L93. I suspect that the constructor of OptimizerV2 is never actually called, though, based on this comment: https://github.com/keras-team/keras/blob/541177c71887172d11514cda24067f7ab8d8440e/keras/mixed_precision/loss_scale_optimizer.py#L645
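
For reference, the 2.10 getter linked above follows the usual lazy-creation pattern; roughly paraphrased (not the verbatim Keras source):

@property
def iterations(self):
    if self._iterations is None:
        # first access creates the variable; if that first access only
        # happens inside the traced train function, creation lands in the graph
        self._iterations = self.add_weight(
            "iter", shape=[], dtype=tf.int64, trainable=False)
    return self._iterations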


Update: I had to update my Colab with

!pip install tensorflow -U
!pip install tf-nightly tfds-nightly tfa-nightly

hopefully that didn't create a version mismatch...

Based on my analysis here: https://colab.research.google.com/drive/17rvRgYM6T8MDD0kkkpL_herqcoOMVhYj?usp=sharing

the optimizer has a properly set _iterations (debunking my assumption from above), but the MultiOptimizer hasn't. I'm struggling to find out where the actual code is coming from. Running

!grep -rnw '/usr/local/lib/python3.8/dist-packages' -e 'def iterations' -A 10

inside Colab finds the init code at /usr/local/lib/python3.8/dist-packages/keras/optimizers/optimizer_v2/optimizer_v2.py:1145, which by my understanding shouldn't be there in 2.12.

So I'm a bit lost here, as apparently my Colab doesn't have the Keras version it claims to have...
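
A quick way to check which Keras actually got imported (both the version string and the file it was loaded from):

import tensorflow as tf
import keras

print(tf.__version__)
print(keras.__version__, keras.__file__)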


So basically this still works in 2.12 (at least in the Colab above), but I have no idea why:

# reusing `optimizers` and `model` from the reproduction above
optimizers_and_layers = [
    (optimizers[0], model.layers[:2]),
    (optimizers[1], model.layers[2:])
]
optimizer = tfa.optimizers.MultiOptimizer(optimizers_and_layers)
optimizer.iterations  # eager access initializes the variable

Update 2: To create even more confusion: the original problem was that the line optimizer.iterations.assign_add(1, read_value=False) crashed because iterations wasn't initialized, even though that should happen through the getter. I found out by chance that the error goes away if I access iterations by hand before that line. Now I tried executing the exact same line manually to trigger the init, and it works; the same line, just run from my file, works, but if it runs inside the graph executor... boom.

So maybe this line running inside the graph executor is the actual problem?
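
One way to test that hypothesis end to end is to run the reproduction twice, once with and once without the eager pre-touch. A sketch mirroring the code above (try_fit and pretouch are names made up for this experiment):

import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_datasets as tfds
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')

ds, = tfds.load('mnist', split=['train'], as_supervised=True)
ds = ds.batch(32).take(10)  # a few batches are enough for the experiment

def try_fit(pretouch):
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10),
    ])
    optimizers_and_layers = [
        (tf.keras.optimizers.Adam(1e-3), model.layers[:2]),
        (tf.keras.optimizers.Adam(2e-3), model.layers[2:]),
    ]
    opt = tfa.optimizers.MultiOptimizer(optimizers_and_layers)
    if pretouch:
        opt.iterations  # eager access: the variable is created outside the graph
    model.compile(
        optimizer=opt,  # compile wraps this in a LossScaleOptimizer under mixed_float16
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    model.fit(ds, epochs=1)

try_fit(pretouch=True)   # expected: trains normally
try_fit(pretouch=False)  # expected: reproduces the InvalidArgumentError above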