keras-team / tf-keras

The TensorFlow-specific implementation of the Keras API, which was the default Keras from 2019 to 2023.
Apache License 2.0

Tensorflow optimizer.apply_gradients is very slow. #238

Closed anshkumar closed 1 year ago

anshkumar commented 1 year ago

I'm using a custom training loop to train a model built from EfficientNetV2 with BiFPN. I found that optimizer.apply_gradients was running very slowly: it took around 1.7 seconds, while the model itself took only 0.1 seconds. I tried to replicate this with a smaller example, so I followed the tutorial here. I found a similar issue there as well: apply_gradients takes twice as long as model+loss. Is this behavior expected? I tried different versions of TensorFlow and saw the same behavior.

Model+Loss  0.027350187301635742
Grad  0.027795791625976562
Apply Grad  0.056526899337768555

Standalone code to reproduce the issue: The code used can be found here: https://colab.research.google.com/github/tensorflow/docs/blob/snapshot-keras/site/en/guide/keras/writing_a_training_loop_from_scratch.ipynb

TensorFlow version: 2.13
OS platform and distribution: Ubuntu 20.04
Python version: 3.10

sushreebarsa commented 1 year ago

@anshkumar Could you please use the stable TF version 2.11 instead of nightly? Please find the attached gist and confirm the issue. Thank you!

anshkumar commented 1 year ago

Have already tried using tf 2.10, 2.11, 2.12.

sushreebarsa commented 1 year ago

@anshkumar Thank you for the quick update! @SuryanarayanaY Could you please have a look at this issue? Thank you!

SuryanarayanaY commented 1 year ago

Hi @anshkumar ,

In the attached Colab gist I can't find the performance comparison that you mentioned. I can only see the training time comparison with and without the tf.function annotation. Could you please share a reproducible gist for the reported performance issue, or explain which steps you are comparing in the attached gist?

Normally model configuration and loss calculation take very much time compared to gradient calculation and applying the gradients to update weights which is computationally expensive.

Thanks!

anshkumar commented 1 year ago

I've added time calculation to train_step as follows:

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        tic = time.time()
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
        tf.print("Model Time: ", time.time()-tic)
    tic = time.time()
    grads = tape.gradient(loss_value, model.trainable_weights)
    tf.print("Grad Time: ", time.time()-tic)
    tic = time.time()
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    tf.print("Apply Grad Time: ", time.time()-tic)
    train_acc_metric.update_state(y, logits)
    return loss_value
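
A caveat on the timings printed by this train_step: inside a function decorated with @tf.function, Python calls such as time.time() run only while the function is being traced. The differences passed to tf.print are therefore plain Python floats computed once at trace time, so they most likely reflect how long each part of the graph took to build rather than how long it takes to execute. A minimal sketch of an alternative is to time each stage from outside, as a callable, with a few warm-up calls so that tracing and variable creation are excluded. The helper name time_stage and its parameters are hypothetical and not part of TensorFlow:

```python
import statistics
import time


def time_stage(fn, warmup=2, repeats=10):
    """Return the median wall-clock time (seconds) of calling fn().

    Hypothetical helper, not a TensorFlow API. The warm-up calls are
    discarded so that one-off costs (for example tf.function tracing or
    optimizer slot-variable creation) do not end up in the samples.
    """
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

To use it with the loop above, each stage would be wrapped in its own callable, for example time_stage(lambda: optimizer.apply_gradients(zip(grads, model.trainable_weights))). On a GPU, operations may be dispatched asynchronously, so it is probably safer for the callable to convert a result to a host value (such as calling .numpy() on a returned tensor) before it returns.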

anshkumar commented 1 year ago

Also, increase the number of units in each hidden layer of the model to 256.

inputs = keras.Input(shape=(784,), name="digits")
x1 = layers.Dense(256, activation="relu")(inputs)
x2 = layers.Dense(256, activation="relu")(x1)
outputs = layers.Dense(10, name="predictions")(x2)
model = keras.Model(inputs=inputs, outputs=outputs)

SuryanarayanaY commented 1 year ago

Hi @anshkumar ,

For each iteration, I see model()+loss, the gradient computation and apply_gradients taking 0.0340, 0.0248 and 0.0542 seconds respectively, as per the attached gist.

Please refer to the code for apply_gradients below.

  def apply_gradients(self, grads_and_vars, name=None):
      self._compute_current_learning_rate()
      grads_and_vars = list(grads_and_vars)
      if len(grads_and_vars) == 0:
          # It is possible that the grad is empty. In this case,
          # `apply_gradients` is a no-op.
          return self._iterations
      grads, trainable_variables = zip(*grads_and_vars)
      scope_name = name or self.name or "optimizer"
      with tf.name_scope(scope_name):
          with tf.init_scope():
              # Lift variable creation to init scope to avoid environment
              # issues.
              self.build(trainable_variables)
      grads_and_vars = list(zip(grads, trainable_variables))
      grads_and_vars = optimizer_utils.filter_empty_gradients(grads_and_vars)
      if len(list(grads_and_vars)) == 0:
          # Check again after filtering gradients.
          return self._iterations

      grads, trainable_variables = zip(*grads_and_vars)

      grads = self._clip_gradients(grads)
      grads = self._deduplicate_sparse_grad(grads)
      self._apply_weight_decay(trainable_variables)
      grads_and_vars = list(zip(grads, trainable_variables))
      iteration = self._internal_apply_gradients(grads_and_vars)

      # Apply variable constraints after applying gradients.
      for variable in trainable_variables:
          if variable.constraint is not None:
              variable.assign(variable.constraint(variable))
      return iteration

The apply_gradients method also goes through _clip_gradients(grads), _deduplicate_sparse_grad(grads) and _apply_weight_decay(trainable_variables), depending on the parameters passed to the optimizer, so it is computationally expensive compared to the other steps. If we pass parameters such as momentum, weight_decay, clipnorm or clipvalue to the optimizer, I believe the computation takes even more time.
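
To make the point about extra per-variable work concrete, here is a simplified NumPy sketch of an SGD-style update with optional clipping and weight decay. It is an illustration only and not the Keras implementation: the function names, the plain SGD rule, and the exact decay formula are assumptions chosen to mirror the order of steps in the method quoted above.

```python
import numpy as np


def clip_by_norm(grad, clipnorm):
    # Rescale the gradient so its L2 norm does not exceed clipnorm.
    norm = np.linalg.norm(grad)
    if norm > clipnorm:
        return grad * (clipnorm / norm)
    return grad


def apply_gradients_sketch(grads_and_vars, lr=0.1, clipnorm=None,
                           weight_decay=None):
    """Update each variable in place; return how many were updated.

    Hypothetical, simplified stand-in for an optimizer step. Every
    enabled option adds one more pass over each variable.
    """
    # Drop variables that received no gradient.
    pairs = [(g, v) for g, v in grads_and_vars if g is not None]

    for grad, var in pairs:
        if clipnorm is not None:
            grad = clip_by_norm(grad, clipnorm)   # extra pass: clipping
        if weight_decay is not None:
            var -= var * weight_decay * lr        # extra pass: decay
        var -= lr * grad                          # the actual update
    return len(pairs)
```

With no options set, the loop does one update per variable. With clipnorm and weight_decay both set, it does roughly three passes per variable, which is one plausible reason the step grows more expensive as options are added.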

AFAIK there is currently no benchmark comparing these steps. But apply_gradients will take more time than the other steps. If you have an idea that could improve performance, please feel free to share it and raise a PR.

Thanks for bringing this up.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.
