Closed: anshkumar closed this issue 1 year ago.
@anshkumar Could you please use the stable TF version 2.11 instead of nightly? Please find the attached gist and confirm the issue. Thank you!
I have already tried using TF 2.10, 2.11 and 2.12.
@anshkumar Thank you for the quick update! @SuryanarayanaY Could you please have a look at this issue? Thank you!
Hi @anshkumar ,
In the attached colab gist I can't find the performance comparison you mentioned; I can only see the training-time comparison with and without the tf.function annotation. Could you please share a reproducible gist for the reported performance issue, or explain which steps you are comparing in the attached gist?
Normally the model's forward pass and the loss calculation take considerable time compared to computing the gradients, while applying the gradients to update the weights is also computationally expensive.
Thanks!
I've added timing to train_step as follows:
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        tic = time.time()
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
        tf.print("Model Time: ", time.time() - tic)
    tic = time.time()
    grads = tape.gradient(loss_value, model.trainable_weights)
    tf.print("Grad Time: ", time.time() - tic)
    tic = time.time()
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    tf.print("Apply Grad Time: ", time.time() - tic)
    train_acc_metric.update_state(y, logits)
    return loss_value
Also, I increased the number of units in the Dense layers to 256:
inputs = keras.Input(shape=(784,), name="digits")
x1 = layers.Dense(256, activation="relu")(inputs)
x2 = layers.Dense(256, activation="relu")(x1)
outputs = layers.Dense(10, name="predictions")(x2)
model = keras.Model(inputs=inputs, outputs=outputs)
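As a cross-check, here is a minimal eager-mode sketch of the same timing; it assumes model, loss_fn, optimizer and train_acc_metric are the ones defined in the tutorial. Because Python-level time.time() calls inside a @tf.function only run while the function is being traced, dropping the decorator makes the printed numbers reflect per-step execution rather than trace time:
import time
import tensorflow as tf

# Assumes model, loss_fn, optimizer and train_acc_metric come from the tutorial notebook.
# (On a GPU, wall-clock timings like these can still be skewed by asynchronous dispatch.)
def train_step_timed(x, y):
    tic = time.time()
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    print("Model Time: ", time.time() - tic)

    tic = time.time()
    grads = tape.gradient(loss_value, model.trainable_weights)
    print("Grad Time: ", time.time() - tic)

    tic = time.time()
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    print("Apply Grad Time: ", time.time() - tic)

    train_acc_metric.update_state(y, logits)
    return loss_value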
Hi @anshkumar ,
I see that for each iteration, model()+loss, the gradient computation and apply_gradients take 0.0340 s, 0.0248 s and 0.0542 s respectively, as per the attached gist.
Please refer to the code for apply_gradients below.
def apply_gradients(self, grads_and_vars, name=None):
    self._compute_current_learning_rate()
    grads_and_vars = list(grads_and_vars)
    if len(grads_and_vars) == 0:
        # It is possible that the grad is empty. In this case,
        # `apply_gradients` is a no-op.
        return self._iterations
    grads, trainable_variables = zip(*grads_and_vars)
    scope_name = name or self.name or "optimizer"
    with tf.name_scope(scope_name):
        with tf.init_scope():
            # Lift variable creation to init scope to avoid environment
            # issues.
            self.build(trainable_variables)
    grads_and_vars = list(zip(grads, trainable_variables))
    grads_and_vars = optimizer_utils.filter_empty_gradients(grads_and_vars)
    if len(list(grads_and_vars)) == 0:
        # Check again after filtering gradients.
        return self._iterations
    grads, trainable_variables = zip(*grads_and_vars)
    grads = self._clip_gradients(grads)
    grads = self._deduplicate_sparse_grad(grads)
    self._apply_weight_decay(trainable_variables)
    grads_and_vars = list(zip(grads, trainable_variables))
    iteration = self._internal_apply_gradients(grads_and_vars)
    # Apply variable constraints after applying gradients.
    for variable in trainable_variables:
        if variable.constraint is not None:
            variable.assign(variable.constraint(variable))
    return iteration
The apply_gradients method also goes through _clip_gradients(grads), _deduplicate_sparse_grad(grads) and _apply_weight_decay(trainable_variables), depending on the parameters passed to the Optimizer, which makes it computationally expensive compared to the other steps. If parameters such as momentum, weight_decay, clipnorm or clipvalue are passed to the Optimizer, I believe these computations take even more time.
AFAIK there is currently no benchmark comparing these steps, but apply_gradients will take more time than the other steps. If you have an idea that could improve performance, please feel free to share it and raise a PR.
Thanks for bringing this up.
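For a rough sense of how much this optional bookkeeping adds, apply_gradients can be micro-benchmarked in isolation along these lines. This is only a sketch, not an official Keras benchmark; the synthetic 256x256 variables, the step count and the chosen SGD settings are arbitrary assumptions:
import time
import tensorflow as tf

# Synthetic variables and gradients (assumption: shapes and count are arbitrary).
variables = [tf.Variable(tf.random.normal((256, 256))) for _ in range(10)]
grads = [tf.random.normal(v.shape) for v in variables]

def time_apply(optimizer, steps=100):
    # Warm-up call builds the optimizer's slot variables outside the timed loop.
    optimizer.apply_gradients(zip(grads, variables))
    start = time.time()
    for _ in range(steps):
        optimizer.apply_gradients(zip(grads, variables))
    return (time.time() - start) / steps

plain = tf.keras.optimizers.SGD(learning_rate=1e-3)
heavy = tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9,
                                weight_decay=1e-4, clipnorm=1.0)
print("plain SGD                           :", time_apply(plain))
print("SGD + momentum/weight_decay/clipnorm:", time_apply(heavy))
Keeping the warm-up call outside the timed loop excludes the one-time slot-variable creation that build() performs, so the per-step numbers only reflect the update itself.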
This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.
This issue was closed because it has been inactive for 28 days. Please reopen if you'd like to work on this further.
I'm using a custom training loop to train a model built from EfficientNetV2 with a BiFPN. I found that optimizer.apply_gradients was running very slowly: it took around 1.7 seconds, while the model call was taking only 0.1 seconds. I tried to replicate this with a smaller example, so I followed the tutorial here, and I see a similar issue there as well: apply_gradients again takes twice the time of model+loss. Is it fine to have such behavior? I tried a different version of TensorFlow, but found the same behavior.
Standalone code to reproduce the issue: the code used can be found here: https://colab.research.google.com/github/tensorflow/docs/blob/snapshot-keras/site/en/guide/keras/writing_a_training_loop_from_scratch.ipynb
TensorFlow version = 2.13
OS platform and distribution = Ubuntu 20.04
Python version = 3.10
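For anyone who wants to separate the three phases while still running them as compiled graphs, here is a hedged sketch (not the exact gist from the report): it splits the step into separately traced tf.functions so each phase can be timed from Python, and it assumes model, loss_fn, optimizer and train_dataset are defined as in the linked tutorial notebook. The gradient-only cost is roughly the difference between the first two timings.
import time
import tensorflow as tf

@tf.function
def forward_and_loss(x, y):
    logits = model(x, training=True)
    return loss_fn(y, logits)

@tf.function
def forward_loss_and_grads(x, y):
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        loss_value = loss_fn(y, logits)
    return tape.gradient(loss_value, model.trainable_weights)

@tf.function
def apply_only(grads):
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    # Return a tensor so the caller can force execution to finish.
    return tf.identity(optimizer.iterations)

def timed(fn, *args):
    start = time.time()
    out = fn(*args)
    # Materialize the outputs so the clock stops after the work is done.
    tf.nest.map_structure(lambda t: t.numpy(), out)
    return time.time() - start

# Warm up: trace each function once so the timings below exclude tracing.
for x_batch, y_batch in train_dataset.take(1):
    forward_and_loss(x_batch, y_batch)
    grads = forward_loss_and_grads(x_batch, y_batch)
    apply_only(grads)

for x_batch, y_batch in train_dataset.take(1):
    grads = forward_loss_and_grads(x_batch, y_batch)
    print("forward+loss       :", timed(forward_and_loss, x_batch, y_batch))
    print("forward+loss+grads :", timed(forward_loss_and_grads, x_batch, y_batch))
    print("apply_gradients    :", timed(apply_only, grads))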