keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

TensorFlow executes slowly on GPU - retracing issue? #20369

Open david-eidmann opened 18 hours ago

david-eidmann commented 18 hours ago

Hi, I already asked this question on Stack Overflow but didn't get any responses, so I'll try here: I am trying to develop a transformer sequence-to-vector model but am running into performance issues. I am working with a Tesla V100-PCIE-16GB. Whenever the model encounters an unseen sequence length, the (training) step takes much longer. I assume this is due to tracing, i.e. building the TensorFlow graph, but I do not get any retracing warnings. I know there is the @tf.function(reduce_retracing=True) decorator, but I would have to rewrite the MultiHeadAttention class from scratch and decorate every function. I am using TensorFlow version 2.17.0.

Here is code for replication. The snippet builds a small dummy network and trains it on 100 batches. The first 10 batches have varying shapes and take around 1.5 seconds per batch; the last 90 batches all have the same shape and take about 0.002 seconds per batch.

How to approach this issue?

import datetime
import numpy as np
import tensorflow as tf
from tensorflow import keras

# only use the first GPU
gpus = tf.config.list_physical_devices("GPU")
assert gpus, "no GPUs found"
gpu = gpus[0]
tf.config.set_visible_devices([gpu], "GPU")
tf.config.experimental.set_memory_growth(gpu, True)

nclasses = 3
d_model = 8

# A custom Data Generator, creating random samples
class DG(keras.utils.PyDataset):
    def __init__(self):
        self.nSamplesWithAlternatingShapes = 10
        super().__init__()
        self.last=datetime.datetime.now()

    def __len__(self):
        return 100

    def __getitem__(self, index):
        n = self.nSamplesWithAlternatingShapes
        if index < self.nSamplesWithAlternatingShapes:
            n = index
        x = np.random.random((32, n + 10, d_model))
        y = np.random.random((32, nclasses))
        y = np.argmax(y, axis=1)
        y = tf.keras.utils.to_categorical(y, num_classes=nclasses)
        now=datetime.datetime.now()
        print(f"\n{round((now-self.last).total_seconds(), ndigits=3)} secs elapsed")
        self.last=now
        return tf.convert_to_tensor(x), y

# A Test layer, incorporating a MultiHeadAttention Layer
class TestLayer(tf.keras.layers.Layer):
    def __init__(self):
        super().__init__()
        # some dummy multi head attention
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=1, value_dim=d_model, key_dim=d_model, dropout=0.3
        )
        self.dense = tf.keras.layers.Dense(nclasses, activation="softmax")

    def call(self, x):
        x = self.mha(x, x, x)
        pool = tf.keras.layers.GlobalAveragePooling1D(data_format="channels_last")(x)
        fin = self.dense(pool)
        return fin

if __name__ == "__main__":

    # The first dimension is the sequence dimension
    input = tf.keras.layers.Input((None, d_model), name="test", dtype=np.float32)
    output = TestLayer()(input)
    model = tf.keras.models.Model(inputs=[input], outputs=output)
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=[tf.keras.metrics.CategoricalAccuracy(name="OA")],
    )

    model.fit(DG(), shuffle=False)

    print("done")
mehtamansi29 commented 14 hours ago

Hi @david-eidmann -

Thanks for reporting the issue. I have reproduced the issue on macOS with this GPU: "[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]". Without using @tf.function(reduce_retracing=True), these are the timings I get for each batch:

  1/100 ━━━━━━━━━━━━━━━━━━━━ 2:11 1s/step - OA: 0.2812 - loss: 1.1340
0.111 secs elapsed
  2/100 ━━━━━━━━━━━━━━━━━━━━ 10s 111ms/step - OA: 0.2812 - loss: 1.1266
0.104 secs elapsed
  3/100 ━━━━━━━━━━━━━━━━━━━━ 10s 108ms/step - OA: 0.2812 - loss: 1.1226
0.094 secs elapsed
  4/100 ━━━━━━━━━━━━━━━━━━━━ 9s 103ms/step - OA: 0.2891 - loss: 1.1194 
0.101 secs elapsed
  5/100 ━━━━━━━━━━━━━━━━━━━━ 9s 103ms/step - OA: 0.2988 - loss: 1.1180
0.099 secs elapsed
  6/100 ━━━━━━━━━━━━━━━━━━━━ 9s 102ms/step - OA: 0.3071 - loss: 1.1164
0.095 secs elapsed
  7/100 ━━━━━━━━━━━━━━━━━━━━ 9s 101ms/step - OA: 0.3155 - loss: 1.1148
0.126 secs elapsed
  8/100 ━━━━━━━━━━━━━━━━━━━━ 9s 105ms/step - OA: 0.3244 - loss: 1.1130
0.108 secs elapsed
  9/100 ━━━━━━━━━━━━━━━━━━━━ 9s 106ms/step - OA: 0.3297 - loss: 1.1118
0.12 secs elapsed
 10/100 ━━━━━━━━━━━━━━━━━━━━ 9s 107ms/step - OA: 0.3323 - loss: 1.1112
0.094 secs elapsed
 11/100 ━━━━━━━━━━━━━━━━━━━━ 9s 105ms/step - OA: 0.3349 - loss: 1.1106
0.023 secs elapsed

Could you please try to run your code without using @tf.function(reduce_retracing=True)?

david-eidmann commented 14 hours ago

Hi @mehtamansi29, thanks a lot for your response! I am not using @tf.function(reduce_retracing=True) anywhere in the code example, so I am not exactly sure what you mean. I am glad you were able to reproduce the issue.

I adjusted the example, producing a more readable output like this:

100/100 ━━━━━━━━━━━━━━━━━━━━ 18s 139ms/step - OA: 0.3399 - loss: 1.1052
time per step with alternating shapes : 1.33098 sec
time per step with similar sized shapes : 0.048 sec
done

Here is the adjusted example:

import datetime
import numpy as np
import tensorflow as tf
from tensorflow import keras

# only use the first GPU
gpus = tf.config.list_physical_devices("GPU")
assert gpus, "no GPUs found"
gpu = gpus[0]
tf.config.set_visible_devices([gpu], "GPU")
tf.config.experimental.set_memory_growth(gpu, True)

nclasses = 3
d_model = 8

# A custom Data Generator, creating random samples
class DG(keras.utils.PyDataset):
    def __init__(self):
        self.nSamplesWithAlternatingShapes = 10
        super().__init__()
        self.last = datetime.datetime.now()
        self.time_taken={}

    def __len__(self):
        return 100

    def __getitem__(self, index):
        n = self.nSamplesWithAlternatingShapes
        if index < self.nSamplesWithAlternatingShapes:
            n = index
        x = np.random.random((32, n + 10, d_model))
        y = np.random.random((32, nclasses))
        y = np.argmax(y, axis=1)
        y = tf.keras.utils.to_categorical(y, num_classes=nclasses)
        now = datetime.datetime.now()
        # print(f"\n{round((now-self.last).total_seconds(), ndigits=3)} secs elapsed")
        self.time_taken[index]=(now-self.last).total_seconds()
        self.last = now
        return tf.convert_to_tensor(x), y

    def printElapsedTime(self):
        alternatingIndices=list(range(self.nSamplesWithAlternatingShapes))

        alternatingSizeSamplesDurations=[self.time_taken[i] for i in self.time_taken if i in alternatingIndices]
        time_per_alternating_shape=np.mean(alternatingSizeSamplesDurations)
        print(f"time per step with alternating shapes : {time_per_alternating_shape:.5f} sec")

        similarSizedSamplesDurations=[self.time_taken[i] for i in self.time_taken if i not in alternatingIndices]
        time_per_similar_shape=np.mean(similarSizedSamplesDurations)
        print(f"time per step with similar sized shapes : {time_per_similar_shape:.3f} sec")

# A Test layer, incorporating a MultiHeadAttention Layer
class TestLayer(tf.keras.layers.Layer):
    def __init__(self):
        super().__init__()
        # some dummy multi head attention
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=1, value_dim=d_model, key_dim=d_model, dropout=0.3
        )
        self.dense = tf.keras.layers.Dense(nclasses, activation="softmax")

    def call(self, x):
        x = self.mha(x, x, x)
        pool = tf.keras.layers.GlobalAveragePooling1D(data_format="channels_last")(x)
        fin = self.dense(pool)
        return fin

if __name__ == "__main__":
    # The first dimension is the sequence dimension
    input = tf.keras.layers.Input((None, d_model), name="test", dtype=np.float32)
    output = TestLayer()(input)
    model = tf.keras.models.Model(inputs=[input], outputs=output)
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=[tf.keras.metrics.CategoricalAccuracy(name="OA")],
    )
    dg=DG()
    model.fit(dg, shuffle=False)
    dg.printElapsedTime()

    print("done")
mehtamansi29 commented 11 hours ago

Hi @david-eidmann -

With your adjusted code I am getting a time per step of about 0.2 sec, which seems reasonable:

100/100 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - OA: 0.3150 - loss: 1.1009
time per step with alternating shapes : 0.20559 sec
time per step with similar sized shapes : 0.023 sec
done

Here is the GPU configuration that gives the above result:

GPU Configuration:
Graphics/Displays:
    Apple M1 Pro:
      Chipset Model: Apple M1 Pro
      Type: GPU
      Bus: Built-In
      Total Number of Cores: 14
      Vendor: Apple (0x106b)
      Metal Support: Metal 3

Can you share your GPU configuration details (i.e. chipset model, total number of cores, etc.)?

david-eidmann commented 9 hours ago

Hi - this is my graphics card (the Tesla V100-PCIE-16GB mentioned above, see attached image) and it is utilized during training. It has 640 Tensor Cores. I do not think this issue has anything to do with the graphics card; it is due to the retracing behavior in Keras/TensorFlow.

Sure, 0.2 sec per step is fine, but this is only a small example. My real model has thousands of times as many parameters, and full GPU utilization is necessary. In this example, my machine is about 10 times faster when the batches all have the same shape; on your machine the factor appears smaller. With larger models the effect gets even bigger.

I assume Keras is retracing the graph for every new input shape. There must be some way to avoid this retracing, as it should not be necessary here.
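One workaround that could limit the retracing (a sketch, under the assumption that zero-padded timesteps are acceptable for the model, possibly combined with an attention mask) is to pad every batch up to the next multiple of a fixed bucket size, so only a handful of distinct sequence lengths ever reach the model:

import numpy as np

BUCKET = 16  # bucket size is an arbitrary choice for this sketch

def pad_to_bucket(x, bucket=BUCKET):
    # Zero-pad the sequence axis (axis 1) of a (batch, seq, features) array
    # up to the next multiple of `bucket`, so only a few shapes get traced.
    seq_len = x.shape[1]
    target = int(np.ceil(seq_len / bucket)) * bucket
    return np.pad(x, ((0, 0), (0, target - seq_len), (0, 0)))

# Inside DG.__getitem__ this would become:
#     x = np.random.random((32, n + 10, d_model))
#     x = pad_to_bucket(x)
#     return tf.convert_to_tensor(x), y

Depending on the real model, the padded positions may need to be masked out, e.g. via the attention_mask argument of MultiHeadAttention; whether that is required here is an assumption to verify.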