Open david-eidmann opened 18 hours ago
Hi @david-eidmann -
Thanks for reporting the issue. I have reproduce the issue in the macOS with this GPU "[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]". Without using @tf.function(reduce_retracing=True)
And I am able to get for each batch.
1/100 ━━━━━━━━━━━━━━━━━━━━ 2:11 1s/step - OA: 0.2812 - loss: 1.1340
0.111 secs elapsed
2/100 ━━━━━━━━━━━━━━━━━━━━ 10s 111ms/step - OA: 0.2812 - loss: 1.1266
0.104 secs elapsed
3/100 ━━━━━━━━━━━━━━━━━━━━ 10s 108ms/step - OA: 0.2812 - loss: 1.1226
0.094 secs elapsed
4/100 ━━━━━━━━━━━━━━━━━━━━ 9s 103ms/step - OA: 0.2891 - loss: 1.1194
0.101 secs elapsed
5/100 ━━━━━━━━━━━━━━━━━━━━ 9s 103ms/step - OA: 0.2988 - loss: 1.1180
0.099 secs elapsed
6/100 ━━━━━━━━━━━━━━━━━━━━ 9s 102ms/step - OA: 0.3071 - loss: 1.1164
0.095 secs elapsed
7/100 ━━━━━━━━━━━━━━━━━━━━ 9s 101ms/step - OA: 0.3155 - loss: 1.1148
0.126 secs elapsed
8/100 ━━━━━━━━━━━━━━━━━━━━ 9s 105ms/step - OA: 0.3244 - loss: 1.1130
0.108 secs elapsed
9/100 ━━━━━━━━━━━━━━━━━━━━ 9s 106ms/step - OA: 0.3297 - loss: 1.1118
0.12 secs elapsed
10/100 ━━━━━━━━━━━━━━━━━━━━ 9s 107ms/step - OA: 0.3323 - loss: 1.1112
0.094 secs elapsed
11/100 ━━━━━━━━━━━━━━━━━━━━ 9s 105ms/step - OA: 0.3349 - loss: 1.1106
0.023 secs elapsed
Could you please try to run your code without using @tf.function(reduce_retracing=True)
?
Hi @mehtamansi29 ,
Thanks a lot for your response! I am not using @tf.function(reduce_retracing=True)
anywhere in the code example . Therefore I am not exactly sure what you mean.
I am glad you were able to reproduce the issue.
I adjusted the example, producing a more readable output like this:
100/100 ━━━━━━━━━━━━━━━━━━━━ 18s 139ms/step - OA: 0.3399 - loss: 1.1052
time per step with alternating shapes : 1.33098 sec
time per step with similar sized shapes : 0.048 sec
done
Here is the adjusted example:
import datetime
import numpy as np
import tensorflow as tf
from tensorflow import keras
# only use the first gpu
gpus = tf.config.list_physical_devices("GPU")
assert gpus, "no gpus found"
gpu = gpus[1]
tf.config.set_visible_devices([gpu], "GPU")
tf.config.experimental.set_memory_growth(gpu, True)
nclasses = 3
d_model = 8
# A custom Data Generator, creating random samples
class DG(keras.utils.PyDataset):
def __init__(self):
self.nSamplesWithAlternatingShapes = 10
super().__init__()
self.last = datetime.datetime.now()
self.time_taken={}
def __len__(self):
return 100
def __getitem__(self, index):
n = self.nSamplesWithAlternatingShapes
if index < self.nSamplesWithAlternatingShapes:
n = index
x = np.random.random((32, n + 10, d_model))
y = np.random.random((32, nclasses))
y = np.argmax(y, axis=1)
y = tf.keras.utils.to_categorical(y, num_classes=nclasses)
now = datetime.datetime.now()
# print(f"\n{round((now-self.last).total_seconds(), ndigits=3)} secs elapsed")
self.time_taken[index]=(now-self.last).total_seconds()
self.last = now
return tf.convert_to_tensor(x), y
def printElapsedTime(self):
alternatingIndices=list(range(self.nSamplesWithAlternatingShapes))
alternatingSizeSamplesDurations=[self.time_taken[i] for i in self.time_taken if i in alternatingIndices]
time_per_alternating_shape=np.mean(alternatingSizeSamplesDurations)
print(f"time per step with alternating shapes : {time_per_alternating_shape:.5f} sec")
similarSizedSamplesDurations=[self.time_taken[i] for i in self.time_taken if i not in alternatingIndices]
time_per_similar_shape=np.mean(similarSizedSamplesDurations)
print(f"time per step with similar sized shapes : {time_per_similar_shape:.3f} sec")
# A Test layer, incorporating a MultiHeadAttention Layer
class TestLayer(tf.keras.layers.Layer):
def __init__(self):
super().__init__()
# some dummy multi head attention
self.mha = tf.keras.layers.MultiHeadAttention(
num_heads=1, value_dim=d_model, key_dim=d_model, dropout=0.3
)
self.dense = tf.keras.layers.Dense(nclasses, activation="softmax")
def call(self, x):
x = self.mha(x, x, x)
pool = tf.keras.layers.GlobalAveragePooling1D(data_format="channels_last")(x)
fin = self.dense(pool)
return fin
if __name__ == "__main__":
# The first dimension is the sequence dimension
input = tf.keras.layers.Input((None, d_model), name="test", dtype=np.float32)
output = TestLayer()(input)
model = tf.keras.models.Model(inputs=[input], outputs=output)
model.compile(
optimizer="adam",
loss="categorical_crossentropy",
metrics=[tf.keras.metrics.CategoricalAccuracy(name="OA")],
)
dg=DG()
model.fit(dg, shuffle=False)
dg.printElapsedTime()
print("done")
Hi @david-eidmann -
As per your adjusted code I am getting time per step in 0.20116 sec which is proper.
100/100 ━━━━━━━━━━━━━━━━━━━━ 4s 28ms/step - OA: 0.3150 - loss: 1.1009
time per step with alternating shapes : 0.20559 sec
time per step with similar sized shapes : 0.023 sec
done
Here is GPU configuration which gives above result:
GPU Configuration:
Graphics/Displays:
Apple M1 Pro:
Chipset Model: Apple M1 Pro
Type: GPU
Bus: Built-In
Total Number of Cores: 14
Vendor: Apple (0x106b)
Metal Support: Metal 3
Can you share GPU configuration details(i.e chipset model,total number of cores etc) ?
Hi - this is my graphics card and it is utilized during training: It has 640 Tensor Cores. I do not think this issue has anything to do with the graphics card, but is due to the retracing algorithm in keras/tensorflow.
Sure, 0.20116 sec per step is proper, but this here is only a small example. My real model has thousands of times the amount of parameters and the full GPU utilization is necessary. In this example, the performance is 10 times as high on my machine, when using similar shaped batches. On your machine the factor is only 2 how it seems. However, with larger models the effect gets even bigger.
I assume keras is retracing the graph for every new shape. There must be some way to avoid the retracing algorithm, as it is not necessary here.
Hi, I already asked this question on stackoverflow, but didn't get any responses. So I'll try here: I am trying to develop a transformer sequence to vector model but encounter performance issues. I am working with a Tesla V100-PCIE-16GB. Whenever the model encounters an unseen sequence length, the (training-) step takes much longer. I assume this is due to the tracing algorithm, creating the tensorflow graph. However, I do not get any retracing warnings or so. I know there is the @tf.function(reduce_retracing=True) decorator, but I would have to rewrite the MultiHeadAttention class from scratch and decorate every function. I am using tensorflow Version 2.17.0.
Here is code for replication. In this snippet I have a small dummy network, training on 100 samples. The first 10 batches have varying shapes and take around 1.5 seconds per batch. The last 90 batches have similar shapes, taking 0.002 seconds per batch.
How to approach this issue?