Incorrect benchmarking results

eawer commented 10 months ago

Greetings! I took the Vector Addition tutorial and added a couple of other benchmarks (for numpy and tensorflow), and the benchmark results don't seem accurate anymore: the graph and the dataframe show that tensorflow is ~1k times faster than the torch and triton implementations, though performance testing with %%timeit gives nearly the same results for tensorflow and torch:

%%timeit
with tf.device('/GPU:0'):
    x = tf.random.uniform(shape=(268435456,), dtype=tf.float32)
    y = tf.random.uniform(shape=(268435456,), dtype=tf.float32)
    tf.add(x, y)
# 22.8 ms ± 9.69 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
x = torch.rand(268435456, device='cuda', dtype=torch.float32)
y = torch.rand(268435456, device='cuda', dtype=torch.float32)
torch.add(x, y)
# 22.9 ms ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Results of benchmark.run:

Complete code is here - https://colab.research.google.com/drive/16HF6k5wGoqfnDg0uqHKe_B0vXTpzPDfn?usp=sharing

ThomasRaoux commented 10 months ago

What hardware is this on? Is it an A100? The peak bandwidth on an A100 is 1555 GB/s, so the measurements for TF are most likely wrong.
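
The bandwidth argument can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the notebook: it assumes a vector add moves three arrays through memory (two reads, one write), that 1 GB = 1e9 bytes, and that the helper names `bytes_moved`, `effective_gbps`, and `is_plausible` are hypothetical. Any measurement that implies more than the device's peak bandwidth cannot be a real end-to-end timing of the kernel.

```python
# Hypothetical helpers for sanity-checking a vector-add timing against
# a device's peak memory bandwidth. Pure Python, no GPU required.

def bytes_moved(n_elements: int, element_size: int) -> int:
    """Bytes touched by z = x + y: read x, read y, write z."""
    return 3 * n_elements * element_size


def effective_gbps(n_elements: int, element_size: int, elapsed_ms: float) -> float:
    """Bandwidth in GB/s (1 GB = 1e9 bytes) implied by a measured time in ms."""
    seconds = elapsed_ms * 1e-3
    return bytes_moved(n_elements, element_size) / seconds / 1e9


def is_plausible(n_elements: int, element_size: int,
                 elapsed_ms: float, peak_gbps: float) -> bool:
    """True if the implied bandwidth does not exceed the stated peak."""
    return effective_gbps(n_elements, element_size, elapsed_ms) <= peak_gbps


if __name__ == "__main__":
    n = 268_435_456          # vector length used in the %%timeit cells
    elem = 4                 # float32 is 4 bytes
    peak = 1555.0            # A100 peak bandwidth in GB/s, as quoted in this thread

    # A ~22.8 ms timing versus a hypothetical one that is 1000x faster.
    for ms in (22.8, 0.0228):
        gbps = effective_gbps(n, elem, ms)
        print(f"{ms} ms -> {gbps:.1f} GB/s, plausible: {is_plausible(n, elem, ms, peak)}")
```

Pass the peak bandwidth of the GPU actually in use as `peak_gbps`. The 1555 GB/s figure here is only the A100 number mentioned in this thread.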

eawer commented 10 months ago

It's on Colab's free T4 GPU.

triton-lang / triton

Incorrect benchmarking results #2720