karpathy / llm.c

LLM training in simple, raw C/CUDA
MIT License

Improve kernel in layerNorm forward: adapt variance estimation method from kernel 4 for use in kernel 6 #645

Open awayzjj opened 3 months ago

awayzjj commented 3 months ago

@gordicaleksa @karpathy Hi, since kernel 4 already uses a cleverer way to estimate the variance, var(x) = mean(x**2) - mean(x)**2, I am wondering if kernel 6, the one used in the main train_gpt2.cu, could adopt the same technique?
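
For reference, here is a minimal sketch of what the single-pass formulation looks like (my own illustration, not the actual kernel 4 or kernel 6 code), assuming one warp per row and a block size that is a multiple of 32: the row is read once while accumulating sum(x) and sum(x*x) together, instead of one pass for the mean and a second pass for the variance.

#include <cuda_runtime.h>

// Sketch of the single-pass variance trick. Each warp handles one row of C
// elements; sum(x) and sum(x*x) are accumulated in the same loop, so the
// variance falls out as mean(x*x) - mean(x)^2 without re-reading the row.
__global__ void layernorm_forward_onepass(float* out, const float* inp,
                                          const float* weight, const float* bias,
                                          int N, int C) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    if (warp_id >= N) return;
    const float* x = inp + warp_id * C;

    // single pass over the row: accumulate sum and sum of squares together
    float sum = 0.0f, sum2 = 0.0f;
    for (int i = lane; i < C; i += 32) {
        float xi = x[i];
        sum += xi;
        sum2 += xi * xi;
    }
    // warp-level tree reduction of both accumulators
    for (int offset = 16; offset > 0; offset /= 2) {
        sum  += __shfl_down_sync(0xffffffff, sum,  offset);
        sum2 += __shfl_down_sync(0xffffffff, sum2, offset);
    }
    sum  = __shfl_sync(0xffffffff, sum,  0); // broadcast lane 0's totals
    sum2 = __shfl_sync(0xffffffff, sum2, 0);

    float mean = sum / C;
    float var  = sum2 / C - mean * mean;     // var(x) = mean(x^2) - mean(x)^2
    float rstd = rsqrtf(var + 1e-5f);

    // normalize, scale, and shift
    float* o = out + warp_id * C;
    for (int i = lane; i < C; i += 32) {
        o[i] = (x[i] - mean) * rstd * weight[i] + bias[i];
    }
}

The payoff is one global-memory traversal of the input instead of two before the final normalize pass, which is why it can help a bandwidth-bound kernel.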

I conducted tests on an A100 80GB (from Modal), with the following results:

Before the change:

block_size   32 | time 0.0466 ms | bandwidth 1079.39 GB/s
block_size   64 | time 0.0437 ms | bandwidth 1152.29 GB/s
block_size  128 | time 0.0434 ms | bandwidth 1160.60 GB/s
block_size  256 | time 0.0425 ms | bandwidth 1183.23 GB/s
block_size  512 | time 0.0433 ms | bandwidth 1162.06 GB/s
block_size 1024 | time 0.0437 ms | bandwidth 1151.32 GB/s

After the change:

block_size   32 | time 0.0449 ms | bandwidth 1120.00 GB/s
block_size   64 | time 0.0412 ms | bandwidth 1220.50 GB/s
block_size  128 | time 0.0407 ms | bandwidth 1237.67 GB/s
block_size  256 | time 0.0397 ms | bandwidth 1268.84 GB/s
block_size  512 | time 0.0405 ms | bandwidth 1243.80 GB/s
block_size 1024 | time 0.0412 ms | bandwidth 1221.16 GB/s

It seems to work for every block size!

ademeure commented 3 months ago

Looks good to me! We need someone who can trigger the CI and make sure everything passes, but it feels like a very safe change.

ngc92 commented 2 months ago

I'm a bit surprised that this actually gives this much of an improvement. IIRC, the original idea here was that we're global-memory bound anyway, so we might as well use the numerically stable formulation.
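
For anyone wondering what the stability concern is, here is a tiny standalone C illustration (made-up numbers, not taken from the kernels) of how mean(x**2) - mean(x)**2 can lose precision to cancellation when the mean is large relative to the spread, which is the trade-off against the two-pass form:

#include <stdio.h>

// When the mean dominates the spread, sum2/n and mean*mean are nearly equal
// large numbers, so their float32 difference can lose most or all of its
// significant digits. The two-pass form sums small squared deviations instead.
int main(void) {
    float x[4] = {10000.0f, 10001.0f, 10002.0f, 10003.0f};
    float sum = 0.0f, sum2 = 0.0f;
    for (int i = 0; i < 4; i++) { sum += x[i]; sum2 += x[i] * x[i]; }
    float mean = sum / 4.0f;
    float var_onepass = sum2 / 4.0f - mean * mean; // cancellation-prone

    float var_twopass = 0.0f;                      // numerically stable form
    for (int i = 0; i < 4; i++) {
        float d = x[i] - mean;
        var_twopass += d * d;
    }
    var_twopass /= 4.0f;

    printf("one-pass: %f  two-pass: %f  (exact: 1.25)\n",
           var_onepass, var_twopass);
    return 0;
}

With these values the one-pass answer collapses essentially to zero in float32 while the two-pass answer stays accurate; layernorm inputs in training are typically nowhere near this pathological, which is presumably why the change is safe in practice.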