Open awayzjj opened 3 months ago
Looks good to me! We need someone who can trigger the CI and make sure everything passes, but it feels like a very safe change.
I'm a bit surprised that this actually gives this much of an improvement. IIRC, the original idea here was that we're global-memory bound anyway, so we might as well use the numerically stable formulation.
@gordicaleksa @karpathy Hi, since kernel 4 already uses a more clever way to compute the variance, `var(x) = mean(x**2) - mean(x)**2`, I am wondering whether kernel 6, which is used in the main `train_gpt2.cu`, can adopt the same technique.
I conducted tests on an A100 80G (from Modal) with the following results:
Before the change:
After the change:
It seems to work for every block size!
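For readers who want to see the two formulations side by side, here is a minimal CPU-side sketch in plain C. It is not the kernel code from `train_gpt2.cu`; the function names `variance_two_pass` and `variance_one_pass` are hypothetical and exist only for illustration. Both compute the population variance (dividing by `n`). The one-pass form reads the input once, which is presumably where the speedup comes from, while the two-pass form is the numerically more robust one: when the mean is large relative to the spread, subtracting `mean * mean` from `mean(x**2)` can lose precision in `float`.

```c
#include <stddef.h>

/* Two-pass variance (hypothetical helper, for illustration only).
   Pass 1 computes the mean, pass 2 averages the squared deviations.
   Reads x twice, but avoids subtracting two large, nearly equal numbers. */
float variance_two_pass(const float *x, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
    }
    float mean = sum / (float)n;

    float sq_dev = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = x[i] - mean;
        sq_dev += d * d;
    }
    return sq_dev / (float)n;
}

/* One-pass variance (hypothetical helper, for illustration only).
   Accumulates sum(x) and sum(x*x) in a single loop, then applies
   var(x) = mean(x**2) - mean(x)**2. Reads x only once. */
float variance_one_pass(const float *x, size_t n) {
    float sum = 0.0f;
    float sum_sq = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
        sum_sq += x[i] * x[i];
    }
    float mean = sum / (float)n;
    return sum_sq / (float)n - mean * mean;
}
```

On small, well-scaled inputs such as `{1, 2, 3, 4}` both functions agree. Whether the precision trade-off matters for layernorm activations in practice is something the CI and a loss comparison would need to confirm.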