karpathy opened this issue 2 months ago
Just posting some notes here from my research on how to remove all of the cooperative groups (CG) related code and drop the dependency:
```cuda
sum = cg::reduce(warp, sum, cg::plus<float>{});
```
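(For context, these notes assume roughly the usual cooperative groups setup in the llm.c kernels:)
```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// inside a kernel:
cg::thread_block block = cg::this_thread_block();
cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
```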
This `cg::reduce` call can be replaced with the following:
```cuda
// Butterfly reduction: XOR shuffles sum the value across all 32 lanes of a warp.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_xor_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}
```
and called as:
```cuda
sum = warpReduceSum(sum);
```
No extra thread syncs are needed: the `__shfl_xor_sync` intrinsic already synchronizes the lanes named in its mask.
The other cooperative groups variables that are used can be replaced as follows:
```cuda
int warpSize = 32; // note: this shadows the CUDA built-in warpSize
int laneId = threadIdx.x % warpSize;
int warpId = threadIdx.x / warpSize;
int warpsPerBlock = blockDim.x / warpSize;
```
with the correspondences:
```cuda
warp.thread_rank()     == laneId
warp.size()            == warpSize
warp.meta_group_size() == warpsPerBlock
warp.meta_group_rank() == warpId
```
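Putting the two together, here is a rough sketch of a block-level sum without cooperative groups. The helper name `blockReduceSum` is hypothetical (not from llm.c), and it assumes blockDim.x is a multiple of 32 and at most 1024 threads:
```cuda
// Hypothetical helper: block-wide sum with no cooperative groups.
// Assumes blockDim.x is a multiple of 32 and <= 1024 (at most 32 warps).
__device__ float blockReduceSum(float val) {
    __shared__ float warpSums[32];            // one partial sum per warp
    int laneId = threadIdx.x % 32;
    int warpId = threadIdx.x / 32;
    int warpsPerBlock = blockDim.x / 32;

    val = warpReduceSum(val);                 // reduce within each warp
    if (laneId == 0) warpSums[warpId] = val;  // lane 0 publishes its warp's sum
    __syncthreads();

    // the first warp reduces the per-warp partials
    val = (laneId < warpsPerBlock) ? warpSums[laneId] : 0.0f;
    if (warpId == 0) val = warpReduceSum(val);
    return val;                               // valid in warp 0; broadcast if all threads need it
}
```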
I have replaced most of the kernels to test for performance changes, and I was not able to see any noticeable difference from removing cooperative groups.
In many cases, I also find it quite convenient to just use a block size of 32 in the x direction and the rest in the y direction. Then threadIdx.x corresponds to laneId and threadIdx.y to warpId (see the sketch below). This doesn't work when the block naturally already uses the other block dims.
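For example, a minimal sketch of that layout (the kernel name and launch shape here are just illustrative):
```cuda
// Illustrative only: with a (32, N) block, x indexes the lane and y the warp.
__global__ void example_kernel(float* out, const float* in) {
    int laneId = threadIdx.x;  // blockDim.x == 32, so x is the lane within the warp
    int warpId = threadIdx.y;  // y is the warp index within the block
    // ... per-warp work using laneId / warpId ...
}

// launched with 32 lanes in x and, say, 4 warps per block in y (128 threads):
// example_kernel<<<numBlocks, dim3(32, 4)>>>(out, in);
```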
We use a lot of cooperative groups functionality in our kernels. This is an additional dependency that is mildly convenient, but it is likely that the code could be written without it, without too much added complexity, and just as fast. As a general principle, llm.c should ideally be very careful about the "dependency surface" of its code, which keeps it very portable, easy to skim/read even if slightly longer, and easy to run or port to any hardware: old, new, edge, exotic, or otherwise unthought of.
I would accept PRs that develop cooperative-groups-free kernels in `dev/cuda` that: 1) aren't too much more complex or more LOC, and 2) have the same speed. On top of `dev/cuda`, I'd be happy to merge these into "mainline" `train_gpt2.cu` and the fp32 version `train_gpt2fp32.cu`.