karpathy opened this issue 2 months ago
Just posting some notes here from my research on how to remove all of the cooperative groups (CG) related code and drop the dependency:
```cuda
sum = cg::reduce(warp, sum, cg::plus<float>{});
```
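(For context, these notes assume roughly the usual cooperative groups setup in the llm.c kernels:)
```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// inside a kernel:
cg::thread_block block = cg::this_thread_block();
cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
```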
This `cg::reduce` call can be replaced with the following:
```cuda
// Butterfly reduction: XOR shuffles sum the value across all 32 lanes of a warp.
__device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_xor_sync(0xFFFFFFFF, val, offset);
    }
    return val;
}
```
and called as:
```cuda
sum = warpReduceSum(sum);
```
No extra thread syncs are needed: the `__shfl_xor_sync` intrinsic already synchronizes the lanes named in its mask.
The other cooperative groups variables that are used can be replaced as follows:
```cuda
int warpSize = 32; // note: this shadows the CUDA built-in warpSize
int laneId = threadIdx.x % warpSize;
int warpId = threadIdx.x / warpSize;
int warpsPerBlock = blockDim.x / warpSize;
```
with the correspondences:
```cuda
warp.thread_rank()     == laneId
warp.size()            == warpSize
warp.meta_group_size() == warpsPerBlock
warp.meta_group_rank() == warpId
```
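Putting the two together, here is a rough sketch of a block-level sum without cooperative groups. The helper name `blockReduceSum` is hypothetical (not from llm.c), and it assumes blockDim.x is a multiple of 32 and at most 1024 threads:
```cuda
// Hypothetical helper: block-wide sum with no cooperative groups.
// Assumes blockDim.x is a multiple of 32 and <= 1024 (at most 32 warps).
__device__ float blockReduceSum(float val) {
    __shared__ float warpSums[32];            // one partial sum per warp
    int laneId = threadIdx.x % 32;
    int warpId = threadIdx.x / 32;
    int warpsPerBlock = blockDim.x / 32;

    val = warpReduceSum(val);                 // reduce within each warp
    if (laneId == 0) warpSums[warpId] = val;  // lane 0 publishes its warp's sum
    __syncthreads();

    // the first warp reduces the per-warp partials
    val = (laneId < warpsPerBlock) ? warpSums[laneId] : 0.0f;
    if (warpId == 0) val = warpReduceSum(val);
    return val;                               // valid in warp 0; broadcast if all threads need it
}
```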
I have replaced most of the kernels to test for performance changes, and I was not able to see any noticeable difference from removing cooperative groups.
In many cases, I also find it quite convenient to just use a block size of 32 in the x direction and the rest in the y direction. Then threadIdx.x corresponds to laneId and threadIdx.y to warpId (see the sketch below). This doesn't work when the block naturally already uses the other block dims.
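For example, a minimal sketch of that layout (the kernel name and launch shape here are just illustrative):
```cuda
// Illustrative only: with a (32, N) block, x indexes the lane and y the warp.
__global__ void example_kernel(float* out, const float* in) {
    int laneId = threadIdx.x;  // blockDim.x == 32, so x is the lane within the warp
    int warpId = threadIdx.y;  // y is the warp index within the block
    // ... per-warp work using laneId / warpId ...
}

// launched with 32 lanes in x and, say, 4 warps per block in y (128 threads):
// example_kernel<<<numBlocks, dim3(32, 4)>>>(out, in);
```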
We use a lot of cooperative groups functionality in our kernels. This is an additional dependency that is mildly convenient, but it is likely that the code could be written without it, without too much added complexity, and just as fast. As a general principle, llm.c should ideally be very careful about the "dependency surface" of its code, which keeps it very portable, easy to skim/read even if slightly longer, and easy to run or port to any hardware: old, new, edge, exotic, or otherwise unthought of.
I would accept PRs that develop cooperative-groups-free kernels in `dev/cuda` that: 1) aren't too much more complex or more LOC, and 2) have the same speed. On top of `dev/cuda`, I'd be happy to merge these into "mainline" `train_gpt2.cu` and the fp32 version `train_gpt2fp32.cu`.