karpathy / llm.c

LLM training in simple, raw C/CUDA
MIT License

2D and 3D tile divisions so that permutation coordinates can be read from threadIdx and blockIdx #406

Open ChrisDryden opened 4 months ago

ChrisDryden commented 4 months ago

Supposedly the permutation kernels, even though they are mostly memory-bound, can reduce the amount of division and do thread coarsening by using a 2D or 3D grid, so that no division is needed in the kernel itself

Looking into this from the advice of @ngc92:

integer divisions are really expensive, but I don't think they will matter much in a kernel as memory-bound as this. I guess the first thing to do would be some thread coarsening, so that the divisions are amortized, and possibly a 2D or 3D grid, so that you don't even have to do the divisions at all, and can just read off individual coordinates from threadIdx and blockIdx. 

Creating this issue to track progress on this

ChrisDryden commented 4 months ago

Where this came up in discussion was regarding the possibility of adding all of the constants that can be passed into the kernel directly, such as the following values: https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu#L689

Wouldn't necessarily add more lines of code, just reorganize where the calculations are done. From a theoretical standpoint this should speed things up, since it reduces the number of calculations by a factor of how many kernels use them

Karliz24 commented 4 months ago

👍🏻

ChrisDryden commented 4 months ago

Created an example implementation here: https://github.com/karpathy/llm.c/pull/459, but it doesn't seem to be working properly yet