ChrisDryden opened 6 months ago
This came up in discussion regarding the possibility of precomputing all of the constants and passing them into the kernel directly, such as the following values: https://github.com/karpathy/llm.c/blob/master/train_gpt2.cu#L689
Wouldn't necessarily add more lines of code; it would just reorganize where the calculations are done. From a theoretical standpoint this should speed things up, since it reduces the number of calculations by a factor of how many kernels use them.
👍🏻
Created an example implementation here: https://github.com/karpathy/llm.c/pull/459, but it doesn't seem to be working properly.
Supposedly the permutation kernels, even though they are mostly memory-bound, can reduce the amount of division and apply thread coarsening by using a 2D or 3D grid, so that no division has to happen in the kernel itself.
Looking into this on the advice of @ngc92:
Creating this issue to track progress.