karpathy / llm.c

LLM training in simple, raw C/CUDA

gpt2_forward adding CUDA streams with events for async layered operations, cache prefetching for efficient data access with high temporal locality #610

Open bgorlick opened 1 week ago

bgorlick commented 1 week ago

This PR modifies the forward pass (gpt2_forward) in train_gpt2.cu.

Changes

The goal is to improve performance by reducing time spent waiting on memory transfers (via asynchronous operations) and by optimizing data access patterns to lower cache miss rates. Concretely, data transfers are overlapped with computation and the loss calculation, and high-temporal-locality hints are used to improve cache efficiency and execution speed. Rough sketches of both ideas are shown below.
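For context, here is a minimal sketch of the streams-plus-events pattern, not the PR's actual diff. The stream and event names, the `cudaCheck` error-check macro usage, the `model->targets` buffer, and the `fused_classifier` launch are illustrative assumptions based on llm.c's existing structure, and `B`, `T`, `model`, and `targets` are assumed to be in scope inside the forward pass:

```c
// Illustrative sketch: overlap the host-to-device copy of the targets with
// forward-pass kernels, then gate the loss kernel on an event.
cudaStream_t compute_stream, copy_stream;
cudaEvent_t copy_done;
cudaCheck(cudaStreamCreate(&compute_stream));
cudaCheck(cudaStreamCreate(&copy_stream));
cudaCheck(cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming));

// enqueue the async copy of targets on its own stream
cudaCheck(cudaMemcpyAsync(model->targets, targets, B * T * sizeof(int),
                          cudaMemcpyHostToDevice, copy_stream));
cudaCheck(cudaEventRecord(copy_done, copy_stream));

// ... launch forward-pass kernels that do not need targets on compute_stream ...

// before the loss/classifier kernel, make compute_stream wait on the copy
cudaCheck(cudaStreamWaitEvent(compute_stream, copy_done, 0));
// fused_classifier<<<grid, block, 0, compute_stream>>>(...);
```

Note that for the copy to actually overlap with compute, the host-side buffer has to be pinned (e.g. allocated with cudaMallocHost or registered with cudaHostRegister); otherwise cudaMemcpyAsync falls back to a synchronous transfer.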

The impact may not be as noticeable for small models.
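For the cache-prefetching / temporal-locality side, one plausible host-side form (again an illustrative sketch under assumptions, not the PR's exact code) is GCC/Clang's `__builtin_prefetch` with locality hint 3 (high temporal locality) applied to the per-token losses copied back from the GPU. The `cpu_losses` and `mean_loss` fields mirror llm.c's existing struct members, and the prefetch distance of 16 floats is an arbitrary choice:

```c
// Illustrative sketch: software-prefetch ahead in the loss reduction loop.
// __builtin_prefetch(addr, rw, locality): rw = 0 means read-only access,
// locality = 3 asks for high temporal locality (keep the data in cache).
float mean_loss = 0.0f;
for (int i = 0; i < B * T; i++) {
    if (i + 16 < B * T) {
        __builtin_prefetch(&model->cpu_losses[i + 16], 0, 3);
    }
    mean_loss += model->cpu_losses[i];
}
mean_loss /= B * T;
model->mean_loss = mean_loss;
```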

The modified sections are heavily commented, both to document the changes and to educate.

Testing

On a single sm_86 NVIDIA GPU (an RTX A6000), I measured performance improvements across several runs: reduced iteration times and higher tokens/sec. Results may vary, so it would be great to get feedback from others.