gpt2_forward adding CUDA streams with events for async layered operations, cache prefetching for efficient data access with high temporal locality

in the forward pass in gpt2_train.cu, this PR:

- adds CUDA streams with events for async layered operations
- adds offset precalculation and cache prefetching for efficient data access with high temporal locality

changes

CUDA streams and events
- four independent CUDA streams: input copy, target copy, compute, loss
- non-blocking streams, ordered by events, overlap data transfers with computation and loss calculation
- prioritized streams to minimize interference
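the stream layout above can be sketched roughly as follows. this is a minimal standalone illustration, not the code from the PR: the kernels `toy_forward` and `toy_loss`, the buffer names, the sizes, and the choice of which streams get the higher priority are all assumptions made for the example. only the CUDA runtime calls themselves are real API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the model's forward pass.
__global__ void toy_forward(const int* inputs, float* acts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) acts[i] = 0.5f * (float)inputs[i];
}

// Hypothetical stand-in for the loss kernel.
__global__ void toy_loss(const float* acts, const int* targets, float* losses, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float d = acts[i] - (float)targets[i];
        losses[i] = d * d;
    }
}

int main() {
    const int n = 1 << 20;
    const size_t ibytes = n * sizeof(int), fbytes = n * sizeof(float);

    // Pinned (page-locked) host buffers, so the async copies can overlap with kernels.
    int *h_inputs, *h_targets;
    CUDA_CHECK(cudaMallocHost((void**)&h_inputs, ibytes));
    CUDA_CHECK(cudaMallocHost((void**)&h_targets, ibytes));
    for (int i = 0; i < n; i++) { h_inputs[i] = i % 100; h_targets[i] = i % 50; }

    int *d_inputs, *d_targets;
    float *d_acts, *d_losses;
    CUDA_CHECK(cudaMalloc((void**)&d_inputs, ibytes));
    CUDA_CHECK(cudaMalloc((void**)&d_targets, ibytes));
    CUDA_CHECK(cudaMalloc((void**)&d_acts, fbytes));
    CUDA_CHECK(cudaMalloc((void**)&d_losses, fbytes));

    // Numerically lower values mean higher priority.
    int least, greatest;
    CUDA_CHECK(cudaDeviceGetStreamPriorityRange(&least, &greatest));

    // Four non-blocking streams. Giving compute and loss the higher priority
    // is an assumption of this sketch, not something the PR text specifies.
    cudaStream_t s_input, s_target, s_compute, s_loss;
    CUDA_CHECK(cudaStreamCreateWithPriority(&s_input, cudaStreamNonBlocking, least));
    CUDA_CHECK(cudaStreamCreateWithPriority(&s_target, cudaStreamNonBlocking, least));
    CUDA_CHECK(cudaStreamCreateWithPriority(&s_compute, cudaStreamNonBlocking, greatest));
    CUDA_CHECK(cudaStreamCreateWithPriority(&s_loss, cudaStreamNonBlocking, greatest));

    // Events express the dependencies between streams (timing disabled: cheaper).
    cudaEvent_t inputs_ready, targets_ready, forward_done;
    CUDA_CHECK(cudaEventCreateWithFlags(&inputs_ready, cudaEventDisableTiming));
    CUDA_CHECK(cudaEventCreateWithFlags(&targets_ready, cudaEventDisableTiming));
    CUDA_CHECK(cudaEventCreateWithFlags(&forward_done, cudaEventDisableTiming));

    // The two copies are issued on separate streams and may run concurrently.
    CUDA_CHECK(cudaMemcpyAsync(d_inputs, h_inputs, ibytes, cudaMemcpyHostToDevice, s_input));
    CUDA_CHECK(cudaEventRecord(inputs_ready, s_input));
    CUDA_CHECK(cudaMemcpyAsync(d_targets, h_targets, ibytes, cudaMemcpyHostToDevice, s_target));
    CUDA_CHECK(cudaEventRecord(targets_ready, s_target));

    const int block = 256, grid = (n + block - 1) / block;

    // The forward pass needs only the inputs, so it can start before targets land.
    CUDA_CHECK(cudaStreamWaitEvent(s_compute, inputs_ready, 0));
    toy_forward<<<grid, block, 0, s_compute>>>(d_inputs, d_acts, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(forward_done, s_compute));

    // The loss needs both the activations and the targets.
    CUDA_CHECK(cudaStreamWaitEvent(s_loss, forward_done, 0));
    CUDA_CHECK(cudaStreamWaitEvent(s_loss, targets_ready, 0));
    toy_loss<<<grid, block, 0, s_loss>>>(d_acts, d_targets, d_losses, n);
    CUDA_CHECK(cudaGetLastError());

    // Non-blocking streams do not sync with the default stream, so wait explicitly.
    CUDA_CHECK(cudaStreamSynchronize(s_loss));
    float loss1;
    CUDA_CHECK(cudaMemcpy(&loss1, d_losses + 1, sizeof(float), cudaMemcpyDeviceToHost));
    printf("losses[1] = %f\n", loss1);

    cudaEventDestroy(inputs_ready); cudaEventDestroy(targets_ready); cudaEventDestroy(forward_done);
    cudaStreamDestroy(s_input); cudaStreamDestroy(s_target);
    cudaStreamDestroy(s_compute); cudaStreamDestroy(s_loss);
    cudaFree(d_inputs); cudaFree(d_targets); cudaFree(d_acts); cudaFree(d_losses);
    cudaFreeHost(h_inputs); cudaFreeHost(h_targets);
    return 0;
}
```

the point of the events is that the host never blocks between steps: each stream waits only on the work it actually depends on, so the target copy can still be in flight while the forward kernel runs.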
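cache prefetching
- precalculated offsets are prefetched into cache ahead of use, to speed up CPU-GPU data handling
- prefetches carry high temporal locality hints, asking the hardware to keep the data in cache for reuse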
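the PR text does not say which prefetch mechanism is used. one common way to express a "high temporal locality" hint in host code is the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)` with `locality = 3`, and the sketch below assumes that. the function `sum_rows`, the `row_offset` table, and the batch shape are hypothetical names made up for the example. it is host-only code of the kind that can sit in a .cu file.

```cuda
#include <cstdio>
#include <vector>

// __builtin_prefetch is a GCC/Clang builtin; fall back to a no-op elsewhere.
// Arguments: address, 0 = prefetch for read, 3 = high temporal locality
// (ask for the line to be kept in all cache levels).
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ_HIGH_LOCALITY(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH_READ_HIGH_LOCALITY(addr) ((void)0)
#endif

// Hypothetical host-side pass over a (B, T) token batch.
long long sum_rows(const int* tokens, int B, int T) {
    // Precalculate each row's starting offset once, outside the hot loop.
    std::vector<size_t> row_offset(B);
    for (int b = 0; b < B; b++) row_offset[b] = (size_t)b * (size_t)T;

    long long total = 0;
    for (int b = 0; b < B; b++) {
        // Hint that the start of the next row will be read soon and reused.
        if (b + 1 < B) PREFETCH_READ_HIGH_LOCALITY(tokens + row_offset[b + 1]);
        const int* row = tokens + row_offset[b];
        for (int t = 0; t < T; t++) total += row[t];
    }
    return total;
}

int main() {
    const int B = 4, T = 8;
    std::vector<int> tokens(B * T);
    for (int i = 0; i < B * T; i++) tokens[i] = i;
    printf("sum = %lld\n", sum_rows(tokens.data(), B, T));
    return 0;
}
```

a prefetch is only a hint: it never changes the result, and whether it helps depends on the access pattern and the hardware prefetcher, so it is worth measuring with and without it.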

the goal is to improve performance in two ways: cut the time spent waiting on memory transfers by overlapping them asynchronously with computation and loss calculation, and reduce cache miss rates by optimizing data access patterns and using high temporal locality hints

the impact may not be as noticeable for small models

the modified sections are heavily commented, both to document the changes and to educate

testing

on a single sm_86 NVIDIA GPU (A6000), various runs showed performance increases: reduced iteration times and higher tokens/sec. results may vary. would be great to get others' feedback

karpathy/llm.c, pull request #610