gpt2_forward adding CUDA streams with events for async layered operations, cache prefetching for efficient data access with high temporal locality #610
adding cuda streams with events for async layered operations
added offset precalculations and cache prefetching for efficient data access with high temporal locality
changes
cuda streams and events
four independent cuda streams: input copy, target copy, compute, loss
non-blocking streams overlap data transfers with computation and loss calculation
prioritized streams to minimize interference
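as a rough illustration of the stream layout listed above, here is a minimal sketch, not the code from this PR. the kernels `toy_forward` and `toy_loss`, the function `overlap_demo`, and all buffer names are hypothetical stand-ins; only the CUDA runtime calls are real. it assumes pinned host buffers, which async copies need in order to overlap with kernels.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s (%s:%d)\n", cudaGetErrorString(e_), __FILE__, __LINE__); \
    exit(EXIT_FAILURE); } } while (0)

// hypothetical stand-in for the forward pass: out[i] = in[i] + 1
__global__ void toy_forward(const int* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

// hypothetical stand-in for the loss: flags positions where out != target
__global__ void toy_loss(const int* out, const int* tgt, int* mismatch, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) mismatch[i] = (out[i] != tgt[i]) ? 1 : 0;
}

// returns the number of mismatches (0 when every dependency was honored)
int overlap_demo(int n) {
    assert(n > 0);
    size_t bytes = (size_t)n * sizeof(int);

    // numerically lower value = higher priority; compute gets the highest
    int least, greatest;
    CUDA_CHECK(cudaDeviceGetStreamPriorityRange(&least, &greatest));
    cudaStream_t s_in, s_tgt, s_compute, s_loss;
    CUDA_CHECK(cudaStreamCreateWithPriority(&s_in, cudaStreamNonBlocking, least));
    CUDA_CHECK(cudaStreamCreateWithPriority(&s_tgt, cudaStreamNonBlocking, least));
    CUDA_CHECK(cudaStreamCreateWithPriority(&s_compute, cudaStreamNonBlocking, greatest));
    CUDA_CHECK(cudaStreamCreateWithPriority(&s_loss, cudaStreamNonBlocking, least));

    // events carry the cross-stream dependencies; timing is not needed
    cudaEvent_t in_ready, tgt_ready, fwd_done;
    CUDA_CHECK(cudaEventCreateWithFlags(&in_ready, cudaEventDisableTiming));
    CUDA_CHECK(cudaEventCreateWithFlags(&tgt_ready, cudaEventDisableTiming));
    CUDA_CHECK(cudaEventCreateWithFlags(&fwd_done, cudaEventDisableTiming));

    int *h_in, *h_tgt, *h_mis, *d_in, *d_tgt, *d_out, *d_mis;
    CUDA_CHECK(cudaMallocHost((void**)&h_in, bytes));   // pinned host memory
    CUDA_CHECK(cudaMallocHost((void**)&h_tgt, bytes));
    CUDA_CHECK(cudaMallocHost((void**)&h_mis, bytes));
    CUDA_CHECK(cudaMalloc((void**)&d_in, bytes));
    CUDA_CHECK(cudaMalloc((void**)&d_tgt, bytes));
    CUDA_CHECK(cudaMalloc((void**)&d_out, bytes));
    CUDA_CHECK(cudaMalloc((void**)&d_mis, bytes));
    for (int i = 0; i < n; ++i) { h_in[i] = i; h_tgt[i] = i + 1; }

    // the two copies run on their own streams and can overlap each other
    CUDA_CHECK(cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, s_in));
    CUDA_CHECK(cudaEventRecord(in_ready, s_in));
    CUDA_CHECK(cudaMemcpyAsync(d_tgt, h_tgt, bytes, cudaMemcpyHostToDevice, s_tgt));
    CUDA_CHECK(cudaEventRecord(tgt_ready, s_tgt));

    // compute waits only for the inputs; the target copy may still be in flight
    int threads = 256, blocks = (n + threads - 1) / threads;
    CUDA_CHECK(cudaStreamWaitEvent(s_compute, in_ready, 0));
    toy_forward<<<blocks, threads, 0, s_compute>>>(d_in, d_out, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(fwd_done, s_compute));

    // loss needs both the forward output and the targets
    CUDA_CHECK(cudaStreamWaitEvent(s_loss, fwd_done, 0));
    CUDA_CHECK(cudaStreamWaitEvent(s_loss, tgt_ready, 0));
    toy_loss<<<blocks, threads, 0, s_loss>>>(d_out, d_tgt, d_mis, n);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpyAsync(h_mis, d_mis, bytes, cudaMemcpyDeviceToHost, s_loss));
    CUDA_CHECK(cudaStreamSynchronize(s_loss));  // everything upstream is done here

    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += h_mis[i];

    cudaFree(d_in); cudaFree(d_tgt); cudaFree(d_out); cudaFree(d_mis);
    cudaFreeHost(h_in); cudaFreeHost(h_tgt); cudaFreeHost(h_mis);
    cudaEventDestroy(in_ready); cudaEventDestroy(tgt_ready); cudaEventDestroy(fwd_done);
    cudaStreamDestroy(s_in); cudaStreamDestroy(s_tgt);
    cudaStreamDestroy(s_compute); cudaStreamDestroy(s_loss);
    return mismatches;
}
```

calling `overlap_demo(1 << 16)` from a `main` and building with `nvcc` should return 0. the events are what make the non-blocking streams safe: without `cudaStreamWaitEvent`, the kernels could start before their inputs have arrived.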
cache prefetching
prefetching offsets into cache for enhanced cpu-gpu data handling
high temporal locality hints
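the description does not name the prefetch mechanism, so the following is only a sketch under an assumption: that the hint is the GCC/Clang builtin `__builtin_prefetch` with locality level 3 (the "keep in all cache levels" setting) applied on the host side. the helper names `precompute_row_offsets` and `sum_rows_with_prefetch` and the macro `PREFETCH_READ_HOT` are hypothetical. this is plain host code, so it also builds with a regular C++ compiler.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>

// assumption: GCC/Clang host compiler; elsewhere the hint becomes a no-op
#if defined(__GNUC__) || defined(__clang__)
// args: address, rw (0 = read), locality (3 = high temporal locality)
#define PREFETCH_READ_HOT(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH_READ_HOT(addr) ((void)0)
#endif

// compute each row's starting offset once, instead of redoing the
// multiplication inside the hot loop
std::vector<size_t> precompute_row_offsets(size_t rows, size_t row_len) {
    std::vector<size_t> offsets(rows);
    for (size_t r = 0; r < rows; ++r) offsets[r] = r * row_len;
    return offsets;
}

// walk the rows, hinting the start of the next row while the current one
// is being consumed; the hint never changes the result, only the timing
long long sum_rows_with_prefetch(const int* data,
                                 const std::vector<size_t>& offsets,
                                 size_t row_len) {
    assert(data != nullptr);
    long long total = 0;
    for (size_t r = 0; r < offsets.size(); ++r) {
        if (r + 1 < offsets.size()) PREFETCH_READ_HOT(data + offsets[r + 1]);
        const int* row = data + offsets[r];
        for (size_t c = 0; c < row_len; ++c) total += row[c];
    }
    return total;
}
```

a prefetch is only a hint: the hardware may ignore it, and on simple sequential access the hardware prefetcher often does the same job already, so any gain needs to be measured rather than assumed.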
the goal is to cut the time spent waiting on memory transfers by running them asynchronously, overlapped with computation and loss calculation, and to lower cache miss rates by precomputing offsets and prefetching them with high temporal locality hints
the impact may not be as noticeable for small models
the code is heavily commented in the sections modified to both document and educate
testing
on a single NVIDIA A6000 GPU (sm_86), various runs showed reduced iteration times and higher tokens/sec. results may vary, and it would be great to get others' feedback
In the forward pass in gpt2_train.cu