srush / llama2.rs

A fast llama2 decoder in pure Rust.
MIT License

[wip] Cuda #26

Closed: srush closed this issue 1 year ago

srush commented 1 year ago

Playing around with a Triton kernel for GPTQ. I was able to get a version working, but it was too slow. I'll probably get around to finishing this next week.

rachtsingh commented 1 year ago

I think if you're willing to pin to CUDA 12.2 (probably not advisable, I guess), there's now mmap-to-GPU support, which might help avoid the manual copies: https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
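
Roughly, per that blog post, HMM means a plain `mmap`'d host pointer can be dereferenced directly from a kernel, with the driver faulting pages onto the GPU on demand (needs CUDA 12.2+, the open-kernel-module driver, and a supported Linux kernel). A minimal sketch of the pattern; the file name, kernel, and single-thread reduction are illustrative, not from this repo:

```cuda
// Sketch: sum a weight file on the GPU without cudaMemcpy, assuming HMM.
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

__global__ void sum_weights(const float* w, size_t n, float* out) {
    // One-thread sketch; a real kernel would do a parallel reduction.
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) acc += w[i];
    *out = acc;
}

int main() {
    int fd = open("weights.bin", O_RDONLY);  // hypothetical weight file
    struct stat st;
    fstat(fd, &st);
    // No cudaHostRegister / cudaMemcpy: with HMM the GPU can fault in
    // file-backed pages behind this ordinary mmap pointer.
    float* w = (float*)mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    float* out;
    cudaMallocManaged(&out, sizeof(float));
    sum_weights<<<1, 1>>>(w, st.st_size / sizeof(float), out);
    cudaDeviceSynchronize();
    printf("sum = %f\n", *out);
    munmap(w, st.st_size);
    close(fd);
    return 0;
}
```

Without HMM, the same program would need to stage the file through a `cudaMalloc`'d buffer and an explicit copy, which is exactly the step this would let the decoder skip.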

srush commented 1 year ago

Oh that's cool. Maybe I'll try that out.