srush closed this 1 year ago
I think if you're willing to pin to CUDA 12.2 (probably not advisable, I guess), there's now mmap-to-GPU support through Heterogeneous Memory Management (HMM), which might help avoid the manual copies: https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/
Oh that's cool. Maybe I'll try that out.
Playing around with a CUDA kernel for GPTQ in Triton. I was able to get a version working, but it was too slow. Will probably get around to finishing this next week.