rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models
https://docs.rs/llm/latest/llm/
Apache License 2.0

CUDA decoding #315

Closed jafioti closed 1 year ago

jafioti commented 1 year ago

Hey all, great work on integrating cuda support for the prompt tokens. How much work would it be to support GPU decoding? Currently on llama.cpp I can reach about 35 tokens per second on llama 7B on a 2080 super, and I'd love to reach somewhere near that in rust!

Please lmk if there's anything I can do to help this effort.

LLukas22 commented 1 year ago

I'm currently working on adding CUDA acceleration and it's already in place for the LLama architecture. If you're interested in giving it a go, you can check out the branch I'm working on here: https://github.com/LLukas22/llm/tree/cublas-clblast-support

For a test drive, here's a command you can use:

cargo run --release --features cublas -- llama infer -m "C:\Users\lkreu\Downloads\wizardlm-30b.ggmlv3.q4_1.bin" --accelerator-layers 40 --batch-size 512  -p "Write me a short story about a llama riding a crab:"

Now, I'm in the process of implementing acceleration for the other architectures. But, there's a hiccup: some GGML operations don't have CUDA support, which means I have to run some parts on the CPU. My goal is to work around this without having to tweak the existing model implementations. If you've got some ideas on how to tackle this, I'm all ears!
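To make that concrete, what I have in mind is roughly the following (a self-contained sketch with made-up Op / Tensor / backend types, not the real GGML bindings):

// Sketch only: Op, Tensor and the backends are made-up stand-ins, not the real GGML bindings.
#[derive(Clone, Copy)]
enum Op { MatMul, RmsNorm, Rope, Alibi }

struct Tensor; // stand-in for a ggml tensor

trait Backend {
    fn supports(&self, op: Op) -> bool;
    fn run(&self, op: Op, input: &Tensor) -> Tensor;
}

struct Cuda;
struct Cpu;

impl Backend for Cuda {
    // Only a subset of ops has CUDA kernels today.
    fn supports(&self, op: Op) -> bool { matches!(op, Op::MatMul | Op::RmsNorm | Op::Rope) }
    fn run(&self, _op: Op, _input: &Tensor) -> Tensor { Tensor }
}

impl Backend for Cpu {
    fn supports(&self, _op: Op) -> bool { true }
    fn run(&self, _op: Op, _input: &Tensor) -> Tensor { Tensor }
}

// Run an op on the GPU when a kernel exists, otherwise fall back to the CPU.
// In the real thing the fallback also means copying tensors between devices,
// which is exactly what I'd like to hide from the model implementations.
fn eval(op: Op, input: &Tensor, gpu: &Cuda, cpu: &Cpu) -> Tensor {
    if gpu.supports(op) { gpu.run(op, input) } else { cpu.run(op, input) }
}

fn main() {
    let (gpu, cpu) = (Cuda, Cpu);
    let x = Tensor;
    let _y = eval(Op::Alibi, &x, &gpu, &cpu); // no CUDA kernel for Alibi here -> CPU path
}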

jafioti commented 1 year ago

@LLukas22 This is awesome, thanks for the link. I tried it out on my GPU, and it's a lot faster than pure CPU inference. It does seem to be quite a bit slower than llama.cpp (maybe a quarter of the speed, will run measurements). Is it because it's doing CPU sync after every token to run the callback function?

jafioti commented 1 year ago

Some quick stats (model: llama-7B):

llama.cpp - 23.57 ms per token
llm cublas-clblast-support - 86.39 ms per token

llama.cpp stat output:

llama_print_timings:        load time =   971.34 ms
llama_print_timings:      sample time =   174.34 ms /   424 runs   (    0.41 ms per token)
llama_print_timings: prompt eval time =   187.43 ms /    10 tokens (   18.74 ms per token)
llama_print_timings:        eval time =  9970.32 ms /   423 runs   (   23.57 ms per token)
llama_print_timings:       total time = 10414.72 ms

llm cublas-clblast-support stat output:

feed_prompt_duration: 209ms
prompt_tokens: 10
predict_duration: 9158ms
predict_tokens: 106
per_token_duration: 86.396ms
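(In tokens per second that is roughly 1000 / 23.57 ≈ 42 tokens/s for llama.cpp versus 1000 / 86.4 ≈ 11.6 tokens/s here, i.e. about a 3.7x gap, which matches the 'quarter of the speed' estimate above.)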

LLukas22 commented 1 year ago

Well, it's still in a pre-draft stage. I'm guessing the creation of a new eval context on each call kills the performance, but that's something to optimize once we get acceleration working for all models. There was also a lot of work done on the metal branch, which will probably also help us close the gap a bit.
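In pseudo-Rust the suspicion looks roughly like this (EvalContext is a made-up stand-in for the ggml scratch buffers / GPU state, not the crate's actual API):

// Illustration only: EvalContext is hypothetical, it just stands in for whatever
// owns the scratch buffers and GPU state.
struct EvalContext;

impl EvalContext {
    fn new() -> Self { EvalContext }                    // expensive: allocations, GPU setup
    fn eval_token(&mut self, _token: u32) -> u32 { 0 }  // cheap once the context exists
}

// If a fresh context really is created per decoded token (my guess above):
fn decode_slow(tokens: &[u32]) -> Vec<u32> {
    tokens.iter().map(|&t| EvalContext::new().eval_token(t)).collect()
}

// What we want instead: build the context once and reuse it for every token.
fn decode_fast(tokens: &[u32]) -> Vec<u32> {
    let mut ctx = EvalContext::new();
    tokens.iter().map(|&t| ctx.eval_token(t)).collect()
}

fn main() {
    let tokens = [1_u32, 2, 3];
    assert_eq!(decode_slow(&tokens), decode_fast(&tokens));
}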

jafioti commented 1 year ago

@LLukas22 Understandable, is there anything I can do to help this along?

LLukas22 commented 1 year ago

I'm currently waiting on @philpax to review the metal PR. If that gets merged we can start to integrate CUDA acceleration. Until then, we could think about how to support architectures that use functions which aren't yet CUDA-accelerated, or we could start implementing those functions as CUDA kernels in GGML/llama.cpp.

malv-c commented 1 year ago

Can I expect to test it on my Orin AGX 32G as an EFI app?

LLukas22 commented 1 year ago

@malv-c I don't know what you mean by an EFI app. But if you can set up CUDA you should be able to compile it for an ARM-based system. Maybe we'd have to adjust the build.rs script to enable building with CUDA acceleration on ARM, though 🤔
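Something along these lines might already do it (a sketch, not the branch's actual build.rs; the CUDA paths are assumptions about a typical Jetson / aarch64 install):

// build.rs sketch: pick the CUDA library path per target architecture so that
// building with --features cublas also links on ARM (e.g. a Jetson board).
fn main() {
    let target_arch = std::env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();

    // CARGO_FEATURE_CUBLAS is set by cargo when the "cublas" feature is enabled.
    if std::env::var("CARGO_FEATURE_CUBLAS").is_ok() {
        let cuda_lib_dir = match target_arch.as_str() {
            // Assumed location on aarch64 (Jetson-style CUDA installs).
            "aarch64" => "/usr/local/cuda/targets/aarch64-linux/lib",
            _ => "/usr/local/cuda/lib64",
        };
        println!("cargo:rustc-link-search=native={cuda_lib_dir}");
        println!("cargo:rustc-link-lib=cublas");
        println!("cargo:rustc-link-lib=cudart");
    }

    println!("cargo:rerun-if-changed=build.rs");
}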

malv-c commented 1 year ago

An EFI app, to run the LLM seriously without wasting resources on poor Ubuntu: https://github.com/rust-osdev/uefi-rs

jafioti commented 1 year ago

@malv-c Still not sure how this relates to running language models. Why would uefi calls help speed up LLMs?

malv-c commented 1 year ago

An LLM without loading an OS is better than an LLM + OS.

jafioti commented 1 year ago

That's impractical.

  • Latency / compute constraints don't come from the OS, but from the model size / CUDA kernels / CPU speed.
  • CUDA is usually tightly integrated with the OS as well, so you wouldn't be able to use the GPU.
  • LLMs aren't the only things running; there needs to be some way to interact with the LLM, usually a web server queuing up prompts or some terminal interface. How would that work without an OS?

LLukas22 commented 1 year ago

@malv-c I agree with @jafioti, the scope of this project will be to provide a good, fast, and easy-to-use LLM library. If you want, you can tinker around a bit and see if you can get it running as an EFI app, but I think the GGML backend we are using will be very problematic to get running correctly.

malv-c commented 1 year ago

Hi Joe, I strongly disagree, but as I don't know Rust I won't output code for now. If I find a usable LLM ...

malv-c commented 1 year ago

Hi Lukas, not simple, I agree. Thanks.

LLukas22 commented 1 year ago

Implemented in https://github.com/rustformers/llm/pull/325