turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Is it too much to ask for an MPI option like llama.cpp? #286

Closed: hiqsociety closed this issue 9 months ago

hiqsociety commented 9 months ago

I've always been looking for the optimal (cheapest) way to run large models. I'm kind of tired of going to extremes, because then I need to "upgrade" and my other devices become "obsolete". Is an MPI option on the roadmap? I would really hope to see it happen.

Thanks in advance for the great work, by the way.

turboderp commented 9 months ago

This may or may not be a stupid question, but what is MPI?

hiqsociety commented 9 months ago

@turboderp it's a stupid question if you try this on a Raspberry Pi cluster, like this: https://github.com/ggerganov/llama.cpp/issues/2164

Basically, MPI enables clustering for Llama models.
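For context, a minimal sketch of the idea (not exllama's or llama.cpp's actual implementation): pipeline-style sharding over MPI with mpi4py, where each rank holds a slice of the model's layers and passes activations to the next rank. The `run_my_layers` function and the tensor shape below are hypothetical placeholders.

```python
# Minimal sketch of MPI pipeline sharding (assumes mpi4py and torch are installed;
# run with e.g. `mpiexec -n 4 python pipeline_sketch.py`).
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def run_my_layers(hidden):
    # Placeholder: each rank would load and run only its own slice of the
    # transformer blocks here.
    return hidden

# Rank 0 produces the initial hidden state; every other rank waits for the
# previous stage's activations.
if rank == 0:
    hidden = torch.zeros(1, 2048, 4096)
else:
    hidden = comm.recv(source=rank - 1)

hidden = run_my_layers(hidden)

if rank < size - 1:
    comm.send(hidden, dest=rank + 1)  # hand activations to the next stage
else:
    print("final hidden state shape:", tuple(hidden.shape))
```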

  1. But it's a serious question for people like me who will be using this a lot!

  2. Do you have an explanation for why you call this project ExLlama when it's obviously the fastest way to run Llama? I was always looking at llama.cpp until I focused on speed and came across ExLlama. I ask because "ex" sounds obsolete. You should change the name to ProLlama, eLlama, or LlamaX. (When I saw "Ex..." I thought the project was obsolete and not to be taken seriously, so I skipped it, until my research kept turning up ExLlama with its impressive tokens/s results. Normally a GitHub project with "ex" in the name means it is abandoned and no longer maintained. And seriously, it sounds like something from before Llama existed: pre-Llama.)

  3. By the way, MPI "must" be the way to go, because... who doesn't want to run Falcon 180B? I guess models will only get bigger; maybe 960B by next year? Please help us avoid spending fortunes on cloud...

Again, just my thoughts.

turboderp commented 9 months ago

I don't know, ExLlama is really focused on consumer GPUs. This would be asking for a complete rewrite so it can run on clusters of embedded devices instead. And it basically boils down to "can this project be llama.cpp instead?" So, I don't really think this is realistic.

As for the name, I didn't really give it much thought. Doesn't have those connotations to me, is all I can say I guess. Think of it as "extra" maybe?

And it's not categorically the fastest way to run Llama, either. It really depends on the use case.

hiqsociety commented 9 months ago

@turboderp

  1. Basically, running solely on GPU VRAM is fine, but the ability to distribute the work in a "clusterized/sharded" form (across consumer GPUs, etc.) is what I'm asking for.

  2. The speed improvement is quite evident for long-running processes. I just saw v2 and am extremely impressed with the direction and the performance optimizations! Please do consider MPI (GPU only) as part of the roadmap.

  3. What's the fastest way to run Llama for consumers, then? I thought most benchmarks say ExLlama is the way to go.

P.S.: v2 is impressive. You guys are doing great.

turboderp commented 9 months ago

> What's the fastest way to run Llama for consumers, then? I thought most benchmarks say ExLlama is the way to go.

Benchmarks tend to become outdated very quickly. When ExLlama first came out there was no CUDA support in llama.cpp at all, for instance, AutoGPTQ didn't exist, and GPTQ-for-Llama was still using essentially the same kernel written for the original GPTQ paper. Since then, llama.cpp has had a huge amount of work put into it, AutoGPTQ has included the ExLlama (v1) kernel, and there's also AWQ, vLLM... something called OmniQuant...?

ExLlama is definitely a fast option, and depending on what you need to do, what your hardware setup is, etc., it may be the fastest in your case. If you want to run an inference server for an online chat service, you should probably look at TGI or vLLM or something. If you want to run on Apple Silicon, llama.cpp is (I think?) the only way to go. If you have an older NVIDIA GPU (Pascal or earlier), AutoGPTQ is probably still the best option. So it all depends.