turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Support non-Llama architectures #136

Closed · dred0n closed this issue 1 year ago

dred0n commented 1 year ago

exLlama saved GPTQ, I've gone from 6 tokens/s to over 40, thank you! Currently it only supports Llama-based models.

Here are a few other promising architectures: MPT, Falcon, Salesforce, StarCoder, and ChatGLM.

Are there plans to support these other architectures?

EyeDeck commented 1 year ago

Must admit I'm interested in Falcon 40B; a 4-bit quantized model is like 22.5GB, which might just fit on a 24GB card using an exllama-style approach. Currently it requires considerably more, including Transformers overhead. Notably, its use of multi-query attention, as opposed to multi-head attention like LLaMA, allegedly reduces the VRAM required for context by at least an order of magnitude, so (again, allegedly) 2048 context only requires about 240MB of VRAM.
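As a rough back-of-the-envelope check (a sketch, not exllama code; the dimensions below are ballpark LLaMA-30B-like and Falcon-40B-like shapes I'm assuming, not values read from any config):

```python
# Back-of-the-envelope KV-cache sizing. Dimensions are rough assumptions,
# not taken from the actual model configs.

def kv_cache_bytes(n_layers, seq_len, n_kv_heads, head_dim, bytes_per_elem=2):
    # Both keys and values are cached, hence the leading factor of 2; fp16 assumed.
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Multi-head attention: every head keeps its own K/V
# (assumed 60 layers, 52 heads of dim 128).
mha = kv_cache_bytes(n_layers=60, seq_len=2048, n_kv_heads=52, head_dim=128)

# Multi-query attention: only a handful of shared K/V heads
# (assumed 60 layers, 8 K/V heads of dim 64).
mqa = kv_cache_bytes(n_layers=60, seq_len=2048, n_kv_heads=8, head_dim=64)

print(f"MHA cache: ~{mha / 2**20:.0f} MiB, MQA cache: ~{mqa / 2**20:.0f} MiB")
# MHA cache: ~3120 MiB, MQA cache: ~240 MiB -- roughly an order of magnitude apart
```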

Then again, working out a way to use multi-query attention with LLaMA might be possible too. At least, some of the Meta researchers apparently think so. Appears to require relatively extensive retraining of the model, however—relative to what us plebs without access to megacorporate resources can afford, that is—so it might have to wait until Meta makes a move in that direction and puts out LLaMA v2 or whatever, if they ever do.

dred0n commented 1 year ago

I hope Meta decides to release a version 2, but I'm not counting on it anytime soon. Do you think exllama-style inference would be possible with a multi-query attention model like Falcon? If so, what would be involved in implementing the changes?

turboderp commented 1 year ago

I have no plans at the moment to support other architectures, other quantization methods, online APIs (ChatGPT?), or anything like that. I just don't have the time.

As for what would be involved in those changes, it would require studying the model and any existing implementations, writing a second code path for that model, testing it as much as possible, testing that it hasn't broken anything in the other code path, and then maintaining both going forward.
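Purely as a hypothetical illustration of what "a second code path" means in practice (none of these names exist in exllama; the structure is illustrative only):

```python
# Hypothetical per-architecture dispatch sketch. The names below do not come from
# exllama's codebase; they only illustrate why each new model adds a path that has
# to be written, tested, and maintained alongside the existing one.

from dataclasses import dataclass

@dataclass
class ModelConfig:
    arch: str          # e.g. "llama" or "falcon"
    num_heads: int
    num_kv_heads: int  # equals num_heads for multi-head attention, 1 (or a few) for multi-query

def llama_mha_forward(cfg, hidden_states):
    ...  # existing, well-tested multi-head attention path (placeholder)

def falcon_mqa_forward(cfg, hidden_states):
    ...  # new multi-query attention path: separate kernels, tests, maintenance (placeholder)

def attention_forward(cfg: ModelConfig, hidden_states):
    if cfg.arch == "llama":
        return llama_mha_forward(cfg, hidden_states)
    if cfg.arch == "falcon":
        return falcon_mqa_forward(cfg, hidden_states)
    raise ValueError(f"unsupported architecture: {cfg.arch}")
```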