turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Llava support? #357

Closed kinchahoy closed 3 months ago

kinchahoy commented 7 months ago

Hey folks,

I'd love to try doing a fast quantization of a LLaVA model or perhaps MOE-LLaVA. Will ExLlama work out of the box, or will I need to modify things?

I'm also interested in whether exllama's quantization techniques could help with IQ1_S quants (currently all the rage thanks to the ternary paper). Right now, going from 3-bit to 1.68-bit roughly quadruples perplexity, and the improved techniques here should help. I'd love to chat with anyone who might be looking at this.

turboderp commented 7 months ago

You'd need to implement CLIP, or rewrite ExLlamaV2 to integrate with Transformers. It's not something I currently have on the roadmap, but you're welcome to have a stab at it of course.

As for BitNet, it's not a quantization technique; it's a model architecture with binary/ternary quantization built in. As such, it requires pretraining models from scratch, which costs millions of dollars. If they ever release the 3B model they mention in their paper, it's probably worth adding support for it, in the hope that larger models will appear in the future (and that the architecture turns out to scale well enough). But until then it's not all that relevant.
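For context, a minimal sketch of what "quantization built into the architecture" means in the BitNet b1.58 setting: weights are constrained to {-1, 0, +1} during training via absmean scaling and a straight-through estimator, so the network learns under ternary weights rather than being quantized after the fact. The code below is illustrative only; it is not ExLlamaV2 code or the official BitNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TernaryLinear(nn.Module):
    """Illustrative BitNet-b1.58-style linear layer (hypothetical, simplified)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Full-precision "latent" weights are kept for the optimizer;
        # the forward pass only ever sees their ternary projection.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Absmean scaling, then round-and-clip to {-1, 0, +1}.
        scale = self.weight.abs().mean().clamp(min=1e-5)
        w_q = torch.clamp(torch.round(self.weight / scale), -1, 1) * scale
        # Straight-through estimator: forward uses w_q, gradients flow to self.weight.
        w = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w)
```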

alkeryn commented 6 months ago

@turboderp do you think it would be possible to train a BitNet model to approximate a non-BitNet model layer by layer?

So for example, take a Llama model and train a BitNet layer to approximate layer 1, then layer 2, and so on, until you have a crude approximation of the whole model.

Doing it per layer is most definitely not as good as a full training run on a dataset, but maybe it wouldn't be too terrible, especially with a final finetune afterwards.

turboderp commented 6 months ago

Layer-wise distillation has been studied a bunch, and it's possibly worth pursuing for binarization. I would expect it to take quite a lot of compute for a large model, though. And ultimately it rests on the assumption that the FP16 hidden states are useful targets for the BitNet layers.
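A rough sketch of what that layer-wise distillation could look like, purely illustrative: each ternary replacement layer is trained to reproduce the FP16 layer's output hidden states on a calibration set. The function and tensor names below are hypothetical, not part of ExLlamaV2 or any existing BitNet codebase, and real transformer layers would also need attention masks, position ids, and mini-batching.

```python
import torch
import torch.nn as nn

def distill_layer(fp16_layer: nn.Module,
                  ternary_layer: nn.Module,
                  hidden_states: torch.Tensor,
                  steps: int = 1000,
                  lr: float = 1e-4) -> torch.Tensor:
    """Train one ternary layer to mimic one FP16 layer (hypothetical setup).

    hidden_states: [n_samples, seq_len, d_model] inputs captured from the
    FP16 model on a calibration set. Returns the FP16 layer's outputs so the
    next layer can be distilled against "clean" FP16 targets.
    """
    with torch.no_grad():
        targets = fp16_layer(hidden_states)          # teacher outputs
    opt = torch.optim.AdamW(ternary_layer.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        preds = ternary_layer(hidden_states)         # student outputs
        loss = nn.functional.mse_loss(preds, targets)
        loss.backward()
        opt.step()
    return targets

# Usage sketch: walk the model layer by layer with distill_layer(), then
# optionally run a short end-to-end finetune of the assembled ternary model.
```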

aliencaocao commented 6 months ago

On the same goal here - would it be possible to just feed the embedding tensor from CLIP (taken from the original fp16 HF implementation) into exllamav2 running the LLM part of LLaVA?
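For illustration, here is roughly what that wiring looks like in a plain HF/PyTorch setup: CLIP vision features are run through a projector into the LLM's embedding space and concatenated with the text token embeddings, and the combined tensor is fed to the language model via `inputs_embeds`. This sketch uses only the Transformers API; how (or whether) the projected embeddings could be passed into exllamav2 instead is exactly the open question here, and the projector below is a stand-in, not the actual trained LLaVA module.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPImageProcessor, CLIPVisionModel)

# Assumed model choices for the sketch; LLaVA uses a CLIP ViT-L/336 vision tower.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336",
                                         torch_dtype=torch.float16)
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                           torch_dtype=torch.float16)

# Stand-in for LLaVA's trained mm_projector (CLIP hidden size -> LLM hidden size).
projector = torch.nn.Linear(vision.config.hidden_size, llm.config.hidden_size,
                            dtype=torch.float16)

@torch.no_grad()
def build_inputs_embeds(image, prompt: str) -> torch.Tensor:
    # 1. Encode the image with CLIP and project patch features into LLM embedding space.
    pixels = processor(images=image, return_tensors="pt").pixel_values.to(torch.float16)
    patch_feats = vision(pixels).last_hidden_state[:, 1:]    # drop the [CLS] token
    image_embeds = projector(patch_feats)                    # [1, n_patches, d_model]

    # 2. Embed the text prompt with the LLM's own embedding table.
    ids = tok(prompt, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(ids)            # [1, n_tokens, d_model]

    # 3. Put the image embeddings in front of the text embeddings.
    return torch.cat([image_embeds, text_embeds], dim=1)

# The combined embeddings can then be fed to the LLM, e.g.:
# out = llm.generate(inputs_embeds=build_inputs_embeds(img, "Describe the image."),
#                    max_new_tokens=64)
```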

turboderp commented 3 months ago

I'll close this here and refer to #399.