turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Chameleon support #515

Open · end-me-please opened this issue 1 week ago

end-me-please commented 1 week ago

https://ai.meta.com/blog/meta-fair-research-new-releases/

Any idea if this could be supported in exl2 anytime soon?

turboderp commented 1 week ago

I don't know yet.

The text model is probably straightforward, since supposedly it's very similar to Llama, just with q/k norms, which are already supported for other architectures. For the vision side, they're only releasing the weights for image input, and all the code for embedding images is provided, so that's probably doable as well. The main hurdle is integrating it into the API in a sensible way when you have a bunch of caching behavior to account for ("image tokens" is a bit of a misnomer; they're actually embeddings).
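
For anyone unfamiliar with what "q/k norms" means here, a minimal PyTorch sketch of the delta relative to vanilla Llama attention. The per-head LayerNorm placement matches how Chameleon is usually described, but the class, names, and dimensions are illustrative assumptions, not exllamav2 or Chameleon source:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    # Illustrative sketch only: sizes and norm placement are assumptions.
    # RoPE and KV caching are omitted for brevity.
    def __init__(self, hidden_size: int = 4096, num_heads: int = 32, eps: float = 1e-5):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        # The difference vs. plain Llama attention: per-head norms on q and k
        self.q_norm = nn.LayerNorm(self.head_dim, eps=eps)
        self.k_norm = nn.LayerNorm(self.head_dim, eps=eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim)
        k = self.k_proj(x).view(b, t, self.num_heads, self.head_dim)
        v = self.v_proj(x).view(b, t, self.num_heads, self.head_dim)
        # Normalize queries and keys per head before attention
        q = self.q_norm(q).transpose(1, 2)
        k = self.k_norm(k).transpose(1, 2)
        v = v.transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```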

Either way, I'd prefer not to spend too much time on it until there's a HF implementation, since I don't want to have a bunch of separate workflows for quantizing models in various custom, raw PyTorch formats.

Ph0rk0z commented 1 week ago

The image part is a VQGAN. It would work both ways (image input and output) if the text model were made to output image tokens.
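
To illustrate the point: in a VQGAN, "image tokens" are just indices into a learned codebook, so a text model that can emit those indices could in principle drive the decoder too. A toy sketch of the quantization step, with the codebook size and latent dimension made up for the example (not Chameleon's actual tokenizer):

```python
import torch
import torch.nn as nn

class ToyVQCodebook(nn.Module):
    # Toy illustration, not Chameleon's real VQGAN: num_codes and
    # code_dim are invented values for the example.
    def __init__(self, num_codes: int = 8192, code_dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def encode(self, z: torch.Tensor) -> torch.Tensor:
        # z: (num_latents, code_dim) continuous encoder output.
        # Each latent maps to the index of its nearest codebook entry,
        # which is the discrete "image token" a text model could emit.
        dists = torch.cdist(z, self.codebook.weight)  # (num_latents, num_codes)
        return dists.argmin(dim=-1)

    def decode(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (num_latents,) integer indices -> quantized latents,
        # which the VQGAN decoder would turn back into pixels.
        return self.codebook(tokens)
```

Going text-to-image would then mean sampling these indices autoregressively from the language model and feeding the decoded latents to the VQGAN decoder.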