Open end-me-please opened 1 week ago
I don't know yet.
The text model is probably straightforward, since it's supposedly very similar to Llama, just with q/k norms, which are already supported for other architectures. For the vision side, they're only releasing weights for image input, and all the code for embedding images is provided, so that's probably doable as well. The main hurdle is integrating it into the API in a sensible way when there's a bunch of caching behavior to account for ("image tokens" is a bit of a misnomer; they're actually embeddings.)
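For reference, "q/k norms" usually means applying a norm (e.g. RMSNorm) to the query and key projections inside attention before the dot product. Here's a minimal sketch of what that looks like on top of a plain Llama-style attention block; all names, shapes, and hyperparameters are illustrative, not taken from the released code (rotary embeddings and masking omitted for brevity):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        var = x.pow(2).mean(-1, keepdim=True)
        return x * torch.rsqrt(var + self.eps) * self.weight

class QKNormAttention(nn.Module):
    """Illustrative attention block; the qk-norm lines are the only
    departure from a plain Llama-style block."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)
        # Per-head norms applied to q and k before the dot product.
        self.q_norm = RMSNorm(self.head_dim)
        self.k_norm = RMSNorm(self.head_dim)

    def forward(self, x):
        b, t, d = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim)
        k = self.wk(x).view(b, t, self.n_heads, self.head_dim)
        v = self.wv(x).view(b, t, self.n_heads, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)        # <-- the q/k norms
        att = torch.einsum("bqhd,bkhd->bhqk", q, k) / self.head_dim**0.5
        out = torch.einsum("bhqk,bkhd->bqhd", att.softmax(-1), v)
        return self.wo(out.reshape(b, t, d))
```

Since the norm sits inside the attention op, supporting it is mostly a per-architecture flag rather than a structural change, which is presumably why it's already handled elsewhere.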
Either way, I'd prefer not to spend too much time on it until there's an HF implementation, since I don't want to maintain a bunch of separate workflows for quantizing models in various custom, raw PyTorch formats.
The image part is a VQGAN. It would work both ways (image output as well as input) if the text model were made to output image tokens.
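To illustrate why "image tokens" and embeddings blur together here: in a VQGAN-style tokenizer, each continuous encoder feature is snapped to its nearest codebook entry, yielding a discrete index (the "token") that is just a handle for a codebook embedding. A toy sketch, with made-up codebook size and dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical codebook: 1024 image "tokens", each a 64-dim embedding.
codebook = rng.standard_normal((1024, 64))

def quantize(features):
    """Snap (n, 64) continuous features to nearest codebook entries.
    Returns discrete token ids and the embeddings those ids stand for."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(1)            # discrete "image token" ids
    return idx, codebook[idx]    # ids + their embeddings

feats = rng.standard_normal((4, 64))
ids, embs = quantize(feats)
```

Decoding runs the same lookup in reverse: token ids index the codebook, and the decoder reconstructs pixels from those embeddings, which is why text-side image output would only require the LM to emit valid codebook indices.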
https://ai.meta.com/blog/meta-fair-research-new-releases/
Any idea if this could be supported in exl2 anytime soon?