Closed realsammyt closed 1 year ago
ExLlama has no multimodal support at the moment. It supports Llama models, not Llava or any other variant that is specifically designed to work with HF Transformers. I would like it to at some point, but I don't have 48 hours a day to work on it, so it's going to take time. And it may not happen at all if something else pops up in the meantime that ends up taking priority.
Thanks for the feedback, and for the work you've done thus far. A little more insight if/when the time comes around: at the front end of things, I'm not loading the raw multimodal LLava models; typically I'm testing and using llama models that support multimodal, like `anon8231489123_vicuna-13b-GPTQ-4bit-128g` and `llama-7b-4bit`. These do run using ExLlama, as well as GPTQ-for-llama and AutoGPTQ. It's just that when an image is attached, I guess there's something in the multimodal pipelines or the ViT models underneath that causes the length to shoot 7x over the dimension size...
Anyways, thanks for the super speed gains for text to text fun, it's really impressive work you've done.
As far as I can tell Llava encodes the image into tokens and feeds them into the forward pass along with the text tokens. The patched Llama model then treats those image tokens very differently. Not sure what exactly, yet, but there's definitely a lot more code in the self-attn function.
You can easily extend ExLlama's context beyond 2048 just by changing `max_seq_len` in the config if you want to experiment and see what happens, but I doubt it would work.
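As a toy illustration of the idea (a stand-in `Config` class, not ExLlama's actual API — the real attribute is `max_seq_len` on `ExLlamaConfig`, but the classes and checks below are purely illustrative):

```python
# Stand-in sketch: the context limit is just a config value, so raising it
# lets longer sequences through the length check, even though the model was
# never trained on positions past 2048.

class Config:
    def __init__(self):
        self.max_seq_len = 2048  # ExLlama's default context length

def check_length(config, n_tokens):
    """Reject sequences longer than the configured context window."""
    return n_tokens <= config.max_seq_len

config = Config()
assert not check_length(config, 12000)  # rejected at the default limit

config.max_seq_len = 12000              # hack the limit up to experiment
assert check_length(config, 12000)      # now the long sequence is accepted
```

Raising the limit only changes what the loader accepts; output quality past the trained context is a separate question, as the rest of this thread shows.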
Yes, the UI limited me from getting the sequence length up to that number, but I'll test out a hardcoded hack later and confirm. I did get smaller values with smaller images; my guess is it's related to how the ViT chops the image into smaller chunks to analyze parts of the whole individually, but that's just a hunch. If I get any further insight I'll be sure to let you know, thanks for playing!
Gonna wrap this up with some semi-confirmation. I managed to use a simple greyscale image, which dropped the context to about 10000, hacked the max context values up to 12000 to clear the threshold, and did get some text back. At first I thought it might actually be working and it was identifying the pyramid gradient as a star. XT-W4D. Wishful thinking :)
No news here? ExLlama still does not work properly with multimodal LLMs :(
It's still just a Llama implementation, yes. The various multimodal models are all implemented as Transformers patches, replacing functions specific to Transformers' `LlamaForCausalLM` or `AutoModel`. So they don't work in ExLlama, the same way Photoshop plugins don't work in Krita. I just don't have time to re-implement all the different models that people want more efficient versions of. Unless someone wants to contribute the code, I can't see it happening anytime soon.
I want to implement an efficient version of MiniGPT-4. I need to modify the `generate_simple` method of `ExLlamaGenerator` (add an `input_embeddings` parameter, which is the concatenation of the image embedding and the text embedding).
Do you have some hints? Thanks
The model forward pass currently doesn't accept embeddings, only input IDs. So that's where you'd want to start. Add a parameter that takes an embedding tensor, then prepend that to the hidden state after the token embeddings. I don't know how minigpt works, specifically, but I imagine you'd also want to pass a mask along the whole forward pass so you don't end up applying positional embeddings to the entire hidden state.
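The idea above could be sketched roughly like this (a plain-Python stand-in: the real forward pass works on tensors, and every name here is illustrative rather than ExLlama's API — `build_inputs`, `apply_positions`, and the additive "position" are hypothetical simplifications):

```python
# Sketch: prepend image embeddings to text embeddings, and carry a mask so
# positional encoding is only applied to the text positions, not the
# whole hidden state. Embeddings are modeled as lists of float vectors.

def build_inputs(image_embeds, text_embeds):
    """Concatenate image and text embeddings; mark image slots True."""
    hidden = image_embeds + text_embeds            # [img..., txt...]
    image_mask = [True] * len(image_embeds) + [False] * len(text_embeds)
    return hidden, image_mask

def apply_positions(hidden, image_mask):
    """Stand-in for positional encoding (real RoPE rotates query/key
    vectors). Image slots are skipped and do not consume a position."""
    out, pos = [], 0
    for vec, is_image in zip(hidden, image_mask):
        if is_image:
            out.append(vec)                        # no position applied
        else:
            out.append([v + pos for v in vec])     # toy "encoding"
            pos += 1
    return out

hidden, mask = build_inputs([[0.0]], [[1.0], [2.0]])
assert mask == [True, False, False]
assert apply_positions(hidden, mask) == [[0.0], [1.0], [3.0]]
```

The mask is built once alongside the concatenated embeddings and would then be threaded through every layer of the forward pass, which is the part that actually requires touching the model code.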
Great work with this loader. I'm seeing 5x i/s improvements in Ooba and was hopeful that it would help serve up some gains when using ooba's multimodal extension (confirmed working in my current setup: Windows 10, RTX 2080 Ti, 11 GB VRAM, 96 GB RAM, CUDA 11.8, Torch 2.0.1, with Llava or miniGPT pipelines at either 7b or 13b).
When attempting to use ExLlama as the loader with any of the four multimodal setups, regular text chat or instruct works well and much faster, but as soon as I attempt to use the multimodal extension to include a photo, I get this error. Maybe you can point me in the right direction to try and resolve this?