turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Multimodal support #78

Closed realsammyt closed 1 year ago

realsammyt commented 1 year ago

Great work with this loader. I'm seeing 5x i/s improvements in Ooba, and I was hopeful it would also bring some gains when using Ooba's multimodal extension, which is confirmed working in my current setup (Windows 10, RTX 2080 Ti, 11 GB VRAM, 96 GB RAM, CUDA 11.8, Torch 2.0.1) with Llava or MiniGPT pipelines at either 7B or 13B.

When I try to use ExLlama as the loader with any of the four multimodal setups, regular text chat or instruct mode works well and much faster, but as soon as I use the multimodal extension to include a photo I get the error below. Maybe you can point me in the right direction to resolve it?


```
  File "D:\00\text-generation-webui\modules\text_generation.py", line 300, in generate_reply_custom
    for reply in shared.model.generate_with_streaming(question, state):
  File "D:\00\text-generation-webui\modules\exllama.py", line 68, in generate_with_streaming
    self.generator.gen_begin_reuse(ids)
  File "D:\00\text-generation-webui\repositories\exllama\generator.py", line 191, in gen_begin_reuse
    if reuse < in_tokens.shape[-1]: self.gen_feed_tokens(in_tokens[:, reuse:])
  File "D:\00\text-generation-webui\repositories\exllama\generator.py", line 209, in gen_feed_tokens
    self.model.forward(self.sequence[:, start:-1], self.cache, preprocess_only = True, lora = self.lora)
  File "D:\00\text-generation-webui\repositories\exllama\model.py", line 841, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "D:\00\text-generation-webui\repositories\exllama\model.py", line 459, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "D:\00\text-generation-webui\repositories\exllama\model.py", line 381, in forward
    new_keys = cache.key_states[self.index].narrow(2, past_len, q_len)
RuntimeError: start (49) + length (13970) exceeds dimension size (2048).
Output generated in 1.51 seconds (0.00 tokens/s, 0 tokens, context 14020, seed 979644525)
```
turboderp commented 1 year ago

ExLlama has no multimodal support at the moment. It supports Llama models, not Llava or any other variant that is specifically designed to work with HF Transformers. I would like it to at some point, but I don't have 48 hours a day to work on it, so it's going to take time. And it may not happen at all if something else pops up in the meantime that ends up taking priority.

realsammyt commented 1 year ago

Thanks for the feedback, and for the work you've done so far. A little more insight for if/when the time comes around: on the front end I'm not loading the raw multimodal Llava models; typically I'm testing with Llama models that support multimodal use, like anon8231489123_vicuna-13b-GPTQ-4bit-128g and llama-7b-4bit. These do run using ExLlama, as well as GPTQ-for-LLaMa and AutoGPTQ. It's only when an image is attached that something in the multimodal pipelines, or the ViT models underneath, causes the length to shoot 7x over the dimension size...

Anyway, thanks for the super speed gains for text-to-text fun; it's really impressive work you've done.

turboderp commented 1 year ago

As far as I can tell, Llava encodes the image into tokens and feeds those into the forward pass along with the text tokens. The patched Llama model then treats those image tokens very differently from regular ones. I'm not sure exactly how yet, but there's definitely a lot more code in the self-attn function.

You can easily extend ExLlama's context beyond 2048 just by changing max_seq_len in the config if you want to experiment and see what happens, but I doubt it would work.
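
For anyone who wants to try that, a minimal sketch of loading a model with a raised limit through ExLlama's Python classes might look like the following. The paths are placeholders and the loading calls follow the pattern of the repo's basic example; treat it as an assumption rather than a recipe:

```python
# Minimal sketch, assuming ExLlama's example-style loading API; paths are placeholders.
from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig("models/llama-13b-4bit/config.json")        # HF config.json of the model
config.model_path = "models/llama-13b-4bit/4bit-128g.safetensors"  # quantized weights
config.max_seq_len = 4096   # default is 2048; raise it to experiment with longer contexts

model = ExLlama(config)
cache = ExLlamaCache(model)  # the KV cache is allocated for max_seq_len positions
```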

realsammyt commented 1 year ago

Yes, the UI stopped me from getting the sequence length up to that number, but I'll test out a hardcoded hack later and confirm. I did get smaller values with smaller images; my guess is it's related to how the ViT chops the image into smaller chunks to analyze parts of the whole individually, but that's just a hunch. If I get any further insight I'll be sure to let you know. Thanks for playing!

realsammyt commented 1 year ago

Gonna wrap this up with some semi-confirmation. I managed to use a simple greyscale image, which dropped the context to about 10000, hacked the max context values up to 12000 to clear the threshold, and did get some text back. At first I thought it might actually be working and that it was identifying the pyramid gradient as a star. Wishful thinking :)

DenisSergeevitch commented 1 year ago

No news here? ExLlama still does not work properly with multimodal LLMs :(

turboderp commented 1 year ago

It's still just a Llama implementation, yes. The various multimodal models are all implemented as Transformers patches, replacing functions specific to Transformers' LlamaForCausalLM or AutoModel. So they don't work in ExLlama, the same way Photoshop plugins don't work in Krita. I just don't have time to re-implement all the different models that people want more efficient versions of. Unless someone wants to contribute the code, I can't see it happening anytime soon.

jianantian commented 1 year ago

I want to implement an efficient version of MiniGPT-4. I need to modify the generate_simple method of ExLlamaGenerator to accept input_embeddings, which is the concatenation of the image embedding and the text embedding.
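
A rough sketch of the concatenation being described (the names here are illustrative placeholders, not actual ExLlama or MiniGPT-4 code):

```python
import torch

# Illustrative only: build the combined "input_embeddings" tensor described above.
# image_features: vision encoder output projected to the LLM hidden size,
#                 shape (batch, n_image_tokens, hidden_size)
# text_ids:       tokenized prompt, shape (batch, n_text_tokens)
def build_input_embeddings(embed_tokens, image_features, text_ids):
    text_embeddings = embed_tokens(text_ids)                    # (batch, n_text_tokens, hidden_size)
    # Prepend the image features to the text embeddings along the sequence dimension.
    return torch.cat([image_features, text_embeddings], dim=1)  # (batch, n_image + n_text, hidden_size)
```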


Do you have any hints? Thanks.

turboderp commented 1 year ago

The model forward pass currently doesn't accept embeddings, only input IDs, so that's where you'd want to start. Add a parameter that takes an embedding tensor, then prepend it to the hidden state after the token embeddings. I don't know how MiniGPT-4 works specifically, but I imagine you'd also want to pass a mask along the whole forward pass so you don't end up applying positional embeddings to the entire hidden state.
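
A minimal sketch of that kind of change, assuming a heavily simplified forward pass (the real model.forward also handles the cache, per-device buffers and LoRA, and the attribute names here are illustrative rather than ExLlama's actual ones):

```python
import torch

# Illustrative only: a stripped-down forward that accepts an optional embedding
# tensor and prepends it to the token embeddings, per the suggestion above.
def forward(self, input_ids, cache, input_embeddings=None, **kwargs):
    hidden_states = self.embed_tokens(input_ids)                          # (batch, n_text, hidden)

    if input_embeddings is not None:
        # Prepend the extra embeddings (e.g. projected image features).
        hidden_states = torch.cat([input_embeddings, hidden_states], dim=1)
        # Position handling (rotary embeddings) and any attention mask would
        # also need to account for the extra prefix tokens, as noted above.

    for decoder_layer in self.layers:
        hidden_states = decoder_layer.forward(hidden_states, cache, **kwargs)

    hidden_states = self.norm(hidden_states)
    return self.lm_head(hidden_states)
```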