haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Quantizing to use with oobabooga/text-generation-webui #310

Open chigkim opened 1 year ago

chigkim commented 1 year ago

Question

I downloaded llava-llama-2-13b from: https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-lightning-preview

Then I quantized the model to 4-bit using oobabooga/GPTQ-for-LLaMa:

git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
cd GPTQ-for-LLaMa
pip install -r requirements.txt
python setup_cuda.py install
python llama.py ../models/liuhaotian_llava-llama-2-13b-chat-lightning-preview c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors ../models/liuhaotian_llava-llama-2-13b-chat-lightning-preview/liuhaotian_llava-llama-2-13b-chat-lightning-4bit-128g.safetensors

Then I loaded with --multimodal-pipeline llava-13b.

python server.py --verbose --share --chat --model-dir models --model liuhaotian_llava-llama-2-13b-chat-lightning-preview --loader exllama --max_seq_len 4096 --xformers --no-stream --deepspeed --multimodal-pipeline llava-13b

I tried many different images, but the descriptions are completely incorrect. For example, I submitted a picture of a dog, and it said a woman was holding an umbrella.

Since the model loads and outputs proper English, I assume it's quantized correctly.

I tried --load-4bit with model_worker, but it doesn't seem to support the GPTQ format.

I'd appreciate any tips!

Thanks,

haotian-liu commented 1 year ago

Hi @chigkim

Are you testing from text-gen-webui or some other repo? (I'm not familiar with these.)

I'm not sure if this is due to the prompt template. We use a template similar to LLaMA-2 chat for the llava-llama-2-chat series (something like the prompt below), which is different from the Vicuna template.

[INST] <<SYS>>
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.
<</SYS>>

<image> [/INST]
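
For concreteness, here is a minimal sketch of how this template can be assembled in Python (the helper name is illustrative; only the template text above comes from LLaVA):

SYSTEM_PROMPT = (
    "You are a helpful language and vision assistant. "
    "You are able to understand the visual content that the user provides, "
    "and assist the user with a variety of tasks using natural language."
)

def build_llava_llama2_prompt(user_message: str) -> str:
    # <image> marks where the multimodal pipeline splices in the image embeddings.
    return f"[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{user_message} [/INST]"

print(build_llava_llama2_prompt("<image>\nDescribe this picture."))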

Maybe I will have some bandwidth to try this out myself on Friday or this weekend. Any pointers on which repositories I should try first? Thanks.

chigkim commented 1 year ago

Thanks for your response! I used this repo: https://github.com/oobabooga/text-generation-webui. Here's my quantized model: https://drive.google.com/drive/folders/1-njjlAXE8JD_UnccZ15geFIMMBU5PZKC

After cloning, put the model folder inside text-generation-webui/models, so that you have text-generation-webui/models/liuhaotian_llava-llama-2-13b-chat-lightning-preview/llava-llama-2-13b-chat-4bit-128g.safetensors.

Here's my setup code.

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
pip install deepspeed mpi4py xformers
python server.py --verbose --share --chat --model-dir models --model liuhaotian_llava-llama-2-13b-chat-lightning-preview --loader exllama --max_seq_len 4096 --xformers --no-stream --deepspeed --multimodal-pipeline llava-13b

Once you open the link from Gradio, go to Chat settings > Instruction template and edit the Context. Then go back to Text generation and choose the instruct mode.

Here's my Colab notebook that you can just run; it will automatically download and load my quantized model.

https://colab.research.google.com/drive/1n9Tq9XmmTElkRHazVKi2CUaVTqVWFjfP?usp=sharing

Hope this will get you started.

chigkim commented 1 year ago

Also, I opened an issue on oobabooga/text-generation-webui: https://github.com/oobabooga/text-generation-webui/issues/3293

AlvinKimata commented 1 year ago

Thanks for your response! I used this repo: https://github.com/oobabooga/text-generation-webui. Here's my quantized model: https://drive.google.com/drive/folders/1-njjlAXE8JD_UnccZ15geFIMMBU5PZKC

After cloning, put the model folder inside text-generation-webui/models, so that you have text-generation-webui/models/liuhaotian_llava-llama-2-13b-chat-lightning-preview/llava-llama-2-13b-chat-4bit-128g.safetensors.

Here's my setup code.

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
pip install deepspeed mpi4py
pip install -U num2words omegaconf xformers
python server.py --verbose --share --chat --model-dir models --model liuhaotian_llava-llama-2-13b-chat-lightning-preview --loader exllama_hf --max_seq_len 4096 --xformers --no-stream --deepspeed --multimodal-pipeline llava-13b

Once you open the link from Gradio, go to Chat settings > Instruction template and edit the Context. Then go back to Text generation and choose the instruct mode. Hope this will get you started.

I used the instructions above when loading the model. The setup works, but when I pass an image to the model, I get an error:

You: <img src="data:image/jpeg;base64,...[base64 image data truncated]...">
Assistant:
--------------------

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/routes.py", line 427, in run_predict
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1067, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 336, in async_iteration
    return await iterator.__anext__()
  File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 329, in __anext__
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/dist-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/dist-packages/gradio/utils.py", line 312, in run_sync_iterator_async
    return next(iterator)
  File "/content/text-generation-webui/modules/chat.py", line 303, in generate_chat_reply_wrapper
    for i, history in enumerate(generate_chat_reply(text, state, regenerate, _continue, loading_message=True)):
  File "/content/text-generation-webui/modules/chat.py", line 288, in generate_chat_reply
    for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue, loading_message=loading_message):
  File "/content/text-generation-webui/modules/chat.py", line 212, in chatbot_wrapper
    for j, reply in enumerate(generate_reply(prompt + cumulative_reply, state, stopping_strings=stopping_strings, is_chat=True)):
  File "/content/text-generation-webui/modules/text_generation.py", line 28, in generate_reply
    for result in _generate_reply(*args, **kwargs):
  File "/content/text-generation-webui/modules/text_generation.py", line 211, in _generate_reply
    for reply in generate_func(question, original_question, seed, state, stopping_strings, is_chat=is_chat):
  File "/content/text-generation-webui/modules/text_generation.py", line 252, in generate_reply_HF
    question, input_ids, inputs_embeds = apply_extensions('tokenizer', state, question, input_ids, None)
  File "/content/text-generation-webui/modules/extensions.py", line 207, in apply_extensions
    return EXTENSION_MAP[typ](*args, **kwargs)
  File "/content/text-generation-webui/modules/extensions.py", line 108, in _apply_tokenizer_extensions
    prompt, input_ids, input_embeds = getattr(extension, function_name)(state, prompt, input_ids, input_embeds)
  File "/content/text-generation-webui/extensions/multimodal/script.py", line 89, in tokenizer_modifier
    prompt, input_ids, input_embeds, total_embedded = multimodal_embedder.forward(prompt, state, params)
  File "/content/text-generation-webui/extensions/multimodal/multimodal_embedder.py", line 172, in forward
    prompt_parts = self._embed(prompt_parts)
  File "/content/text-generation-webui/extensions/multimodal/multimodal_embedder.py", line 148, in _embed
    embedded = self.pipeline.embed_images([parts[i].image for i in image_indicies])
  File "/content/text-generation-webui/extensions/multimodal/pipelines/llava/llava.py", line 75, in embed_images
    image_forward_outs = self.vision_tower(images, output_hidden_states=True)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py", line 941, in forward
    return self.vision_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py", line 866, in forward
    hidden_states = self.embeddings(pixel_values)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/clip/modeling_clip.py", line 195, in forward
    patch_embeds = self.patch_embedding(pixel_values)  # shape = [*, width, grid, grid]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: weight should have at least three dimensions

What could be the issue?

chigkim commented 1 year ago

@AlvinKimata My apologies! I copied the wrong line to launch the server. The exllama_hf loader throws the error you got; loading with exllama (without _hf) should work: --loader exllama

python server.py --verbose --share --chat --model-dir models --model liuhaotian_llava-llama-2-13b-chat-lightning-preview --loader exllama --max_seq_len 4096 --xformers --no-stream --deepspeed --multimodal-pipeline llava-13b

@haotian-liu Here's my Colab notebook that you can just run.

https://colab.research.google.com/drive/1n9Tq9XmmTElkRHazVKi2CUaVTqVWFjfP?usp=sharing

haotian-liu commented 1 year ago

Hi @chigkim @AlvinKimata

I looked into the oobabooga text-gen-ui this weekend, and I got our LLaMA-2 checkpoint working with some modifications to the code. I am working on a PR and plan to finalize it when we release our new LLaMA-2-based checkpoints in the coming days.

To try it out with the text-gen-ui, you can use my fork for now; download my quantized checkpoints here and put them under the models folder.

I have tested the following command and it works:

python server.py \
    --chat \
    --model-dir models \
    --model llava-llama-2-13b-chat-lightning-4bit-128g \
    --multimodal-pipeline llava-llama-2-13b

Two things I still need to look into:

  1. It seems that when exllama is used (--loader exllama), the multimodal plugin is not properly loaded. I have submitted an issue: https://github.com/oobabooga/text-generation-webui/issues/3378. So for now, I need to resort to AutoGPTQ for this.
  2. The quantization quality. It seems that the quantized checkpoints I generated using GPTQ-for-LLaMa are not compatible with AutoGPTQ, and they generate garbled responses. I tried some other checkpoints quantized by TheBloke, and some of them also generate garbled outputs (e.g., TheBloke/Llama-2-7b-Chat-GPTQ:gptq-4bit-32g-actorder_True). As a workaround, I now quantize the checkpoints using AutoGPTQ, and the checkpoint is uploaded to HF as provided above. I am not sure about the quantization quality yet.

What I may need help with:

  1. What are the ways to improve the quantization quality of LLaVA?
  2. Can checkpoints quantized with GPTQ-for-LLaMa be made compatible with AutoGPTQ?

Don-Chad commented 1 year ago

From what I know, AutoGPTQ does something smart with the data: it works from calibration prompts to do a high-quality compression, aiming to offer 32-bit quality at 4-bit speed. I'm not sure GPTQ-for-LLaMa does the same.

Running with exllama would offer a great speed advantage; benchmarks come out at about double the speed, so that would be cool. I think it should be possible without major changes to the exllama engine or to text-generation-webui, since the visual analysis (BLIP/CLIP) is already separate from the language model engine, right?
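
That separation is visible in the traceback earlier in this thread: the multimodal extension embeds images with a standalone CLIP vision tower and splices the projected features into the language model's input embeddings. A schematic sketch of that flow, with illustrative names (this is not the actual webui API):

import torch
from dataclasses import dataclass
from typing import Optional

@dataclass
class PromptPart:
    token_ids: Optional[torch.Tensor] = None  # text chunk token ids, shape [1, seq]
    image: Optional[torch.Tensor] = None      # preprocessed pixels, shape [1, 3, H, W]

def embed_prompt(parts, vision_tower, mm_projector, embed_tokens):
    # Mirrors the call chain in the traceback (embed_images -> CLIP vision
    # tower); the vision tower is loaded separately from whatever backend
    # serves the (possibly quantized) language model.
    chunks = []
    for part in parts:
        if part.image is not None:
            out = vision_tower(part.image, output_hidden_states=True)
            feats = out.hidden_states[-2]       # LLaVA uses the penultimate layer
            chunks.append(mm_projector(feats))  # project into LM embedding space
        else:
            chunks.append(embed_tokens(part.token_ids))
    return torch.cat(chunks, dim=1)             # fed to the LM as inputs_embeds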

Don-Chad commented 1 year ago

@haotian-liu Would you please share the settings you used for the AutoGPTQ quantization, if any? I'd like to see if I can get GPTQ-for-LLaMa working.

haotian-liu commented 1 year ago

Hi @Don-Chad

Thank you for sharing the info and for offering the help!

Regarding exllama, it seems to be a compatibility issue with the text-gen-ui itself, as also confirmed in https://github.com/oobabooga/text-generation-webui/pull/3377#issuecomment-1658311157

Regarding the AutoGPTQ quantization I used to create this checkpoint: I basically used the sample script from the AutoGPTQ repo, which I attach here.

You need to make a few modifications to your checkpoint, though, to properly run the script:

  1. Download https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-lightning-preview to a local folder.
  2. Make the following edits to config.json:
    LlavaLlamaForCausalLM -> LlamaForCausalLM
    "model_type": "llava" -> "model_type": "llama"
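
For orientation, a minimal sketch of AutoGPTQ's basic quantization flow with the 4-bit/128g settings used here (paths and calibration text are placeholders, not the attached script):

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_dir = "models/liuhaotian_llava-llama-2-13b-chat-lightning-preview"  # config.json edited as above
out_dir = "models/llava-llama-2-13b-chat-lightning-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_dir, quantize_config)

# AutoGPTQ calibrates on sample prompts; a real run would use a dataset such as C4.
examples = [tokenizer("LLaVA is a large language and vision assistant.", return_tensors="pt")]
model.quantize(examples)

model.save_quantized(out_dir, use_safetensors=True)
tokenizer.save_pretrained(out_dir)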

Note that I am quite new to these quantization techniques, so they are probably not the best parameters to use. Please share if you have any thoughts or findings to get better quality/compatibility.

Thanks.

chigkim commented 1 year ago

For whatever it's worth, I got a model quantized with oobabooga/GPTQ-for-LLaMa to work with the AutoGPTQ loader. If you quantize using qwopqwop200/GPTQ-for-LLaMa, it doesn't work. Here's what I did:

  1. Modified config.json as @haotian-liu described above.
  2. Quantized using the command below:
    python repositories/GPTQ-for-LLaMa/llama.py models/liuhaotian_llava-llama-2-13b-chat-lightning-preview c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors models/liuhaotian_llava-llama-2-13b-chat-lightning-preview/llava-llama-2-13b-chat-4bit-128g.safetensors
  3. Checked out the pull request.
    git fetch origin pull/3377/head:br3377
    git checkout br3377
  4. Loaded using --multimodal-pipeline llava-llama-2-13b. The output was correct. However, this still doesn't work with exllama. Hopefully they can fix it soon, because exllama is significantly faster.
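
For anyone verifying their own conversion, a quick text-only sanity check with the AutoGPTQ Python API might look like the sketch below (paths and the basename are placeholders); if the model completes plain English prompts coherently, the language model weights survived quantization:

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = "models/liuhaotian_llava-llama-2-13b-chat-lightning-preview"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    model_basename="llava-llama-2-13b-chat-4bit-128g",  # safetensors filename minus extension
    use_safetensors=True,
    device="cuda:0",
)

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))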

anan1030 commented 1 month ago

Hi, does anyone know the Python code to do inference without using the webui?