bhavyajoshi-mahindra opened this issue 5 days ago
This particular error should have been fixed by #9721. Note that vLLM doesn't officially support Windows installations. Please also see #9701
I have switched to Linux (Colab). I fine-tuned Qwen2-VL with LoRA using Llama-factory and merged the adapter into the original weights. Then I quantized the merged weights with GPTQ using AutoGPTQ. Now I want to run inference on the quantized model using vLLM, but I get this error:
ValueError                                Traceback (most recent call last)
<ipython-input-5-06a68f93c72a> in <cell line: 7>()
      5 MODEL_PATH = "/content/drive/MyDrive/LLM/vinplate2-gwen2-vl-gptq-4bit"
      6
----> 7 llm = LLM(
      8     model=MODEL_PATH,
      9     limit_mm_per_prompt={"image": 10, "video": 10},

(10 intermediate frames collapsed)

/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2_vl.py in load_weights(self, weights)
   1201                 param = params_dict[name]
   1202             except KeyError:
-> 1203                 raise ValueError(f"Unexpected weight: {name}") from None
   1204
   1205             weight_loader = getattr(param, "weight_loader",

ValueError: Unexpected weight: model.layers.0.mlp.down_proj.g_idx
Here is my environment:
transformers: 4.46.1
vllm: 0.6.3.post2.dev165+g33d25773
torch: 2.4.1+cu121
flash_attn: 2.6.3
CUDA: 12.2
python: 3.10.12
OS: Linux
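For context, a minimal loading sketch consistent with the traceback above; the model path and `limit_mm_per_prompt` values are taken from the traceback, everything else is assumed:

```python
# Minimal sketch of the failing load, reconstructed from the traceback above.
# The model path and limit_mm_per_prompt come from the traceback; the rest is
# an assumption for illustration.
from vllm import LLM

MODEL_PATH = "/content/drive/MyDrive/LLM/vinplate2-gwen2-vl-gptq-4bit"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)
# The ValueError is raised inside Qwen2VLForConditionalGeneration.load_weights()
# when a checkpoint tensor (e.g. "model.layers.0.mlp.down_proj.g_idx") has no
# matching parameter among those vLLM allocated from the config.
```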
Can you try using the latest main branch of vLLM? #9772 might already have fixed this issue.
cc @mgoin
Still getting the same error after installing vLLM from the main branch.
transformers: 4.46.1
vllm: 0.6.3.post2.dev174+g5608e611.d20241031
torch: 2.5.0+cu124
flash_attn: 2.6.3
Using main before https://github.com/vllm-project/vllm/pull/9817 landed, I am able to load Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4 just fine in vLLM. However, as we just landed support for quantizing the vision transformer, this broke GPTQ checkpoints for this model (and many other VLMs using GPTQ are likely broken as well).
@DarkLight1337 this gets into the larger issue we have with enabling quantization for more modules in vLLM: many quantization methods/configurations do not have proper "ignored" lists of modules.
As an example, if you look at Qwen's official GPTQ checkpoint for Qwen2-VL, you can see that all of the "model." submodules are quantized but none of the "visual." ones are: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4?show_file_info=model.safetensors.index.json
However, within that model's gptq quantization_config there is nothing specifying that those modules were ignored - it looks like the config should be applied everywhere (see the sketch after the config below): https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4/blob/main/config.json#L20-L30
"quantization_config": {
"bits": 4,
"damp_percent": 0.1,
"dataset": null,
"desc_act": false,
"group_size": 128,
"modules_in_block_to_quantize": null,
"quant_method": "gptq",
"sym": true,
"true_sequential": true
},
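For anyone who wants to verify this themselves, here is a rough sketch (assuming the `huggingface_hub` package is available; the tensor-name suffixes checked are the usual GPTQ ones) that reports which top-level modules of that checkpoint actually carry GPTQ tensors:

```python
# Rough sketch: inspect which submodules of a GPTQ checkpoint actually carry
# quantized tensors, since the quantization_config itself has no ignore list.
import json
from huggingface_hub import hf_hub_download

repo = "Qwen/Qwen2-VL-7B-Instruct-GPTQ-Int4"
index_path = hf_hub_download(repo, "model.safetensors.index.json")
config_path = hf_hub_download(repo, "config.json")

with open(index_path) as f:
    weight_names = json.load(f)["weight_map"].keys()
with open(config_path) as f:
    print("quantization_config:", json.load(f)["quantization_config"])

quantized_prefixes = sorted(
    {n.split(".")[0] for n in weight_names if n.endswith((".qweight", ".g_idx"))})
plain_prefixes = sorted(
    {n.split(".")[0] for n in weight_names if n.endswith(".weight")})
print("prefixes with GPTQ tensors:", quantized_prefixes)  # e.g. ['model'] only
print("prefixes with plain weights:", plain_prefixes)     # includes 'visual'
```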
Luckily not all quant configs have this issue - obviously compressed-tensors has an ignore list, and AWQ has a "modules_to_not_convert" list
Is it feasible to change the model initialization code to switch between the regular and the quantized version based on whether the corresponding weight is available from the model file?
Not easily at all. We commonly rely on the assumption that we can allocate and distribute the model parameters by looking at the model config. Model loading from the weights is a separate step
Just so I understand: is something wrong with how I quantized the model, or is something wrong with how vLLM loads it?
> Not easily at all. We commonly rely on the assumption that we can allocate and distribute the model parameters by looking at the model config. Model loading from the weights is a separate step
Hmm, a more practical way might be to let the user specify additional config arguments via CLI then...
@bhavyajoshi-mahindra the issue is that AutoGPTQ will not quantize the visual section of qwen2-vl, but it does not leave anything in the config to signify that those linear layers are skipped
@DarkLight1337 I think we should simply add a special case for GPTQ models, like was done here for AWQ https://github.com/vllm-project/vllm/blob/5608e611c2116cc17c6808b2ae1ecb4a3e263493/vllm/model_executor/models/internvl.py#L453-L463
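A rough sketch of the kind of special case being suggested, modeled loosely on the linked InternVL code; the helper name, the GPTQ extension, and the exact import paths are assumptions for illustration, not vLLM's actual implementation:

```python
# Rough sketch: drop the quant config for the vision tower when the method is
# known to skip it in practice but records no ignore list in the config.
# Names are illustrative; the *_marlin variants may need the same treatment.
from typing import Optional

from vllm.model_executor.layers.quantization.awq import AWQConfig
from vllm.model_executor.layers.quantization.base_config import QuantizationConfig
from vllm.model_executor.layers.quantization.gptq import GPTQConfig


def _maybe_ignore_quant_config(
        quant_config: Optional[QuantizationConfig]
) -> Optional[QuantizationConfig]:
    # GPTQ (like AWQ) checkpoints for these VLMs leave the vision transformer
    # unquantized, so build its layers as if no quantization were configured.
    if isinstance(quant_config, (AWQConfig, GPTQConfig)):
        return None
    return quant_config
```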
> @bhavyajoshi-mahindra the issue is that AutoGPTQ will not quantize the visual section of qwen2-vl, but it does not leave anything in the config to signify that those linear layers are skipped
> @DarkLight1337 I think we should simply add a special case for GPTQ models, like was done here for AWQ
That may work for now. Does AWQ have an implicit list of modules that it quantizes? What if this changes in the future?
The thread here seems to indicate that AWQ should work, but I get the same issue with the AWQ version.
raise ValueError(f"Unexpected weight: {name}") from None
ValueError: Unexpected weight: visual.blocks.0.attn.proj.weight
Yet the layer is specified as unconverted in the config file:
"quantization_config": {
"bits": 4,
"group_size": 128,
"modules_to_not_convert": [
"visual"
],
"quant_method": "awq",
"version": "gemm",
"zero_point": true
},
I'm trying with the latest main.
Thanks for testing @cedonley. It seems that if you run `vllm serve Qwen/Qwen2-VL-2B-Instruct-AWQ`, it will fail with your error because awq_marlin isn't obeying the ignore list. However, if you force the vanilla awq backend, I am able to load the model fine. I will put up a fix for this!
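For anyone hitting this before the fix lands, a minimal workaround sketch using vLLM's standard `quantization` override (the CLI equivalent would be `vllm serve Qwen/Qwen2-VL-2B-Instruct-AWQ --quantization awq`); whether this is the exact invocation mgoin used is an assumption:

```python
# Workaround sketch: force the vanilla "awq" backend, which respects
# modules_to_not_convert, instead of the auto-selected awq_marlin backend.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2-VL-2B-Instruct-AWQ",
    quantization="awq",  # skip the automatic upgrade to awq_marlin
)
```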
Your current environment
I tried to run inference with my custom Qwen2-VL GPTQ 4-bit model using the code below:
I got this error:
Note:
1) "Qwen2VLForConditionalGeneration" is in the list of supported models, but I still got the error.
2) collect_env.py says "Is CUDA available: False", but nvcc --version reports (see the quick check below):
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
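On point 2: `nvcc --version` only reports the installed CUDA compiler toolkit; it does not tell you whether PyTorch can reach a GPU at runtime, which is what vLLM needs. A quick check with standard PyTorch calls:

```python
# Quick sanity check: nvcc shows the CUDA *compiler*; vLLM needs a CUDA runtime
# that PyTorch can actually see. These are standard torch calls.
import torch

print(torch.__version__)          # e.g. 2.4.1+cu121
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # must be True for GPU inference
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```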
Can anyone help me with this?