vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

can model Qwen/Qwen-VL-Chat work well? #962

Open wangschang opened 11 months ago

wangschang commented 11 months ago

When I use Qwen/Qwen-VL-Chat, it throws the following error and I do not know why:

```
Traceback (most recent call last):
  File "test.py", line 20, in <module>
    model = LLM(model=model_path, tokenizer=model_path, tokenizer_mode='slow', tensor_parallel_size=1, trust_remote_code=True)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 66, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 220, in from_engine_args
    engine = cls(*engine_configs,
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 101, in __init__
    self._init_workers(distributed_init_method)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 133, in _init_workers
    self._run_workers(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 470, in _run_workers
    output = executor(*args, **kwargs)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/worker/worker.py", line 67, in init_model
    self.model = get_model(self.model_config)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/model_executor/model_loader.py", line 57, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/usr/local/miniconda3/lib/python3.8/site-packages/vllm/model_executor/models/qwen.py", line 308, in load_weights
    param = state_dict[name]
KeyError: 'transformer.visual.positional_embedding'
```

The code is:

```python
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
import time

model_path = "Qwen/Qwen-VL-Chat"

model = LLM(model=model_path, tokenizer=model_path, tokenizer_mode='slow', tensor_parallel_size=1, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, legacy=True, trust_remote_code=True)

sampling_params = SamplingParams(temperature=0, max_tokens=8096)
start = time.time()
prompts = ["你好!"]
outputs = model.generate(prompts, sampling_params)
end = time.time()

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    length = len(generated_text)
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

print(end - start)
cost = end - start
print(f"{length/cost}tokens/s")
```

iFe1er commented 9 months ago

same question here

hntee commented 8 months ago

Same issue. I think this is because model_executor/models/qwen.py only supports Qwen-7B-Chat and is not compatible with Qwen-VL. https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/qwen.py
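For context, this matches the traceback above: the weight loader looks each checkpoint tensor up in the model's parameter table, and vLLM's Qwen module only declares the text-side parameters, so the Qwen-VL vision-tower tensors have no match. A minimal, runnable illustration of that failure mode (the parameter names and shapes here are illustrative, not vLLM's exact code):

```python
import torch

# Pretend this is the parameter table of vLLM's text-only Qwen model.
state_dict = {
    "transformer.wte.weight": torch.empty(10, 4),
    "transformer.h.0.attn.c_attn.weight": torch.empty(4, 12),
}

# Tensor names present in the Qwen-VL-Chat checkpoint include ViT weights.
checkpoint_names = [
    "transformer.wte.weight",
    "transformer.visual.positional_embedding",  # vision tower -> no matching parameter
]

for name in checkpoint_names:
    param = state_dict[name]  # raises KeyError on 'transformer.visual.positional_embedding'
```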

ruifengma commented 5 months ago

same issue

linzm1007 commented 3 months ago

Has this been resolved yet?

NDDec commented 3 months ago

same issue

hmellor commented 2 months ago

Please stop commenting "same issue"; just react to the original message to show your support.

DamonFool commented 2 months ago

For text-only inputs, we can run the model with this patch https://github.com/vllm-project/vllm/pull/5710 .
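For anyone who only needs text-only generation, usage with that patch should look roughly like the original snippet. A minimal sketch (model name taken from this thread; sampling values are arbitrary):

```python
from vllm import LLM, SamplingParams

# Text-only usage of Qwen-VL-Chat, assuming the patch above is applied.
llm = LLM(model="Qwen/Qwen-VL-Chat", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0, max_tokens=512)

outputs = llm.generate(["你好!"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```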

alex-jw-brooks commented 2 weeks ago

I am looking into adding support for image inputs for Qwen-VL/Qwen-VL-Chat 😄
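If that lands, image inputs would presumably go through vLLM's existing multi-modal prompt interface. A hypothetical sketch, assuming Qwen-VL follows the same pattern as other vision-language models in vLLM (the image placeholder in the prompt is a guess, not a confirmed template):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Hypothetical: assumes future Qwen-VL support reuses vLLM's multi-modal input format.
llm = LLM(model="Qwen/Qwen-VL-Chat", trust_remote_code=True)

image = Image.open("demo.jpeg")
outputs = llm.generate(
    {
        "prompt": "Picture 1: <img></img>\nWhat is shown in the image?",  # placeholder template, not verified
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```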