vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Qwen1.5-7B-Chat failed #2785

Closed: wxz1996 closed this issue 27 minutes ago

wxz1996 commented 6 months ago

```
Traceback (most recent call last):
  File "/home/orbbec/VLM/qwen/vllm_test.py", line 11, in <module>
    llm = LLM(model="/home/orbbec/VLM/qwen/model/qwen1.5/Qwen1.5-7B-Chat",
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 356, in from_engine_args
    engine = cls(*engine_configs,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 331, in _init_cache
    raise ValueError(
ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (1984). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
```

I saw on the official Qwen (Qianwen) website that version 0.30.0 is supported. I tried it and got the error above. What might have caused it?
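The arithmetic behind the error: each cached token costs a fixed amount of KV memory, so a 32768-token budget simply does not fit in what is left after the fp16 weights are loaded. A back-of-the-envelope sketch (model dimensions assumed from the public Qwen1.5-7B config; this is not vLLM's actual accounting):

```python
# Back-of-the-envelope KV-cache sizing (not vLLM internals).
# Assumed from the public Qwen1.5-7B config: 32 layers, 32 KV heads,
# head_dim 128, fp16 (2 bytes per element).
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

# Each token stores one K and one V vector per layer.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(kv_bytes_per_token // 1024, "KiB per token")          # 512 KiB

# The 1984 tokens vLLM could fit correspond to roughly 1 GiB of free memory...
print(round(1984 * kv_bytes_per_token / 2**30, 2), "GiB")   # ~0.97 GiB

# ...while a full 32768-token context would need ~16 GiB of KV cache on top
# of ~14 GiB of fp16 weights, hence the ValueError on this GPU.
print(round(32768 * kv_bytes_per_token / 2**30, 2), "GiB")  # 16.0 GiB
```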

wxz1996 commented 6 months ago

For comparison, Qwen-7B-Chat runs properly.

ray-008 commented 6 months ago

same error

kindernerd commented 6 months ago

Qwen1.5 supports a larger max sequence length (32768), so by default it reserves more GPU memory for the KV cache. Decrease the max sequence length when starting the engine, or use a GPU with more memory.
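With the offline `LLM` entrypoint from the traceback above, both knobs can be passed to the constructor. A minimal sketch, reusing the reporter's model path; the specific values are illustrative:

```python
from vllm import LLM

llm = LLM(
    model="/home/orbbec/VLM/qwen/model/qwen1.5/Qwen1.5-7B-Chat",
    max_model_len=8192,           # cap the context so the KV cache fits
    gpu_memory_utilization=0.95,  # optionally give vLLM a larger share of GPU memory
)
```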

ray-008 commented 6 months ago

Just set this parameter when starting up: `--max-model-len 8192`
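If you are launching the OpenAI-compatible server rather than the offline `LLM` class, the same flags apply on the command line (a sketch; the model path is the reporter's and the other values are illustrative):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model /home/orbbec/VLM/qwen/model/qwen1.5/Qwen1.5-7B-Chat \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95
```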