vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Does vLLM support Qwen/Qwen1.5-32B-Chat-AWQ? It works the first time, then stops generating responses. #3872

Open · sungkim11 opened this issue 5 months ago

sungkim11 commented 5 months ago

Your current environment

vllm docker image: vllm/vllm-openai:latest

🐛 Describe the bug

It works the first time, then stops generating responses, as shown below.

ChatCompletion(id='cmpl-19b57e1ef1dc41edb57f37fa9bb66151', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='\n\n', role='assistant', function_call=None, tool_calls=None), stop_reason=None)], created=1712343257, model='Qwen/Qwen1.5-32B-Chat-AWQ', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=2, prompt_tokens=78, total_tokens=80))
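
For reference, a minimal sketch of the kind of call that produces the response above, assuming the OpenAI-compatible server from the docker image is reachable at localhost:8000 (host, port, and prompt are placeholders, not taken from the report):

```python
# Minimal reproduction sketch (assumptions: server at localhost:8000, dummy prompt).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen1.5-32B-Chat-AWQ",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize what vLLM does in one sentence."},
    ],
    max_tokens=256,
)

# Per the report, the first request answers normally; later requests come back
# with finish_reason='stop' and content='\n\n' as shown above.
print(response)
```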

sungkim11 commented 5 months ago

Same problem with Qwen/Qwen1.5-32B-Chat-GPTQ-Int4

exceedzhang commented 5 months ago

I'm encountering the same problem; it occurs when using the streaming interface!

[screenshot]
exceedzhang commented 5 months ago

[OK] test_completion(model, logprob)
[OK] test_completion_stream(model)
[ERROR] test_chat_completion(model)
[ERROR] test_chat_completion_stream(model)
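
These presumably correspond to the /v1/completions and /v1/chat/completions routes of the OpenAI-compatible server; a minimal sketch of the two call paths being compared (base URL, prompts, and max_tokens are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
model = "Qwen/Qwen1.5-32B-Chat-AWQ"  # model name per the issue title; adjust as needed

# /v1/completions path: reported as working (streaming and non-streaming).
completion = client.completions.create(
    model=model, prompt="Hello, my name is", max_tokens=32
)

# /v1/chat/completions path: reported as broken (streaming and non-streaming).
chat = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=32,
)

print(completion.choices[0].text)
print(chat.choices[0].message.content)
```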

kaixindelele commented 5 months ago

How much GPU memory do you need to run qwen32B-gptq-int4 with vLLM? I can never get it to start on my 3090 with 24 GB.

sungkim11 commented 5 months ago

You will need 2 or more GPUs.

kaixindelele commented 5 months ago

Haha, got it. I'm just too poor. I kept imagining it was only a configuration problem on my side, but it turns out it really isn't possible~~

lee0v0 commented 5 months ago

It can work: a 24 GB 4090 can start it with max-model-len=4000, but the output occasionally glitches into nothing but exclamation marks. You can give it a try.

kaixindelele commented 5 months ago

Do you have a recommended launch command? With the command below on a 3090 it failed to start: python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1.5-32B-Chat-GPTQ-Int4 --model /media/lyl/data7/qwen_data/Qwen1.5-32B-Chat-GPTQ-Int4 --gpu-memory-utilization 0.9 --max-model-len 512 --port 9898

lee0v0 commented 5 months ago

Try this: python -m vllm.entrypoints.openai.api_server --model ./models/Qwen/Qwen1.5-32B-Chat-GPTQ-Int4 --quantization gptq --max-model-len 2048 --served-model-name qwen --port 9000 --max-num-seq=2 --gpu-memory-utilization 0.92
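
A quick smoke test against a server started with that command could look like this (a sketch; the host and prompt are assumptions, while the port and served model name come from the command above):

```python
from openai import OpenAI

# Port 9000 and served-model-name "qwen" come from the launch command above;
# localhost and the prompt are assumptions for illustration.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Hello, please introduce yourself."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```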

kaixindelele commented 5 months ago

Thanks a lot! I tried it and it runs now. I'm not sure whether --quantization gptq or --max-num-seq=2 made the difference. I can go as high as --max-num-seq=5, but it's still too slow, so I'll just switch my site back to the 14B model and stop struggling. Thanks again!

KrianJ commented 5 months ago

Why does the same command balloon to 74 GB of GPU memory when I deploy it on an A800 server? 🥲

INFO 04-11 18:00:30 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-11 18:00:30 api_server.py:150] args: Namespace(host=None, port=13280, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='qwen', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='./data/models/Qwen1.5-32B-Chat-GPTQ-Int4', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=2048, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.92, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=2, max_logprobs=5, disable_log_stats=False, quantization='gptq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 04-11 18:00:30 config.py:211] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-11 18:00:30 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='./data/models/Qwen1.5-32B-Chat-GPTQ-Int4', tokenizer='./data/models/Qwen1.5-32B-Chat-GPTQ-Int4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-11 18:00:31 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-11 18:00:31 selector.py:25] Using XFormers backend.
INFO 04-11 18:00:40 model_runner.py:104] Loading model weights took 18.1536 GB
INFO 04-11 18:00:43 gpu_executor.py:94] # GPU blocks: 13690, # CPU blocks: 1024
INFO 04-11 18:00:44 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-11 18:00:44 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-11 18:00:45 model_runner.py:867] Graph capturing finished in 0 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-11 18:00:45 serving_chat.py:331] Using default chat template:
INFO 04-11 18:00:45 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-11 18:00:45 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-11 18:00:45 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-11 18:00:45 serving_chat.py:331] ' + message['content'] + '<|im_end|>' + '
INFO 04-11 18:00:45 serving_chat.py:331] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 04-11 18:00:45 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

[screenshot]

VeryVery commented 5 months ago

Because 80 GB × 0.92 = 73.6 GB. gpu-memory-utilization is the fraction of the total GPU memory that vLLM pre-allocates, so on an 80 GB A800 it reserves about 74 GB.
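
As a rough sanity check against the log above (a back-of-envelope sketch, not vLLM's exact accounting), the reserved budget splits between the model weights and the pre-allocated KV-cache blocks:

```python
# Numbers taken from the posted log; the split below is approximate.
reserved_gb = 80 * 0.92            # gpu_memory_utilization=0.92 on an 80 GB A800 -> 73.6 GB
weights_gb = 18.1536               # "Loading model weights took 18.1536 GB"
kv_and_overhead_gb = reserved_gb - weights_gb   # left for KV cache, activations, CUDA graphs

gpu_blocks = 13690                 # "# GPU blocks: 13690"
block_size = 16                    # block_size=16 from the args dump
kv_capacity_tokens = gpu_blocks * block_size    # ~219k tokens of KV-cache capacity

print(f"reserved ~{reserved_gb:.1f} GB: {weights_gb:.1f} GB weights + "
      f"~{kv_and_overhead_gb:.1f} GB KV cache/overhead (~{kv_capacity_tokens} cached tokens)")
```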

KrianJ commented 5 months ago

🫠 Thanks. I thought gpu-memory-utilization set the share of GPU memory allocated for CUDA graph inference; I had misunderstood it.

davidjia1972 commented 4 months ago

Sigh, in my case any slightly more complex question produces nothing but a string of exclamation marks. With 14B Int4 everything works fine.