vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Does vLLM support Qwen/Qwen1.5-32B-Chat-AWQ? It works for the first time then stops generating responses. #3872

Open sungkim11 opened 7 months ago

sungkim11 commented 7 months ago

Your current environment

vllm docker image: vllm/vllm-openai:latest

🐛 Describe the bug

It works the first time, then stops generating responses, as shown below.

ChatCompletion(id='cmpl-19b57e1ef1dc41edb57f37fa9bb66151', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='\n\n', role='assistant', function_call=None, tool_calls=None), stop_reason=None)], created=1712343257, model='Qwen/Qwen1.5-32B-Chat-AWQ', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=2, prompt_tokens=78, total_tokens=80))
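For context, a minimal sketch of the kind of request that produces the response above, assuming the Docker image's default OpenAI-compatible endpoint (host, port, and prompt here are illustrative, not taken from the report):

```python
# Illustrative reproduction sketch; endpoint and prompt are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen1.5-32B-Chat-AWQ",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
# After the first successful call, later responses come back with
# finish_reason='stop' and content that is only '\n\n', as shown above.
print(resp)
```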

sungkim11 commented 7 months ago

Same problem with Qwen/Qwen1.5-32B-Chat-GPTQ-Int4

exceedzhang commented 7 months ago

I'm encountering the same problem; it occurs when using the streaming interface!

(screenshot attached)
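For reference, a minimal sketch of a streaming chat request against the same OpenAI-compatible server (host, port, and prompt are illustrative assumptions):

```python
# Illustrative streaming sketch; endpoint and prompt are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen1.5-32B-Chat-AWQ",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; in the failing case the
    # stream ends after little or no content has been produced.
    print(chunk.choices[0].delta.content or "", end="")
```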
exceedzhang commented 7 months ago

[OK] test_completion(model, logprob)
[OK] test_completion_stream(model)

[ERROR] test_chat_completion(model)
[ERROR] test_chat_completion_stream(model)

kaixindelele commented 7 months ago

How many GB of GPU memory do you need to run qwen32B-gptq-int4 with vLLM? I can never get it to start on my 24 GB 3090.

sungkim11 commented 7 months ago

You will need 2 or more GPUs.
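For illustration, a minimal sketch of sharding the model across two GPUs with vLLM's offline Python API (the GPU count, model path, and sampling settings are assumptions, not values from this thread):

```python
# Illustrative only: tensor parallelism splits the quantized 32B weights and
# KV cache across two GPUs instead of one.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen1.5-32B-Chat-GPTQ-Int4",
    quantization="gptq",
    tensor_parallel_size=2,  # assumed GPU count
    max_model_len=2048,
)

outputs = llm.generate(
    ["Hello, who are you?"],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

The equivalent server option is --tensor-parallel-size 2 on vllm.entrypoints.openai.api_server.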

kaixindelele commented 7 months ago

Haha, got it. So I'm just too poor. I kept imagining it was only because I hadn't configured it right, but it turns out it really isn't possible~~

lee0v0 commented 7 months ago

> Haha, got it. So I'm just too poor. I kept imagining it was only because I hadn't configured it right, but it turns out it really isn't possible~~

It is possible: a 24 GB 4090 can start it with max-model-len=4000, though it occasionally glitches and outputs nothing but exclamation marks. You can give it a try.

kaixindelele commented 7 months ago

> It is possible: a 24 GB 4090 can start it with max-model-len=4000, though it occasionally glitches and outputs nothing but exclamation marks. You can give it a try.

Do you have a recommended launch command? With the command below it would not start on my 3090:

python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1.5-32B-Chat-GPTQ-Int4 --model /media/lyl/data7/qwen_data/Qwen1.5-32B-Chat-GPTQ-Int4 --gpu-memory-utilization 0.9 --max-model-len 512 --port 9898

lee0v0 commented 7 months ago

Give this one a try:

python -m vllm.entrypoints.openai.api_server --model ./models/Qwen/Qwen1.5-32B-Chat-GPTQ-Int4 --quantization gptq --max-model-len 2048 --served-model-name qwen --port 9000 --max-num-seq=2 --gpu-memory-utilization 0.92

kaixindelele commented 7 months ago

Thanks a lot! I tried it and it does run now. I'm not sure whether it was --quantization gptq or --max-num-seq=2 that did it. I can go up to --max-num-seq=5 at most, but it's still too slow, so I'll just switch my site back to the 14B model and stop struggling. Thanks again!

KrianJ commented 7 months ago

> python -m vllm.entrypoints.openai.api_server --model ./models/Qwen/Qwen1.5-32B-Chat-GPTQ-Int4 --quantization gptq --max-model-len 2048 --served-model-name qwen --port 9000 --max-num-seq=2 --gpu-memory-utilization 0.92

Why does the same command balloon to 74 GB of GPU memory when I deploy it on an A800 server? 🥲

INFO 04-11 18:00:30 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-11 18:00:30 api_server.py:150] args: Namespace(host=None, port=13280, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='qwen', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='./data/models/Qwen1.5-32B-Chat-GPTQ-Int4', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=2048, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.92, forced_num_gpu_blocks=None, max_num_batched_tokens=None, max_num_seqs=2, max_logprobs=5, disable_log_stats=False, quantization='gptq', enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 04-11 18:00:30 config.py:211] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-11 18:00:30 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='./data/models/Qwen1.5-32B-Chat-GPTQ-Int4', tokenizer='./data/models/Qwen1.5-32B-Chat-GPTQ-Int4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-11 18:00:31 selector.py:51] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 04-11 18:00:31 selector.py:25] Using XFormers backend.
INFO 04-11 18:00:40 model_runner.py:104] Loading model weights took 18.1536 GB
INFO 04-11 18:00:43 gpu_executor.py:94] # GPU blocks: 13690, # CPU blocks: 1024
INFO 04-11 18:00:44 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-11 18:00:44 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 04-11 18:00:45 model_runner.py:867] Graph capturing finished in 0 secs.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 04-11 18:00:45 serving_chat.py:331] Using default chat template:
INFO 04-11 18:00:45 serving_chat.py:331] {% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
INFO 04-11 18:00:45 serving_chat.py:331] You are a helpful assistant<|im_end|>
INFO 04-11 18:00:45 serving_chat.py:331] ' }}{% endif %}{{'<|im_start|>' + message['role'] + '
INFO 04-11 18:00:45 serving_chat.py:331] ' + message['content'] + '<|im_end|>' + '
INFO 04-11 18:00:45 serving_chat.py:331] '}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
INFO 04-11 18:00:45 serving_chat.py:331] ' }}{% endif %}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

(screenshot attached)

VeryVery commented 7 months ago

> Why does the same command balloon to 74 GB of GPU memory when I deploy it on an A800 server? 🥲

Because 80 GB × 0.92 = 73.6 GB: --gpu-memory-utilization 0.92 tells vLLM to reserve 92% of the GPU's total memory, not just what the model weights need.

KrianJ commented 7 months ago

> Because 80 GB × 0.92 = 73.6 GB: --gpu-memory-utilization 0.92 tells vLLM to reserve 92% of the GPU's total memory, not just what the model weights need.

🫠 Thanks! I had assumed gpu-memory-utilization set the share of inference memory allocated to CUDA graphs; I misunderstood it.

davidjia1972 commented 6 months ago

> It is possible: a 24 GB 4090 can start it with max-model-len=4000, though it occasionally glitches and outputs nothing but exclamation marks. You can give it a try.

Sigh, on my side any question that is even slightly complex yields a pile of exclamation marks. With 14B Int4 everything works fine.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!