vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Mixtral GPTQ Long Prompts exceed capacity of block_manager #2198

Closed: muc-martin closed this issue 5 months ago

muc-martin commented 8 months ago

I loaded Mixtral TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ in vLLM successfully. Everything works for short prompts, but longer prompts lead to this error:

WARNING 12-19 10:21:05 scheduler.py:161] Input prompt (1217 tokens) is too long and exceeds the capacity of block_manager
INFO 12-19 10:21:05 async_llm_engine.py:111] Finished request cmpl-b8c7b3b2eb7a4de488ef870004648708.

This is the command I use to start the model:

python -m vllm.entrypoints.openai.api_server --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --quantization gptq --dtype float16 --max-model-len 4096 --gpu-memory-utilization 0.99

The model starts with:

INFO 12-19 10:20:40 api_server.py:719] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], block_size=16, chat_template=None, disable_log_requests=False, disable_log_stats=False, download_dir=None, dtype='float16', enforce_eager=False, engine_use_ray=False, gpu_memory_utilization=0.99, host=None, load_format='auto', max_context_len_to_capture=8192, max_log_len=None, max_model_len=4096, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, max_parallel_loading_workers=None, model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', pipeline_parallel_size=1, port=8000, quantization='gptq', response_role='assistant', revision=None, seed=0, served_model_name=None, swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_revision=None, trust_remote_code=False, worker_use_ray=False)
WARNING 12-19 10:20:40 config.py:467] Casting torch.bfloat16 to torch.float16.
WARNING 12-19 10:20:40 config.py:179] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 12-19 10:20:40 config.py:191] gptq does not support CUDA graph yet. Disabling CUDA graph.
INFO 12-19 10:20:40 llm_engine.py:73] Initializing an LLM engine with config: model='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer='TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=gptq, enforce_eager=True, seed=0)
INFO 12-19 10:20:55 llm_engine.py:223] # GPU blocks: 62, # CPU blocks: 2048
INFO 12-19 10:20:57 api_server.py:113] Using default chat template:
INFO 12-19 10:20:57 api_server.py:113] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
INFO:     Started server process [3268]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Running on an RTX 3090.

xiong0827 commented 8 months ago

+1

carlosandrea commented 8 months ago

From my understanding, your prompt token size cannot exceed gpu_blocks × block_size. For example, running TheBloke/Phind-CodeLlama-34B-v2-GPTQ on an RTX 4090: max_prompt_length = block_size (16) × gpu_blocks (123) = 1968 tokens. The number of GPU blocks can be increased by raising gpu-memory-utilization.
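As a rough sketch of that arithmetic (the formula prompt capacity ≈ gpu_blocks × block_size is the commenter's reading, not something confirmed elsewhere in this thread), the numbers reported above work out as follows:

# Illustrative only: assumes prompt capacity is simply gpu_blocks * block_size.
def max_prompt_tokens(gpu_blocks: int, block_size: int) -> int:
    """Approximate number of prompt tokens the KV cache can hold."""
    return gpu_blocks * block_size

print(max_prompt_tokens(62, 16))   # 992  -> the 1217-token prompt in the report does not fit
print(max_prompt_tokens(123, 16))  # 1968 -> the RTX 4090 / Phind-CodeLlama example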

DaBossCoda commented 8 months ago

Getting the same error with a Yi AWQ model.

Wimeremce7 commented 5 months ago

Increasing the value of gpu-memory-utilization solved this problem for me.
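For completeness, a minimal sketch of adjusting the same knob through vLLM's offline LLM API; the 0.95 value is only a placeholder, and the server-side equivalent is the --gpu-memory-utilization flag used in the command at the top of the thread:

# Hypothetical example: raising gpu_memory_utilization leaves more GPU memory
# for KV-cache blocks, which is what the block_manager warning is about.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ",
    quantization="gptq",
    dtype="float16",
    max_model_len=4096,
    gpu_memory_utilization=0.95,  # placeholder value; tune for your GPU
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)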