vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: vllm with gemma7b still slow #4966

Open adogwangwang opened 1 month ago

adogwangwang commented 1 month ago

Your current environment

Python 3.11, vLLM 0.4.1, torch 2.2.1+cu118

🐛 Describe the bug

Here is my vLLM log when running inference with Gemma-7B. A single request produces the six log lines below and takes about 30 seconds. Why is it so slow?

INFO 05-22 02:20:59 async_llm_engine.py:524] Received request 10390bd4d2dc4936bd1a62e5793a4fd8: prompt: '<bos><start_of_turn>user\n请讲一下python语言的特点<end_of_turn>\n<start_of_turn>model\n', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.8, top_p=0.8, top_k=1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[1], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: None, lora_request: None.
INFO 05-22 02:20:59 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:04 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:09 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.6%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:14 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:19 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.2%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:24 metrics.py:229] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 13.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 1.5%, CPU KV cache usage: 0.0%
INFO 05-22 02:21:29 async_llm_engine.py:120] Finished request 10390bd4d2dc4936bd1a62e5793a4fd8.

I use Python 3.11, vLLM 0.4.1, and torch 2.2.1+cu118.

# Missing pieces reconstructed for readability: the imports, and the stop /
# stop_token_ids values taken from the request log above. `engine`,
# `tokenizer` and `detokenizer` are assumed to be initialized elsewhere
# in the server.
import json
from typing import AsyncGenerator

from vllm import SamplingParams
from vllm.utils import random_uuid

prompt = '请讲解一下python语言的特点'  # "Please explain the characteristics of the Python language"
top_p = 0.8
top_k = 1
temperature = 0.8
max_length = 4096       # passed to max_tokens below, so up to 4096 tokens may be generated
stream = True
stop = []               # per the request log: stop=[]
stop_token_ids = [1]    # per the request log: stop_token_ids=[1]

sampling_params = SamplingParams(top_k=top_k,
                                 stop_token_ids=stop_token_ids,
                                 stop=stop,
                                 top_p=top_p,
                                 temperature=temperature,
                                 max_tokens=max_length)

# Wrap the raw prompt in Gemma's chat template.
chat = [
    {"role": "user", "content": prompt},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# input_token_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").tolist()
request_id = random_uuid()

results_generator = engine.generate(prompt, sampling_params, request_id)

# Streaming case: decode the newest token of the last sequence and emit it as JSON.
async def stream_results() -> AsyncGenerator[bytes, None]:
    async for request_output in results_generator:
        outputs = request_output.outputs
        token_id = outputs[-1].token_ids[-1]
        _, token, _, _ = detokenizer.detokenize_incrementally(
            tokenizer, prefix_offset=0, read_offset=0,
            all_input_ids=[token_id], prev_tokens=None)
        ret = {"text": token}
        yield json.dumps(ret) + "\r\r"

# if stream:
#     return StreamingResponse(stream_results())

# Non-streaming case: drain the generator and keep only the final output.
final_output = None
async for request_output in results_generator:
    final_output = request_output

assert final_output is not None
prompt = final_output.prompt
# text_outputs = [prompt + output.text for output in final_output.outputs]  # variant that prepends the prompt
text_outputs = [output.text for output in final_output.outputs]
ret = {"text": text_outputs}
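
As an aside, each RequestOutput already carries the full text decoded so far in outputs[i].text, so a streaming handler can slice off the new delta instead of re-detokenizing single token ids by hand. A minimal sketch, assuming the same engine, prompt, and sampling_params as above:

# Simpler streaming variant: stream only the characters not yet sent.
async def stream_results_simple() -> AsyncGenerator[bytes, None]:
    sent = 0  # number of characters already streamed to the client
    async for request_output in engine.generate(prompt, sampling_params, random_uuid()):
        text = request_output.outputs[0].text  # full generation so far
        delta, sent = text[sent:], len(text)
        if delta:
            yield (json.dumps({"text": delta}) + "\n").encode("utf-8")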
mgoin commented 1 month ago

Hi @adogwangwang, the request is taking a long time because you are generating many tokens due to max_length = 4096. This setting doesn't control the context length of the model; it is a request-level parameter that decides how many tokens to generate. It is likely you are generating ~4000 tokens, which takes a long time for any engine. If you only want to generate a small number of tokens, please set max_length to that number.
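
For example, capping the request at a few hundred new tokens (and, for Gemma's chat format, also stopping on the <end_of_turn> marker) keeps per-request latency low. A minimal sketch of the adjusted SamplingParams, reusing the variables from the snippet above; the 256-token cap is just an illustrative value:

sampling_params = SamplingParams(top_k=top_k,
                                 top_p=top_p,
                                 temperature=temperature,
                                 stop_token_ids=stop_token_ids,  # [1], as in the request log
                                 stop=["<end_of_turn>"],         # end-of-turn marker in Gemma's chat template
                                 max_tokens=256)                 # generate at most 256 new tokens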