vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: The throughput computation in metrics.py seems wrong #10261

Closed Achazwl closed 1 day ago

Achazwl commented 1 day ago

Your current environment

The output of `python collect_env.py`:

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

No response

🐛 Describe the bug

It seems that the prefill throughput and the decode throughput are both divided by the overall elapsed time, i.e.,

$$\text{prefill throughput} = \frac{\text{number of input tokens}}{\text{prefill time} + \text{decode time}}, \qquad \text{decode throughput} = \frac{\text{number of output tokens}}{\text{prefill time} + \text{decode time}},$$

but they should be

$$\text{prefill throughput} = \frac{\text{number of input tokens}}{\text{prefill time}}, \qquad \text{decode throughput} = \frac{\text{number of output tokens}}{\text{decode time}}.$$

This significantly distorts the reported performance numbers in scenarios with both long inputs and long outputs (see the numeric sketch below).
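For a concrete sense of the gap, here is a minimal sketch with made-up numbers; none of these values come from a real benchmark, they only illustrate how the shared denominator suppresses the reported prefill throughput:

```python
# Hypothetical numbers, for illustration only: 1,000 prompt tokens processed in
# 0.5 s of prefill, and 1,000 generated tokens produced over 20 s of decode.
num_prompt_tokens = 1_000
num_generation_tokens = 1_000
prefill_time = 0.5   # seconds (assumed)
decode_time = 20.0   # seconds (assumed)

total_time = prefill_time + decode_time

# Current behaviour: both throughputs share the overall elapsed time.
reported_prefill_tps = num_prompt_tokens / total_time    # ~48.8 tokens/s
# Per-phase definition: divide by the time spent in that phase only.
actual_prefill_tps = num_prompt_tokens / prefill_time    # 2000.0 tokens/s

print(reported_prefill_tps, actual_prefill_tps)
```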

See https://github.com/vllm-project/vllm/blob/main/vllm/engine/metrics.py#L440C1-L446C46 and https://github.com/vllm-project/vllm/blob/main/vllm/engine/metrics.py#L416C1-L418C59

```python
# Abridged from vllm/engine/metrics.py (the file imports numpy as np and
# typing.List): both throughputs are divided by the same wall-clock window
# (now - last_log), so prefill and decode time are mixed together.
prompt_throughput = get_throughput(self.num_prompt_tokens,
                                   now=stats.now,
                                   last_log=self.last_local_log)
generation_throughput = get_throughput(
    self.num_generation_tokens,
    now=stats.now,
    last_log=self.last_local_log)


def get_throughput(tracked_stats: List[int], now: float,
                   last_log: float) -> float:
    return float(np.sum(tracked_stats) / (now - last_log))
```
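
A minimal sketch of the per-phase version, assuming the stats object tracked separate prefill and decode durations for the logging interval; the field names `prefill_time_s` and `decode_time_s` below are hypothetical and are not part of vLLM's current `Stats`:

```python
from typing import List

import numpy as np


def get_phase_throughput(tracked_tokens: List[int],
                         phase_time_s: float) -> float:
    """Tokens per second over the time spent in a single phase only."""
    if phase_time_s <= 0:
        return 0.0
    return float(np.sum(tracked_tokens) / phase_time_s)


# Hypothetical usage: prefill_time_s / decode_time_s would have to be
# accumulated by the engine per logging interval, which it does not do today.
# prompt_throughput = get_phase_throughput(self.num_prompt_tokens,
#                                          stats.prefill_time_s)
# generation_throughput = get_phase_throughput(self.num_generation_tokens,
#                                              stats.decode_time_s)
```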

Before submitting a new issue...

Achazwl commented 1 day ago

Oh, it seems there is no better solution for batch serving: with continuous batching, prefill and decode steps from different requests are interleaved within the same logging window, so the elapsed time cannot be cleanly attributed to one phase. The related output should probably be disabled (or ignored) when using benchmarks/benchmark_latency.py.
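For single-request latency runs, one common client-side workaround is to time the two phases separately, treating time-to-first-token as the prefill time. A minimal sketch, assuming any iterator that yields one generated token at a time (this is not a vLLM API, just an illustration):

```python
import time
from typing import Iterable


def measure_phase_throughput(stream: Iterable[str],
                             num_prompt_tokens: int) -> None:
    """Split wall-clock time at the first token: before it counts as prefill,
    after it counts as decode, and report tokens/s for each phase."""
    start = time.perf_counter()
    first_token_time = None
    num_output_tokens = 0

    for _ in stream:                      # each item is one generated token
        if first_token_time is None:
            first_token_time = time.perf_counter()
        num_output_tokens += 1
    end = time.perf_counter()

    prefill_time = (first_token_time or end) - start
    decode_time = end - (first_token_time or end)

    if prefill_time > 0:
        print(f"prefill: {num_prompt_tokens / prefill_time:.1f} tok/s")
    if decode_time > 0:
        print(f"decode:  {num_output_tokens / decode_time:.1f} tok/s")
```

Note that time-to-first-token also includes sampling of the first output token and any queueing delay, so this only approximates the pure prefill time.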