vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: Sampler accounts for most of the time compared to prefill and decode #9788

Open zhjunqin opened 4 weeks ago

zhjunqin commented 4 weeks ago

Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

No response

🐛 Describe the bug

While profiling Qwen2-VL, I found that the sampler accounts for most of the inference time. Am I reading this correctly? Is this an issue?

(profiling screenshot)
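For reference, here is a minimal sketch of one way to capture a profile like the one above with `torch.profiler`; the model name, prompt, and sampling parameters below are placeholders, not taken from the actual run:

```python
# Sketch of profiling a vLLM generate() call; model/prompt/params are placeholders.
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")        # assumed model name
params = SamplingParams(temperature=0.8, max_tokens=128)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
) as prof:
    llm.generate(["Describe this picture."], params)

# Sort by CPU time to compare the sampler against the prefill/decode forward passes.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```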


zhjunqin commented 4 weeks ago

It seems to be triggered by sampling_metadata.skip_sampler_cpu_output. When is skip_sampler_cpu_output supposed to be set to true?

    if not sampling_metadata.skip_sampler_cpu_output:
        # GPU<->CPU sync happens here.
        # This also converts the sampler output to a Python object.
        # Return Pythonized sampler result & sampled token ids
        return get_pythonized_sample_results(
            maybe_deferred_args), sampled_token_ids_tensor
    else:
        # Defer sampler result Pythonization; return deferred
        # Pythonization args & sampled token ids
        return (
            maybe_deferred_args,
            sampled_token_ids_tensor,
        )

(screenshot)
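The comment in the snippet above points at the GPU<->CPU sync that happens when the sampled token ids are converted back into Python objects. A standalone sketch (not vLLM code) of that effect:

```python
# Standalone sketch (not vLLM code) illustrating the GPU<->CPU sync noted in
# the comment above: pulling sampled token ids back into Python blocks until
# every queued CUDA kernel has finished, so that wait time gets attributed to
# the sampling/pythonization step in a CPU-side profile.
import torch

logits = torch.randn(64, 32000, device="cuda")     # stand-in batch of logits
probs = torch.softmax(logits, dim=-1)
sampled = torch.multinomial(probs, num_samples=1)   # launched asynchronously

# .tolist() (like .cpu() or .item()) forces a device synchronization and a
# device-to-host copy before the Python list is available.
token_ids = sampled.squeeze(-1).tolist()
print(token_ids[:8])
```

Because that copy blocks until all queued kernels finish, time spent in the preceding forward pass can show up under the sampler in the trace, which may be why it looks so dominant here.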