vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: Sampler accounts for most of the time compared to prefill and decode #9788

Open zhjunqin opened 4 weeks ago

zhjunqin commented 4 weeks ago

Your current environment

The output of `python collect_env.py`

```text
Your output of `python collect_env.py` here
```

Model Input Dumps

No response

🐛 Describe the bug

While profiling Qwen2-VL, I found that the sampler accounts for most of the inference time. Am I reading this correctly? Is this an issue?

(profiling screenshot)
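For reference, here is a minimal sketch of one way to capture a profile like the one above with `torch.profiler`; the model name, prompt, and sampling parameters below are placeholders, not taken from the actual run:

```python
# Sketch of profiling a vLLM generate() call; model/prompt/params are placeholders.
import torch
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct")        # assumed model name
params = SamplingParams(temperature=0.8, max_tokens=128)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
) as prof:
    llm.generate(["Describe this picture."], params)

# Sort by CPU time to compare the sampler against the prefill/decode forward passes.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```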


zhjunqin commented 4 weeks ago

It seems to be triggered by sampling_metadata.skip_sampler_cpu_output. When is skip_sampler_cpu_output supposed to be set to true?

    if not sampling_metadata.skip_sampler_cpu_output:
        # GPU<->CPU sync happens here.
        # This also converts the sampler output to a Python object.
        # Return Pythonized sampler result & sampled token ids
        return get_pythonized_sample_results(
            maybe_deferred_args), sampled_token_ids_tensor
    else:
        # Defer sampler result Pythonization; return deferred
        # Pythonization args & sampled token ids
        return (
            maybe_deferred_args,
            sampled_token_ids_tensor,
        )

(screenshot)
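The comment in the snippet above points at the GPU<->CPU sync that happens when the sampled token ids are converted back into Python objects. A standalone sketch (not vLLM code) of that effect:

```python
# Standalone sketch (not vLLM code) illustrating the GPU<->CPU sync noted in
# the comment above: pulling sampled token ids back into Python blocks until
# every queued CUDA kernel has finished, so that wait time gets attributed to
# the sampling/pythonization step in a CPU-side profile.
import torch

logits = torch.randn(64, 32000, device="cuda")     # stand-in batch of logits
probs = torch.softmax(logits, dim=-1)
sampled = torch.multinomial(probs, num_samples=1)   # launched asynchronously

# .tolist() (like .cpu() or .item()) forces a device synchronization and a
# device-to-host copy before the Python list is available.
token_ids = sampled.squeeze(-1).tolist()
print(token_ids[:8])
```

Because that copy blocks until all queued kernels finish, time spent in the preceding forward pass can show up under the sampler in the trace, which may be why it looks so dominant here.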