Open zhjunqin opened 4 weeks ago
It seems to be triggered by `sampling_metadata.skip_sampler_cpu_output`. When is `skip_sampler_cpu_output` set to true?
```python
if not sampling_metadata.skip_sampler_cpu_output:
    # GPU<->CPU sync happens here.
    # This also converts the sampler output to a Python object.
    # Return Pythonized sampler result & sampled token ids
    return get_pythonized_sample_results(
        maybe_deferred_args), sampled_token_ids_tensor
else:
    # Defer sampler result Pythonization; return deferred
    # Pythonization args & sampled token ids
    return (
        maybe_deferred_args,
        sampled_token_ids_tensor,
    )
```
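For readers less familiar with this pattern, the two branches above implement eager vs. deferred Pythonization. A minimal pure-Python sketch of the same idea, with made-up names (this is not vLLM's actual API, just an illustration of the control flow):

```python
def sample(logits, skip_cpu_output):
    """Illustrative sketch: pick the argmax token per row, then either
    Pythonize the result now (eager) or hand back raw args (deferred)."""
    sampled_ids = [max(range(len(row)), key=row.__getitem__) for row in logits]
    if not skip_cpu_output:
        # Eager path: build Python objects immediately. In vLLM this is
        # where the GPU->CPU copy forces a device synchronization.
        results = [{"token_id": t} for t in sampled_ids]
        return results, sampled_ids
    # Deferred path: return the raw arguments untouched; the caller
    # Pythonizes later, so no sync is forced at this point.
    return ("deferred", sampled_ids), sampled_ids
```

For example, `sample([[0.1, 0.9]], skip_cpu_output=True)` returns the deferred tuple without building any result objects, which is the analogue of the `else` branch above.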
Your current environment
The output of `python collect_env.py`
```text
Your output of `python collect_env.py` here
```

Model Input Dumps
No response
🐛 Describe the bug
While profiling qwen2vl, I found that the sampler accounts for most of the time during inference. Is that expected, or is it an issue?
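One caveat worth checking before concluding the sampler itself is slow: GPU kernels run asynchronously, so a wall-clock profile often attributes the time of all previously queued kernels to the first operation that forces a device sync, and in the code path above that is the GPU->CPU copy inside the sampler. A pure-Python sketch of this attribution effect (the "GPU" is simulated with a sleep; none of these names come from vLLM):

```python
import time

class FakeHandle:
    """Stand-in for an async GPU result: cheap to create, but it pays
    the accumulated cost only when the host forces a sync to read it."""
    def __init__(self, values, pending_work_s):
        self._values = values
        self._pending = pending_work_s  # simulated queued-kernel time

    def tolist(self):
        # Simulated device->host sync: all queued "kernel" time is paid
        # here, so a wall-clock profiler blames this single call.
        time.sleep(self._pending)
        return self._values

handle = FakeHandle([1, 2, 3], pending_work_s=0.05)  # "launch" is instant

t0 = time.perf_counter()
tokens = handle.tolist()  # sync point: earlier async work shows up here
sync_s = time.perf_counter() - t0
print(tokens, round(sync_s, 3))
```

So a sampler-heavy profile may really be measuring upstream model kernels draining at the sync point; inserting an explicit synchronization before the sampler (e.g. `torch.cuda.synchronize()`) in a profiling run is one way to separate the two.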
Before submitting a new issue...