vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: _run_workers_async function of DistributedGPUExecutorAsync #6400

Open HMJW opened 2 months ago

HMJW commented 2 months ago

I am confused about why the `_run_workers_async` function of `DistributedGPUExecutorAsync` was removed in v0.4.3.

The new implementation starts a loop for every worker, which prevents the workers from doing other things, such as transferring the KV cache in prefill/decode disaggregation. I previously used `_run_workers_async` to transfer the KV cache without any problems, but now the transfer only executes once the workers' loops have stopped.

I am sorry, but I am not familiar with asyncio in Python. I want to know what the benefits of the new implementation are, and how to allow the workers to transfer the KV cache asynchronously during generation.
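To illustrate the scheduling difference being described, here is a minimal asyncio sketch (not vLLM code; all names are illustrative): a worker that loops without awaiting keeps the event loop busy, so a concurrent coroutine such as a KV-cache transfer only runs after the loop ends, whereas per-call dispatch yields between steps and lets the transfer interleave.

```python
import asyncio

async def kv_transfer(log):
    # Hypothetical stand-in for an asynchronous KV-cache transfer task.
    log.append("kv_transfer")

async def tight_loop(log, steps):
    # Simplified "execution loop" pattern: no await inside the loop,
    # so no other coroutine can run until the loop finishes.
    for i in range(steps):
        log.append(f"step_{i}")

async def per_call_dispatch(log, steps):
    # Simplified per-call pattern: awaiting between steps yields
    # control to the event loop, letting other tasks interleave.
    for i in range(steps):
        log.append(f"step_{i}")
        await asyncio.sleep(0)  # yield to the event loop

async def run(worker):
    log = []
    task = asyncio.create_task(kv_transfer(log))
    await worker(log, 2)
    await task
    return log

print(asyncio.run(run(tight_loop)))         # ['step_0', 'step_1', 'kv_transfer']
print(asyncio.run(run(per_call_dispatch)))  # ['step_0', 'kv_transfer', 'step_1']
```

With the tight loop, the transfer is deferred until after both steps; with per-call dispatch, it runs between steps, which matches the behavior difference described above.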

youkaichao commented 2 months ago

This should be related to https://github.com/vllm-project/vllm/pull/4894 .