vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: multi-steps model_runner? #5055

Open leiwen83 opened 3 months ago

leiwen83 commented 3 months ago

🚀 The feature, motivation and pitch

Currently, GPUExecutorAsync's execute_model_async uses make_async, which introduces some scheduling overhead. Small models suffer more from it: a 0.5B model can lose about 20% of its time to this overhead, while a 14B-int4 model loses about 5%.
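
For reference, the make_async pattern looks roughly like the sketch below (vllm.utils has a helper along these lines; the execute_model stand-in here is purely illustrative). Every decode step goes through an event-loop / thread-pool hop, which is where the per-step scheduling cost comes from:

```python
import asyncio
from functools import partial

def make_async(func):
    """Wrap a blocking function so it runs in the default thread-pool executor."""
    def _async_wrapper(*args, **kwargs):
        loop = asyncio.get_event_loop()
        return loop.run_in_executor(None, partial(func, *args, **kwargs))
    return _async_wrapper

def execute_model(step: int) -> str:
    # Stand-in for the real blocking model_runner call.
    return f"token for step {step}"

async def main():
    execute_model_async = make_async(execute_model)
    for step in range(3):
        # One thread-pool round trip per decoded token.
        print(await execute_model_async(step))

asyncio.run(main())
```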

So I am wondering whether we could have something like a decode burst mode, where each call outputs more than a single token. As I see it, decoding needs to be stepwise for two reasons: one is the autoregressive nature of LLMs, and the other is that the KV cache is managed in blocks, so the scheduler has to get involved whenever tokens fill up a block and a new block needs to be allocated.

But if we can guarantee that all the upcoming tokens stay within the same block, maybe it is a good choice to skip the scheduler for those steps? Like the current spec_decode's multi_step_worker does, we could simply run the model_runner's execute_model several times in a row.
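
A toy sketch of the burst idea, assuming a fixed KV-cache block size and a fake model runner (none of this is vLLM code): decode steps run back-to-back, and control only returns to the scheduler when some sequence is about to need a new block:

```python
import random

BLOCK_SIZE = 16  # illustrative KV-cache block size

class FakeModelRunner:
    """Stand-in for model_runner.execute_model: appends one sampled token per sequence."""
    def execute_model(self, seqs):
        for seq in seqs:
            seq.append(random.randint(0, 31999))  # pretend-sampled token id

def free_slots_in_last_block(seq_len: int) -> int:
    """How many more tokens fit in the sequence's current KV-cache block."""
    rem = seq_len % BLOCK_SIZE
    return BLOCK_SIZE - rem if rem else 0

def burst_decode(runner, seqs, max_steps: int = 8) -> int:
    """Run up to max_steps decode steps without returning to the scheduler,
    but never past the point where any sequence would need a new block."""
    steps = min([max_steps] + [free_slots_in_last_block(len(s)) for s in seqs])
    for _ in range(steps):
        runner.execute_model(seqs)
    return steps  # the scheduler resumes (and may allocate blocks) after this

if __name__ == "__main__":
    seqs = [[1] * 30, [2] * 45]  # sequence lengths 30 and 45
    done = burst_decode(FakeModelRunner(), seqs)
    print(f"ran {done} scheduler-free decode steps")  # 2, limited by the length-30 sequence
```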

Are there any other concerns with making the model_runner multi-step?

Alternatives

No response

Additional context

No response

chenzhengda commented 3 months ago

I think that if CUDA Graphs can be integrated with the multi-decoding-step runner, it should bring even more benefit.
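
For example, here is a hedged sketch of capturing a multi-step decode burst in a single CUDA graph, using plain PyTorch and a placeholder model rather than vLLM's actual CUDA-graph runner; a single replay() then launches all captured steps with one CPU-side call:

```python
import torch

# Placeholder "decode step": a tiny linear layer standing in for the model forward.
model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.zeros(8, 1024, device="cuda")
NUM_STEPS = 4  # number of decode steps fused into one graph

# Warm up on a side stream before capture (initializes cuBLAS/cuDNN workspaces).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    out = static_input
    for _ in range(NUM_STEPS):
        out = model(out)
torch.cuda.current_stream().wait_stream(s)

# Capture NUM_STEPS forward passes into one CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = static_input
    for _ in range(NUM_STEPS):
        static_output = model(static_output)

# Replay: copy new data into the static input buffer, then one launch runs all steps.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
print(static_output.shape)
```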