vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Overlap model weight loading and model prefill #7690

Open candyzone opened 2 months ago

candyzone commented 2 months ago

🚀 The feature, motivation and pitch

For LLM inference, requests per second (QPS) is not constant, so the vLLM engine needs to be launched on demand. For elastic instances, it is important to reduce TTFT (Time to First Token). Hence, it is necessary to overlap model loading and prefill, especially for very large models where model loading takes several seconds.

Alternatives

  1. A model cache system for vLLM with efficient IO (e.g. over PCIe or NVLink)
  2. Load model weights in the LLM's topological (layer) order and execute each part of the model as soon as its parameters are ready (a rough sketch follows this list)
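
Roughly, option 2 could look like the sketch below (all class and method names are hypothetical, not vLLM's loader API): a background thread copies weights to the GPU in layer order, and the first forward pass waits on a per-layer event so each layer runs as soon as its parameters are resident.

```python
# Minimal sketch (not vLLM code) of per-layer loading overlapped with the
# first forward pass. All names here are hypothetical.
import threading
import torch

class OverlappedLoader:
    def __init__(self, layers, cpu_state_dicts):
        self.layers = layers                      # nn.Module list, params still on CPU
        self.cpu_state_dicts = cpu_state_dicts    # per-layer state dicts read from disk
        self.ready = [threading.Event() for _ in layers]

    def start(self):
        threading.Thread(target=self._load_all, daemon=True).start()

    def _load_all(self):
        for i, layer in enumerate(self.layers):
            layer.load_state_dict(self.cpu_state_dicts[i])
            layer.to("cuda")                      # DRAM -> GPU copy for this layer only
            self.ready[i].set()                   # unblock the forward pass for layer i

    def forward(self, hidden_states):
        for i, layer in enumerate(self.layers):
            self.ready[i].wait()                  # block only if this layer isn't loaded yet
            hidden_states = layer(hidden_states)  # compute layer i while layer i+1.. still copy
        return hidden_states
```

In a real integration the per-layer copies would also want pinned host memory and a dedicated copy stream so the cudaMemcpy truly overlaps with compute rather than serializing on the default stream.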

Additional context

No response

youkaichao commented 2 months ago

What do you mean by model weight loading? During inference, model weights are already in GPU memory.

candyzone commented 2 months ago

What do you mean by model weight loading? During inference, model weights are already in GPU memory.

Model weight loading is the model initialization stage (code); the overlap is between loading the model weights and executing the model for the first time (the prefill stage).

The idea is illustrated in the attached diagram ("LLM fast loading").
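
As a hypothetical illustration (the numbers are made up): if loading the weights takes 8 s and the first prefill takes 2 s, the sequential path gives TTFT ≈ 8 s + 2 s = 10 s, while overlapping per-layer loading with prefill lets TTFT approach roughly 8 s, since early layers can already compute while later layers are still being copied.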

candyzone commented 2 months ago

@youkaichao what do you think about this idea? It needs some work on the vLLM engine to extend the model loader.

youkaichao commented 2 months ago

I still don't get it. What do you mean by loading model weights? From where to where?

candyzone commented 2 months ago

I still don't get it. What do you mean by loading model weights? From where to where?

Loading model weights means loading the model parameters from disk to the GPU. That is a prerequisite for model inference.

youkaichao commented 2 months ago

We don't store weights or the KV cache on disk.

candyzone commented 2 months ago

We don't store weights or the KV cache on disk.

At the beginning, the model is stored on disk. The first step is reading the model (safetensors/pt) into DRAM and then cudaMemcpy-ing it to the GPU; the second step is running inference on the LLM (prefill/decode stages). By loading model weights, I mean the first step.
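
For concreteness, a minimal sketch of the two steps being described (assuming a safetensors checkpoint; the file name is hypothetical and this is not vLLM's actual loader):

```python
# Step 1 is what this issue calls "model weight loading"; step 2 is the
# first inference (prefill). File name is hypothetical.
import torch
from safetensors.torch import load_file

# Step 1a: disk -> DRAM
cpu_weights = load_file("model.safetensors")

# Step 1b: DRAM -> GPU (cudaMemcpy under the hood). Pinning host memory lets
# the copy run asynchronously, which is what makes overlap with prefill possible.
gpu_weights = {
    name: tensor.pin_memory().to("cuda", non_blocking=True)
    for name, tensor in cpu_weights.items()
}
torch.cuda.synchronize()

# Step 2: run the model's first forward pass (prefill) using gpu_weights ...
```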