candyzone opened this issue 2 months ago
> what do you mean by `model weight loading`? during inference, model weights are already in GPU.
`model weight loading` is the model initialization stage (in code); the overlap is between loading the model weights and executing the model for the first time (the prefill stage).
The idea is illustrated below.
@youkaichao what do you think about this idea? It needs some work on the vLLM engine to extend the model loader.
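A minimal sketch of the overlap idea, not the original illustration and not vLLM code: all names here (`prefill_while_loading`, `host_weights`, etc.) are hypothetical. It assumes the layers are already GPU-resident modules with allocated but unfilled parameters, and that the per-layer weight dicts have already been read from disk into pinned CPU memory. The weights for layer i+1 are copied host-to-device on a side CUDA stream while the prefill forward pass executes layer i:

```python
import torch

def prefill_while_loading(layers, host_weights, hidden):
    # layers: GPU-resident nn.Modules whose parameters are allocated but
    # not yet filled; host_weights: per-layer dicts of pinned CPU tensors.
    copy_stream = torch.cuda.Stream()
    copied = [torch.cuda.Event() for _ in layers]

    def copy_layer(i):
        # Run H2D copies on a side stream so they overlap with compute.
        with torch.cuda.stream(copy_stream):
            for name, param in layers[i].named_parameters():
                param.data.copy_(host_weights[i][name], non_blocking=True)
            copied[i].record(copy_stream)

    copy_layer(0)
    for i, layer in enumerate(layers):
        if i + 1 < len(layers):
            copy_layer(i + 1)  # prefetch the next layer's weights
        # Compute must wait until this layer's weights have landed on GPU.
        torch.cuda.current_stream().wait_event(copied[i])
        hidden = layer(hidden)  # layer i's prefill overlaps layer i+1's copy
    return hidden
```

With this kind of pipelining, the first prefill can start as soon as the first layer's weights arrive, instead of after the whole checkpoint has been copied.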
> I still don't get it. What do you mean by `loading model weight`? from where to where?
`loading model weight` means loading the model parameters from disk to GPU. That is a prerequisite for model inference.
> we don't store weights/kv cache on disk.
At the beginning, the model is stored on disk. The first step is reading the model (safetensors/pt) into DRAM and then cudaMemcpy-ing it to the GPU; the second step is running LLM inference (the prefill/decode stages). By `loading model weight` I mean the first step, as sketched below.
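A minimal sketch of that first step, assuming a single-file checkpoint at a hypothetical path `model.safetensors` (the function name and path are illustrative, not vLLM's loader):

```python
import torch
from safetensors import safe_open

def load_weights_to_gpu(path: str, device: str = "cuda") -> dict:
    gpu_tensors = {}
    with safe_open(path, framework="pt", device="cpu") as f:
        for name in f.keys():
            # Step 1: disk -> DRAM. pin_memory() returns a page-locked copy
            # so the following H2D copy can run asynchronously via DMA.
            host_tensor = f.get_tensor(name).pin_memory()
            # Step 2: DRAM -> GPU (a cudaMemcpyAsync under the hood).
            gpu_tensors[name] = host_tensor.to(device, non_blocking=True)
    torch.cuda.synchronize()  # wait for all pending H2D copies to finish
    return gpu_tensors

weights = load_weights_to_gpu("model.safetensors")  # hypothetical path
```

For a multi-gigabyte checkpoint, both steps take seconds, which is exactly the window the feature proposes to overlap with prefill.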
🚀 The feature, motivation and pitch
For LLM inference, requests per second (QPS) is not constant, so vLLM engines need to be launched on demand. For elastic instances, it is important to reduce TTFT (Time to First Token). Hence, it is necessary to overlap model loading and prefill, especially for very large models, where model loading costs several seconds.
Alternatives
Additional context
No response