🚀 The feature, motivation and pitch
This paper might be of interest: https://arxiv.org/pdf/2305.05920.pdf
This paper improves inference serving efficiency by assigning each job a priority based on a profiled estimate of how long it will take to generate its first output token. The scheduler is a skip-join multi-level feedback queue (MLFQ): rather than always entering the highest-priority queue, a newly arrived job joins the queue whose time quantum matches its predicted first-token latency.
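For concreteness, here is a minimal Python sketch of the skip-join idea, assuming exponentially growing quanta per level. The quantum values, level count, and all names are illustrative, not the paper's or vLLM's actual implementation:

```python
from collections import deque
from dataclasses import dataclass

QUANTUM_BASE = 0.05  # seconds for the level-0 quantum; illustrative, not from the paper
NUM_LEVELS = 6       # queue i has quantum QUANTUM_BASE * 2**i

@dataclass
class Job:
    job_id: int
    predicted_first_token_time: float  # the profiler's estimate for this job
    level: int = 0                     # current MLFQ level (0 = highest priority)

class SkipJoinMLFQ:
    def __init__(self):
        self.queues = [deque() for _ in range(NUM_LEVELS)]

    def submit(self, job: Job) -> None:
        # Skip-join: instead of always entering queue 0, a new job joins the
        # first level whose quantum covers its predicted first-token latency.
        for level in range(NUM_LEVELS):
            if QUANTUM_BASE * 2 ** level >= job.predicted_first_token_time:
                job.level = level
                break
        else:
            job.level = NUM_LEVELS - 1
        self.queues[job.level].append(job)

    def next_job(self) -> Job | None:
        # Always run the head of the highest-priority non-empty queue.
        for q in self.queues:
            if q:
                return q.popleft()
        return None

    def demote(self, job: Job) -> None:
        # A job preempted at quantum expiry drops one level.
        job.level = min(job.level + 1, NUM_LEVELS - 1)
        self.queues[job.level].append(job)
```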
Trade-off: the system must maintain the KV cache not only of running jobs but also of all preempted/pending jobs. To keep GPU memory from being exhausted, it offloads the KV caches of low-priority jobs to host memory.
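A rough sketch of what offloading could look like with PyTorch tensors, assuming each job's KV cache is a list of (key, value) tensor pairs; that layout is an assumption for illustration, not vLLM's block-based cache:

```python
import torch

def offload_to_host(kv_cache: list[tuple[torch.Tensor, torch.Tensor]]):
    """Move a low-priority job's KV tensors to pinned host memory,
    freeing GPU memory for higher-priority jobs."""
    return [(k.cpu().pin_memory(), v.cpu().pin_memory()) for k, v in kv_cache]

def upload_to_gpu(kv_cache, device: str = "cuda"):
    """Bring a job's KV tensors back to the GPU before it is rescheduled.
    Pinned host memory lets the host-to-device copy run asynchronously."""
    return [(k.to(device, non_blocking=True), v.to(device, non_blocking=True))
            for k, v in kv_cache]
```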
Details: the profiler derives a job's initial priority from its predicted first-output-token time. Preempted jobs return partial responses to the client, and their priorities are periodically reset to prevent starvation. For cache management, the system proactively uploads the KV caches of jobs that are about to run back to the GPU, and heuristically predicts how many jobs will arrive in the near future. The paper also discusses distributed serving.
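As one possible reading of the anti-starvation mechanism, here is a hypothetical periodic reset on top of the SkipJoinMLFQ sketch above; STARVATION_LIMIT and the last_run_time field are made-up knobs, not values from the paper:

```python
import time

STARVATION_LIMIT = 10.0  # seconds a job may wait before promotion; illustrative

def reset_starved_jobs(scheduler: "SkipJoinMLFQ", now: float | None = None) -> None:
    """Promote any job that has waited too long back to the top queue,
    so low-priority jobs cannot starve indefinitely."""
    now = time.monotonic() if now is None else now
    for queue in scheduler.queues[1:]:
        for job in list(queue):
            # Assumes Job also tracks last_run_time (not shown in the sketch above).
            if now - getattr(job, "last_run_time", now) > STARVATION_LIMIT:
                queue.remove(job)
                job.level = 0
                scheduler.queues[0].append(job)
```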
@simon-mo Is this a feature you'd like to see implemented?
Alternatives
No response
Additional context
No response