Thank you for your great work! May I ask about some details on the scheduler?
The paper says: "To minimize latency penalty, we limit the prefill batch size to 1 for each batch." So if multiple requests are in the prefill stage at the same time, each one is either scheduled to a different Runner or waits in a first-come, first-served queue. Is this understanding correct? Also, has the scheduling code for Section 5.1 (Scheduling new requests) been released?
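To make my understanding concrete, here is a minimal sketch of the behavior I have in mind. It is only my guess, not the paper's implementation: the names `Runner`, `Scheduler`, `submit`, `dispatch`, and `finish_prefill` are all hypothetical, and I assume each Runner holds at most one request in prefill while the rest wait in a FIFO queue.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class Runner:
    """Hypothetical runner: at most one request in prefill at a time."""
    name: str
    prefill: Optional[str] = None  # request currently occupying the prefill slot


@dataclass
class Scheduler:
    """Hypothetical scheduler: free runner first, otherwise FCFS queue."""
    runners: List[Runner]
    waiting: Deque[str] = field(default_factory=deque)

    def submit(self, request_id: str) -> Optional[str]:
        """Enqueue a request; return the runner it was placed on, or None if it waits."""
        self.waiting.append(request_id)
        return self.dispatch().get(request_id)

    def dispatch(self) -> Dict[str, str]:
        """Give each runner with a free prefill slot the oldest waiting request."""
        placed: Dict[str, str] = {}
        for runner in self.runners:
            if not self.waiting:
                break
            if runner.prefill is None:
                request_id = self.waiting.popleft()
                runner.prefill = request_id
                placed[request_id] = runner.name
        return placed

    def finish_prefill(self, runner_name: str) -> Dict[str, str]:
        """Free the runner's prefill slot, then try to place waiting requests."""
        for runner in self.runners:
            if runner.name == runner_name:
                runner.prefill = None
        return self.dispatch()
```

With two Runners, requests `a` and `b` would go to separate Runners, and a third request `c` would wait in the queue until one of them finishes its prefill. Please correct me if the actual scheduler behaves differently.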
In Figure 2, what is the difference between a Runner and the LLMs under a Runner?
Looking forward to hearing from you~