man2machine opened this issue 5 months ago
I see there is the function `get_num_unfinished_requests`. It appears this tells us whether there are requests running, swapped, or waiting. However, I want to know whether the engine is "full", i.e. if it takes in another request, that request will be queued up and will not be processed immediately. Can I do that by using `bool(self.scheduler.waiting) or bool(self.scheduler.swapped)`?
In particular I wrote these two functions. Is this the right way to use the engine and scheduler classes or am I missing something?
And is there a way to know the maximum number of requests that can run in parallel?
```python
from typing_extensions import Self  # or typing.Self on Python 3.11+

# _AsyncLLMEngine is a private vLLM class
from vllm.engine.async_llm_engine import _AsyncLLMEngine


class _CustomAsyncLLMEngine(_AsyncLLMEngine):

    def has_requests_in_queue(self: Self) -> bool:
        # True if any requests are waiting to be scheduled or have been swapped out.
        return bool(self.scheduler.waiting) or bool(self.scheduler.swapped)

    def get_num_tokens_in_queue(self: Self) -> int:
        # Total number of prompt tokens across all queued (waiting or swapped)
        # sequence groups.
        num_tokens = sum(
            sum(
                sum(len(s.data.prompt_token_ids) for s in g.get_seqs())
                for g in queue
            )
            for queue in (self.scheduler.waiting, self.scheduler.swapped)
        )
        return num_tokens
```
`scheduler_config.max_num_seqs` will hold the maximum number of running sequences.
Curious - why do you need this info? The user should not have to know about this parameter [just curious what your use case is].
@robertgshaw2-neuralmagic I am not sure why the OP wants this info, but for me, I would like to know if the engine is at its capacity, so I can decide whether I should submit more requests.
vLLM's scheduler manages this internally, so you can send all your requests at once.
Thanks for the timely reply! Is there a limit on how many requests it can handle or queue? Or will it start rejecting requests when the limit is reached, so that the sender finds out then?
I found #2492 and #3561 are also useful for answering this question.
Apologies @timxzz @robertgshaw2-neuralmagic, I missed your earlier question about why I want to know if the vLLM scheduler is full. I would like to know this because I am writing a distributed program that runs multiple vLLM instances, and I would not like to submit to an instance that is already full. So if `scheduler_config.max_num_seqs` contains the maximum number of running sequences, can I simply check `len(self.scheduler.running) < self.scheduler_config.max_num_seqs` to see whether it is full or not?
Yes, I believe that's correct. Additionally, you can use `engine.scheduler.block_manager.get_num_free_gpu_blocks()` to determine whether the instance's memory is full or not.
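Putting those two checks together, a rough sketch could look like the following (untested; it only combines the attributes mentioned above, and `min_free_gpu_blocks` is an arbitrary illustrative threshold, not a vLLM parameter):

```python
class _CustomAsyncLLMEngine(_AsyncLLMEngine):

    def is_full(self: Self, min_free_gpu_blocks: int = 0) -> bool:
        # "Full" if the scheduler is already running max_num_seqs sequences,
        # if anything is queued (waiting or swapped out), or if the KV-cache
        # block pool is (nearly) exhausted.
        at_seq_limit = (
            len(self.scheduler.running) >= self.scheduler_config.max_num_seqs
        )
        has_queue = bool(self.scheduler.waiting) or bool(self.scheduler.swapped)
        low_on_blocks = (
            self.scheduler.block_manager.get_num_free_gpu_blocks()
            <= min_free_gpu_blocks
        )
        return at_seq_limit or has_queue or low_on_blocks
```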
I'm also interested in this for the same reason as @man2machine. While the proposed solution works, it's not ideal, as it requires constant polling from the client (the one sending requests), and it's not optimal (requests can sit in the waiting queue when they could have been processed by another, less used vLLM instance).
@llx-08 @timxzz @man2machine @robertgshaw2-neuralmagic I'm curious what your thoughts are on avoiding polling?
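For instance, one client-side option (just a sketch, not a vLLM feature; the URLs and limits below are placeholders that would mirror each instance's `max_num_seqs`) would be to cap in-flight requests per instance with an `asyncio.Semaphore`, so the sender never over-submits and never has to poll:

```python
import asyncio
import aiohttp

# Hypothetical per-instance caps; in practice these would mirror each
# instance's max_num_seqs (or a lower value, to leave headroom).
INSTANCE_LIMITS = {"http://vllm-0:8000": 256, "http://vllm-1:8000": 256}

_slots = {url: asyncio.Semaphore(n) for url, n in INSTANCE_LIMITS.items()}

async def submit(session: aiohttp.ClientSession, url: str, payload: dict) -> dict:
    # Waits (without polling) until this instance has a free slot, then sends
    # one request to its OpenAI-compatible completions endpoint.
    async with _slots[url]:
        async with session.post(f"{url}/v1/completions", json=payload) as resp:
            return await resp.json()
```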
We have the same confusion @lionelvillard @man2machine; not limiting the number of requests can lead to two problems.
Hello guys,
Is there any limitation on the number of requests in the queue? Because regardless of the number of GPUs, the count of running + pending requests never exceeds 100.
For example, I'm sending 500 concurrent requests to my API, and some of the log output is shown below:
2xA100: [log output not included]
4xA100: [log output not included]
Since the server keeps receiving requests, I would expect, at the very least, the maximum number of Running requests plus a continuously increasing number of Pending requests.
I tried to change `max_num_seqs` as explained in this and as you mentioned above, but nothing changed.
It seems I cannot utilize the full performance of 4 GPUs compared to 2 GPUs in this situation. Am I missing something?
@emirhanKural - how are you querying the server? Are you using a single OpenAI client? The OpenAI client can throttle the number of active requests it sends at a time.
Hi!
I am using the aiohttp library to send asynchronous requests, and after what you said, I checked and found that it does have a default connection limit.
Thank you for your help.
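For reference, aiohttp's `TCPConnector` defaults to 100 simultaneous connections, which matches the ~100 cap I was seeing; raising it looks roughly like this (the limit value is just an example):

```python
import asyncio
import aiohttp

async def main() -> None:
    # Raise aiohttp's default connection limit (100) so more than 100
    # requests can be in flight at the same time.
    connector = aiohttp.TCPConnector(limit=500)
    async with aiohttp.ClientSession(connector=connector) as session:
        ...  # send the 500 concurrent requests from this session

asyncio.run(main())
```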
Hello again,
I'd like to ask about an issue that is only indirectly related to the topic at hand.
In my stress tests, I've observed that 2xGPUs perform better than 4xGPUs when sending concurrent requests.
Upon checking the logs, I found that the primary reason for this seems to be the "pending to running" phase. The transition from pending to running is slow, and batching doesn't start effectively until my CUDA blocks are full, the max_num_seqs reaches the necessary count, or the pending requests drop to zero. We can see this from the "Avg generation throughput."
For example, the behavior for 100 concurrent requests is as follows:
- 2xGPU quickly starts full-performance batching as the CUDA blocks get filled;
- 4xGPU, with its significantly larger number of CUDA blocks, waits until max_num_seqs reaches 256 or pending requests drop to zero before starting.
By looking at the info timestamps, we can see that the "pending to running" process takes about 1 minute. When the number of requests increases, the delay before 4xGPUs start fully performing becomes even longer.
Am I missing something? Is there a way to speed up the Prefill step?
Given this, it seems more effective to use 4 instances with 2 GPUs each rather than a single instance with 8 GPUs in Production.
Your current environment
How would you like to use vllm
I know I can add requests to the AsyncLLMEngine using `add_request()`, but I am not sure how to find out whether the engine is already full, so that adding more requests will simply cause them to queue up, or whether it can still accept more. How can I do this? Is there also a way to get a quantity, like what % utilization the engine is at?