ray-project / llmperf

LLMPerf is a library for validating and benchmarking LLMs
Apache License 2.0

Subsequent requests cannot be sent until 'num_concurrent_requests' requests have all finished #56

Open llsj14 opened 1 week ago

llsj14 commented 1 week ago

Hello,

I've encountered an issue where the request launcher does not send subsequent requests until all of the requests specified by num_concurrent_requests have finished.

This behavior makes it hard to benchmark TTFT and throughput accurately on continuous batching systems, because subsequent requests are held back even when the serving system is capable of handling them.

To address this, I believe the get_next_ready function should be modified as follows, so that it returns results as soon as each individual request completes:

--- a/src/llmperf/requests_launcher.py
+++ b/src/llmperf/requests_launcher.py
@@ -40,6 +40,7 @@ class RequestsLauncher:
         if not block:
             while self._llm_client_pool.has_next():
                 results.append(self._llm_client_pool.get_next_unordered())
+                return results
         else:
             while not self._llm_client_pool.has_next():
                 pass
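
For review context, here is roughly what get_next_ready would look like with the change applied. This is a sketch reconstructed from the diff context above, so the surrounding details may not match the actual file exactly:

```python
# Sketch of RequestsLauncher.get_next_ready with the proposed change applied
# (reconstructed from the diff context, not copied verbatim from the repo).
def get_next_ready(self, block: bool = False) -> list:
    results = []
    if not block:
        while self._llm_client_pool.has_next():
            results.append(self._llm_client_pool.get_next_unordered())
            # Proposed change: hand back the first completed result immediately
            # instead of draining every outstanding request first.
            return results
    else:
        # Busy-wait until at least one request has finished, then drain the pool.
        while not self._llm_client_pool.has_next():
            pass
        while self._llm_client_pool.has_next():
            results.append(self._llm_client_pool.get_next_unordered())
    return results
```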

I am prepared to submit a pull request with this change and would appreciate your feedback.

Thank you.

llsj14 commented 1 week ago

I found that the previous revision isn't sufficient to solve the problem. To send a new request asynchronously as soon as a previous one finishes, several parts of the code need to change. I attempted to make the get_next_ready function asynchronous, but it depends on ray.util.ActorPool.get_next_unordered(), and converting it to an asynchronous function is challenging because that call is implemented as a blocking operation.

Here is the link to the relevant code: ray/util/actor_pool.py lines 311-326.
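
If I read actor_pool.py correctly, get_next_unordered() waits (via ray.wait) until the next task finishes, so even the "non-blocking" branch of get_next_ready can stall while requests are in flight. For illustration only, a non-blocking alternative over raw ObjectRefs could look something like the sketch below; poll_finished and pending are hypothetical names, not part of llmperf or the actor pool API:

```python
import ray

def poll_finished(pending):
    """Hypothetical helper: collect finished results without waiting.

    `pending` is a list of Ray ObjectRefs for in-flight requests.
    ray.wait with timeout=0 returns immediately with whatever is already done.
    """
    if not pending:
        return [], []
    ready, not_ready = ray.wait(pending, num_returns=len(pending), timeout=0)
    return [ray.get(ref) for ref in ready], not_ready
```

Something along these lines would have to replace the pool's blocking call, which is why the change touches more than get_next_ready.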

I think there are two potential approaches for change:

ashutoshsaboo commented 2 days ago

Hey @llsj14, I'm facing the same issue. Without the ability to keep issuing concurrent requests at a set rate, it's no longer a proper load-testing framework. Do you have plans to fix this?
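
For what it's worth, by "a set rate" I mean open-loop issuance: requests start on a fixed schedule regardless of whether earlier ones have finished. A minimal, hypothetical sketch of that pattern is below; send_request stands in for whatever actually issues one request, and none of these names come from llmperf:

```python
import threading
import time

def run_at_fixed_rate(send_request, requests_per_second: float, duration_s: float) -> None:
    """Start `send_request` at a fixed rate, independent of completions (open loop)."""
    interval = 1.0 / requests_per_second
    deadline = time.monotonic() + duration_s
    next_start = time.monotonic()
    while time.monotonic() < deadline:
        # Each request runs in its own thread, so a slow response never delays
        # the start of the next one.
        threading.Thread(target=send_request, daemon=True).start()
        next_start += interval
        time.sleep(max(0.0, next_start - time.monotonic()))
```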