Closed. evanarlian closed this issue 1 month ago.
There was a fix for the HTTP client in async mode in 24.07 that resolved some unintended blocking that throttled requests. In most cases the requests were throttled close to the maximum throughput, but not always, depending on the type of model.
Note that this means once r24.07 goes out, requests are sent almost immediately, so it's possible to go beyond what your system/model can handle, thereby driving up latency. This will be noted in the release notes. You can try building the client now, or try 24.07 once it is released soon.
Hi, thank you for the response. I saw the new 24.07 Triton containers on NGC. I looked inside and the tritonclient version is 2.48, matching the latest version on PyPI. Is that the new tritonclient? In any case, I tried the same code snippet above (using the 24.07 server image and the 2.48 client from PyPI) and I still see the same problem.
I might have some misunderstanding of how tritonclient aio is supposed to work. My expectation is that once the asyncio task creation finishes, I should see the logs (print statements) on the Triton server almost immediately. During the ~3 second gap I mentioned above, it turns out my system memory grows, and only once the growth stops does the first request start to enter the Triton server.
If tritonclient aio cannot handle burst requests like this, what is the best way to handle them from the client side?
Have you tried adding flush=True to the print statement? It's possible the delay is due to flushing not happening immediately.
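For context, the difference is only in output buffering; a minimal illustration (hypothetical snippet, not from the issue):

```python
import sys

# When stdout is a pipe (e.g. docker logs), Python block-buffers output,
# so a plain print() can appear seconds after it actually ran.
print("done sending", flush=True)  # flushed to the terminal immediately

# Equivalent explicit form:
sys.stdout.write("done sending\n")
sys.stdout.flush()
```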
CC: @Tabrizian from the server team.
Just to make sure I'm understanding the problem clearly: the issue is that you're expecting the inferences to be executed as soon as asyncio.create_task
completes, but it looks like all the tasks are created first, then there is a delay, and only after that are the logs created. Is that correct?
I think you need to release the current thread while you're creating tasks, to unblock the event loop so it can execute them. For example, adding the line below every N requests might help with executing some of the client requests you're creating.
await asyncio.sleep(0)
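A sketch of what that could look like in the task-creation loop. Here do_infer is a hypothetical stand-in for the actual tritonclient aio infer call, and yield_every=50 is an arbitrary choice:

```python
import asyncio

async def do_infer(i: int) -> int:
    # Hypothetical stand-in for an async tritonclient infer call.
    await asyncio.sleep(0)
    return i

async def send_requests(n_tasks: int, yield_every: int = 50) -> list[int]:
    tasks = []
    for i in range(n_tasks):
        tasks.append(asyncio.create_task(do_infer(i)))
        # Yield to the event loop every `yield_every` creations so the
        # tasks created so far can start sending, instead of all of them
        # waiting for the creation loop to finish.
        if (i + 1) % yield_every == 0:
            await asyncio.sleep(0)
    return await asyncio.gather(*tasks)

print(len(asyncio.run(send_requests(500))))  # 500
```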
For flush=True, I've added it and the behavior is identical.
Now, for the asyncio issue.
the issue is that you're expecting the inferences to be executed as soon as asyncio.create_task is complete but it looks like all the tasks are created first, then there is a delay and after that logs are created is that correct?
This is mostly correct, but the logs here belong to the Triton model, not the client. Here are two examples:
500 tasks scenario:
print("done sending") executed
30 tasks scenario:
print("done sending") executed
Why does the number of tasks affect when the initial data is received by the Triton server? It looks like the Triton client "bunches" requests together and then sends them all to the Triton server at once. My original intuition was that the earliest inference requests should arrive at the server at around the same time, regardless of how many inference requests are scheduled.
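This bunching is standard asyncio behavior rather than something specific to tritonclient: asyncio.create_task only schedules a task, and nothing scheduled runs until the creating coroutine itself awaits. A minimal demonstration with no Triton involved (worker is a hypothetical placeholder):

```python
import asyncio

async def worker(i: int, started: list) -> None:
    started.append(i)  # records that this task got to run
    await asyncio.sleep(0)

async def main():
    started = []
    # Create 100 tasks in a tight loop with no await in between.
    tasks = [asyncio.create_task(worker(i, started)) for i in range(100)]
    # At this point no task body has run yet: create_task only schedules.
    created_before_any_ran = len(started) == 0
    await asyncio.gather(*tasks)  # first await: now everything runs
    return created_before_any_ran, len(started)

print(asyncio.run(main()))  # (True, 100)
```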
await asyncio.sleep(0)
works! This has solved the delay problem; now the earliest request enters the Triton server with almost no delay. But after some quick tests, this method only works if we control the task creation, e.g. the loop with asyncio.create_task(...),
right? For web-related workloads like FastAPI, I still cannot inject the sleep(0), though. Maybe I'll find another solution for this. Thank you.
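For the FastAPI case, one possible client-side approach (a sketch, not a fix endorsed in this thread) is to bound the number of in-flight inferences with an asyncio.Semaphore, so each handler awaits its turn instead of an unbounded pile-up; infer_limited, fake_infer, and the limit of 32 are all hypothetical:

```python
import asyncio

async def fake_infer(payload: int) -> int:
    # Hypothetical stand-in for the real tritonclient aio infer call.
    await asyncio.sleep(0)
    return payload

async def infer_limited(sem: asyncio.Semaphore, payload: int) -> int:
    # Each caller (e.g. a FastAPI handler) awaits here; at most
    # `max_in_flight` inferences are outstanding at any moment.
    async with sem:
        return await fake_infer(payload)

async def main(n: int, max_in_flight: int = 32) -> int:
    sem = asyncio.Semaphore(max_in_flight)
    results = await asyncio.gather(*(infer_limited(sem, i) for i in range(n)))
    return len(results)

print(asyncio.run(main(100)))  # 100
```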
Feel free to open a new GitHub issue if you are still running into problems. The thread needs to be released in order for the client to make progress and start actually sending the requests to the server.
Background
I have a simple FastAPI app that calls the Triton server using the Triton client. It works great for small, non-rapid requests, but the behavior starts to degrade when I introduce a lot of concurrent requests, such as when using JMeter. Originally I thought the bottleneck was on the Triton server, but it turns out it happens on the client (the FastAPI app). Both the Triton server and the client run on my local PC.
Reproduce attempt
Below are all components required to reproduce this problem.
Docker commands for the Triton server.
I opened 2 terminals, ran the docker commands in the first and
http_client_aio.py
in the second window. Here is what happened:
Questions