triton-inference-server / client

Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.
BSD 3-Clause "New" or "Revised" License

Delayed aio infer during burst requests #733

Closed evanarlian closed 1 month ago

evanarlian commented 3 months ago

Background

I have a simple FastAPI app that calls the Triton server using the Triton client. It works great for small, infrequent requests, but the behavior starts to degrade when I introduce many concurrent requests, such as when load testing with JMeter. Originally I thought the bottleneck was in my Triton server, but it turns out it happens on the client side (the FastAPI app). Both the Triton server and the client run on my local PC.

Reproduce attempt

Below are all components required to reproduce this problem.

# Repo structure
.
├── model_repository
│   └── dummy
│       ├── 1
│       │   └── model.py
│       └── config.pbtxt
<snip>
# config.pbtxt

backend: "python"
max_batch_size: 10

dynamic_batching {}

input [
  {
    name: "image"
    data_type: TYPE_FP32
    dims: [3, 1000, 1000]
  }
]

output [
  {
    name: "result"
    data_type: TYPE_INT32
    dims: [5]
  }
]
# model.py
import time

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        n = len(requests)
        print(f"received {n} requests, at {time.time()}", flush=True)
        # always return dummy tensor
        responses = []
        for i in range(n):
            responses.append(
                pb_utils.InferenceResponse(
                    [pb_utils.Tensor("result", np.arange(5, dtype=np.int32))]
                )
            )
        return responses
# http_client_aio.py
import asyncio

import numpy as np
import tritonclient.http.aio as httpclient

async def main():
    client = httpclient.InferenceServerClient(url="localhost:8000")
    fake_image = np.random.random((1, 3, 1000, 1000)).astype(np.float32)

    tasks = []
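    # create 500 inference tasks up front; none of them run until this coroutine yields control to the event loop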
    for i in range(500):
        fake_input = httpclient.InferInput("image", fake_image.shape, "FP32")
        fake_input.set_data_from_numpy(fake_image)
        task = asyncio.create_task(client.infer("dummy", [fake_input]))
        tasks.append(task)
    print("done sending")

    await asyncio.gather(*tasks)
    await client.close()

if __name__ == "__main__":
    asyncio.run(main())

Docker commands for Triton server.

docker run --shm-size=1gb --rm -p8000:8000 -p8001:8001 -p8002:8002 -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:23.12-py3 tritonserver --model-repository=/models

I opened two terminals, ran the Docker command in the first and http_client_aio.py in the second. Here is what happened:

  1. Run http_client_aio.py.
  2. "done sending" is printed after ~2 secs.
  3. There is a ~3 sec delay during which nothing seems to happen.
  4. Logs in the Triton server (the TritonPythonModel print statements) start rolling in.

Questions

  1. Am I missing something?
  2. What is happening during that ~3 sec delay?
  3. Can I saturate the Triton server using just the aio client with a relatively big inference payload (I use a 1000x1000 image resolution)?
  4. What are the best practices for handling this bottleneck?
dyastremsky commented 2 months ago

There was a fix for the HTTP client in async mode in 24.07 that resolved some unintended blocking that throttled requests. In most cases, the throttling still kept requests close to the maximum throughput, but not always for some types of models.

Note that this means once r24.07 goes out, requests get sent almost immediately, so it is possible to go beyond what your system/model can handle, thereby driving up latency. This will be published in the release notes. You can try building the client now or try out 24.07 once it is released soon.

evanarlian commented 2 months ago

Hi, thank you for the response. I saw the new 24.07 Triton containers on NGC. I looked inside and the tritonclient version is 2.48, matching the latest version from PyPI. Is that the new tritonclient? Anyway, I tried the same code snippet above (using the 24.07 server image and the 2.48 client from PyPI) and I still see the same problem.

I might have some misunderstanding of how the tritonclient aio client is supposed to work. The hope is that once the asyncio task creation finishes, I should see the logs (print statements) on the Triton server almost immediately. During the ~3 sec gap I mentioned above, it turns out I can see my system memory grow, and once the growth stops, the first requests start to enter the Triton server.

If the tritonclient aio client cannot handle burst requests like this, what is the best way to handle them from the client side?

dyastremsky commented 2 months ago

Have you tried adding flush=True to the print statement? It's possible the delay is due to flushing not happening immediately.

CC: @Tabrizian from the server team.

Tabrizian commented 2 months ago

Just to make sure I'm understanding the problem correctly: the issue is that you're expecting the inferences to be executed as soon as asyncio.create_task completes, but it looks like all the tasks are created first, then there is a delay, and only after that do the logs appear. Is that correct?

I think it might be that you need to release the current thread while you're creating tasks, to unblock the tasks that are waiting to execute. For example, adding the line below every N requests might help with executing some of the client requests you're creating.

await asyncio.sleep(0)
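
For instance, a minimal sketch of that suggestion applied to the reproduction loop above (yielding every 50 tasks; 50 is an arbitrary choice, not a tuned value) could look like this:

# inside main() of http_client_aio.py -- modified task-creation loop (sketch)
tasks = []
for i in range(500):
    fake_input = httpclient.InferInput("image", fake_image.shape, "FP32")
    fake_input.set_data_from_numpy(fake_image)
    tasks.append(asyncio.create_task(client.infer("dummy", [fake_input])))
    if i % 50 == 0:
        # yield to the event loop so already-created tasks can start sending
        await asyncio.sleep(0)
print("done sending")

await asyncio.gather(*tasks)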
evanarlian commented 2 months ago

As for flush=True, I've already added it and the behavior is identical.

Now, for the asyncio issue.

the issue is that you're expecting the inferences to be executed as soon as asyncio.create_task completes, but it looks like all the tasks are created first, then there is a delay, and only after that do the logs appear. Is that correct?

This is mostly correct, but the logs here belong to the Triton model, not the client. Here are some examples:

500 tasks scenario:

  1. [client] I schedule 500 aio inference calls.
  2. [client] print("done sending") executed
  3. [server] A long wait before anything arrives
  4. [server] Triton logs start showing up

30 tasks scenario:

  1. [client] I schedule 30 aio inference calls.
  2. [client] print("done sending") executed
  3. [server] A short wait before anything arrives
  4. [server] Triton logs start showing up

Why does the number of tasks affect when the Triton server receives the initial data? It looks like the Triton client "bunches up" requests and then sends them all to the Triton server at once. My original intuition was that the earliest inference requests should arrive at the server at around the same time, regardless of how many inference requests are scheduled.

evanarlian commented 2 months ago

await asyncio.sleep(0) works! This has solved the delay problem; now the earliest requests enter the Triton server with almost no delay. But after some quick tests, this method only works when we control the task creation, e.g. the loop with asyncio.create_task(...), right? For web workloads like FastAPI, I still cannot inject the sleep(0) though. Maybe I'll find another solution for this. Thank you.
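
To illustrate the FastAPI case, here is a rough sketch of what the app side looks like (the endpoint name and client setup are illustrative, not my actual app); each request handler runs as its own coroutine, so there is no single task-creation loop to drop a sleep(0) into:

# fastapi_app.py (illustrative sketch only)
import numpy as np
import tritonclient.http.aio as httpclient
from fastapi import FastAPI

app = FastAPI()
client = None

@app.on_event("startup")
async def startup():
    # create the aio client once the event loop is running
    global client
    client = httpclient.InferenceServerClient(url="localhost:8000")

@app.post("/predict")
async def predict():
    fake_image = np.random.random((1, 3, 1000, 1000)).astype(np.float32)
    infer_input = httpclient.InferInput("image", fake_image.shape, "FP32")
    infer_input.set_data_from_numpy(fake_image)
    # each incoming HTTP request awaits its own inference call
    result = await client.infer("dummy", [infer_input])
    return {"result": result.as_numpy("result").tolist()}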

Tabrizian commented 1 month ago

Feel free to open a new GitHub issue if you are still running into issues. The thread needs to be released in order for the client to make progress and start actually sending the requests to the server.
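
For burst traffic where you cannot yield manually, one general asyncio pattern (a sketch of a common approach, not something specific to the Triton client) is to cap in-flight requests with an asyncio.Semaphore; awaiting the semaphore releases control to the event loop, so earlier requests can actually be sent while later ones wait:

# bounded_client_aio.py (sketch: cap in-flight inference requests with a semaphore)
import asyncio

import numpy as np
import tritonclient.http.aio as httpclient

async def bounded_infer(client, sem, fake_image):
    async with sem:
        # building the input inside the bounded task also limits memory growth
        infer_input = httpclient.InferInput("image", fake_image.shape, "FP32")
        infer_input.set_data_from_numpy(fake_image)
        return await client.infer("dummy", [infer_input])

async def main():
    client = httpclient.InferenceServerClient(url="localhost:8000")
    sem = asyncio.Semaphore(32)  # arbitrary cap on concurrent requests
    fake_image = np.random.random((1, 3, 1000, 1000)).astype(np.float32)
    tasks = [
        asyncio.create_task(bounded_infer(client, sem, fake_image))
        for _ in range(500)
    ]
    await asyncio.gather(*tasks)
    await client.close()

if __name__ == "__main__":
    asyncio.run(main())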