triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Send multiple requests to multiple tasks #6251

Open wxthu opened 1 year ago

wxthu commented 1 year ago

Description: I am building a baseline for my engineering project. I want to send multiple requests to multiple models and enable parallel execution when different models receive requests simultaneously. But when I used the example script to do that, I found that no parallel execution takes place, and the latency of the async API was noticeably longer than that of the sync API. Could you please give me some ideas? Thanks so much.

The following is my client script:

import argparse
import numpy as np
import sys
import time

import tritonclient.http as httpclient

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-v',
                        '--verbose',
                        action="store_true",
                        required=False,
                        default=False,
                        help='Enable verbose output')
    parser.add_argument('-u',
                        '--url',
                        type=str,
                        required=False,
                        default='localhost:8000',
                        help='Inference server URL. Default is localhost:8000.')

    FLAGS = parser.parse_args()

    model_names = ['vgg19', 'convnext']
    request_count = 1
    triton_clients = []
    try:
        # Need to specify large enough concurrency to issue all the
        # inference requests to the server in parallel.
        for _ in model_names:
            triton_client = httpclient.InferenceServerClient(
                url=FLAGS.url, verbose=FLAGS.verbose, concurrency=request_count)
            triton_clients.append(triton_client)

    except Exception as e:
        print("context creation failed: " + str(e))
        sys.exit()

    # Infer
    inputs = [[] for _ in range(len(model_names))]
    outputs = [[] for _ in range(len(model_names))]
    for i, ipt in enumerate(inputs):
        ipt.append(httpclient.InferInput('begin', [4, 3, 224, 224], "FP32"))

        # Create the data for the two input tensors.
        input0_data = np.zeros((4, 3, 224, 224), dtype=np.float32)

        # Initialize the data
        ipt[0].set_data_from_numpy(input0_data, binary_data=True)

    for j, opt in enumerate(outputs):
        opt.append(httpclient.InferRequestedOutput('output', binary_data=True))

    start_time = time.time()
    async_requests = []
    # Asynchronous inference call.
    # For each task
    for _ in range(100):
        for j, model_name in enumerate(model_names):
            for i in range(request_count):
                async_requests.append(
                    triton_clients[j].async_infer(model_name=model_name,
                                            inputs=inputs[j],
                                            outputs=outputs[j]))

    for async_request in async_requests:
        # Get the result from the initiated asynchronous inference request.
        # Note the call will block till the server responds.
        result = async_request.get_result()

        result = result.get_response()
    end_time = time.time()
    print("Overall async execution time: {} second".format(end_time - start_time))

Triton Information: r23.07, built it myself, TensorRT backend.

kthui commented 1 year ago

Hi @wxthu, can you share the model config.pbtxt? If a model has only one instance and dynamic batching is disabled, it could be executing sequentially.
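
For context, here is a minimal sketch (not part of the original thread) of how these two settings could be checked from the Python client, assuming a server at localhost:8000 and the two models discussed in this issue; it uses InferenceServerClient.get_model_config(), which returns the (auto-completed) model configuration as a JSON dict:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

for model_name in ["vgg19", "convnext"]:
    config = client.get_model_config(model_name)

    # Total number of model instances across all instance_group entries.
    instance_count = sum(group.get("count", 1)
                         for group in config.get("instance_group", []))

    # The "dynamic_batching" field only appears in the config when it is enabled.
    dynamic_batching = "enabled" if "dynamic_batching" in config else "disabled"

    print("{}: instances={}, dynamic batching {}".format(
        model_name, instance_count, dynamic_batching))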

wxthu commented 1 year ago

> Hi @wxthu, can you share the model config.pbtxt? If a model has only one instance and dynamic batching is disabled, it could be executing sequentially.

Thanks. The following are my configs:

name: "convnext"
backend: "tensorrt"
max_batch_size: 4
input [
  {
    name: "begin"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
    label_filename: "convnext_labels.txt"
  }
]

instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
]

name: "vgg19"
backend: "tensorrt"
max_batch_size: 4
input [
  {
    name: "begin"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [1000]
    label_filename: "vgg_labels.txt"
  }
]

instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
]

Additionally, what I would like to understand is how dynamic batching relates to multiple requests to the same model, as opposed to requests spread across multiple different models. When there are simultaneous requests for four different models, how should I enable parallel inference for the four models?

wxthu commented 1 year ago

Actually, I wonder whether parallel execution of different models is supported. If yes, how can I enable it; if not, why not? Thanks so much.

kthui commented 1 year ago

I think they run in parallel by default, since they are different models. Did you find otherwise?

wxthu commented 1 year ago

> I think they run in parallel by default, since they are different models. Did you find otherwise?

Fine, I found there are no multiple processes in Triton; let me check whether there are multiple streams. By the way, I really found it takes much longer to use the TritonClient async infer API than the sync API. I am using the async inference API to simulate concurrent requests for different models; do you think that works? @kthui
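
One thing worth double-checking, hinted at by the comment in the original script ("Need to specify large enough concurrency to issue all the inference requests to the server in parallel"): the HTTP client's concurrency argument bounds how many requests can be in flight at once, so with concurrency=1 the async_infer() calls are effectively issued one after another, adding overhead without any overlap. Below is a minimal sketch (assuming the same server URL, model names, and tensor shapes as the script above) of the async path with a larger concurrency value:

import time

import numpy as np
import tritonclient.http as httpclient

MODEL_NAMES = ["vgg19", "convnext"]
IN_FLIGHT = 8  # requests allowed in flight per client; with 1 they serialize

# One client per model, each allowing several concurrent requests.
clients = {
    name: httpclient.InferenceServerClient(url="localhost:8000",
                                           concurrency=IN_FLIGHT)
    for name in MODEL_NAMES
}

# Both models take a [4, 3, 224, 224] FP32 input named "begin".
data = np.zeros((4, 3, 224, 224), dtype=np.float32)
inputs = [httpclient.InferInput("begin", [4, 3, 224, 224], "FP32")]
inputs[0].set_data_from_numpy(data, binary_data=True)
outputs = [httpclient.InferRequestedOutput("output", binary_data=True)]

start = time.time()
pending = []
for _ in range(100):
    for name in MODEL_NAMES:
        # async_infer() returns immediately with a handle for the request.
        pending.append(clients[name].async_infer(model_name=name,
                                                 inputs=inputs,
                                                 outputs=outputs))

# get_result() blocks until the corresponding response arrives.
for request in pending:
    request.get_result()

print("Overall async execution time: {:.3f} seconds".format(time.time() - start))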