triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Multi-instancing a model on a GPU does not increase throughput in Triton. #7108

Open ign4si opened 5 months ago

ign4si commented 5 months ago

Description

Multi-instantiating a model on a GPU does not increase throughput when requests are sent from two different threads.

Triton Information

| Option | Value |
| --- | --- |
| server_id | triton |
| server_version | 2.42.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /models |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |

To launch the server, I use the following command:

docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:24.01-py3 tritonserver --model-repository=/models

These are my GPU specs (see the attached screenshot).

To Reproduce

I use a ResNet50 model; this is the model configuration file I use:

      name: "resnet50"
      platform: "pytorch_libtorch"
      max_batch_size : 4
      input [
        {
          name: "input__0"
          data_type: TYPE_FP32
          dims: [ 3, 224, 224 ]
        }
      ]
      output [
        {
          name: "output__0"
          data_type: TYPE_FP32
          dims: [ 1000, 1, 1 ]
        }
      ]

      instance_group [
        {
          count: 4
          kind: KIND_GPU
          gpus: [ 0 ]
        }
      ]
      dynamic_batching {
      }
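
A quick way to confirm that the instance_group settings above were actually picked up is to read the loaded model configuration back through the tritonclient HTTP API. This is only a minimal sketch; it assumes the server is reachable on localhost:8000 and the model is named resnet50, as above:

    import tritonclient.http as httpclient

    # Connect to the HTTP port published by the docker run command above.
    client = httpclient.InferenceServerClient(url="localhost:8000")

    # get_model_config returns the configuration Triton loaded for this model,
    # including the instance_group section, so the instance count can be checked.
    config = client.get_model_config("resnet50")
    print(config.get("instance_group"))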

I want to send batches whose size equals max_batch_size, so that when two batches arrive at the same time the server has to dispatch them to two different instances of the model.

To run my client script I use the following code:

import numpy as np
import tritonclient.http as httpclient
from PIL import Image
from torchvision import transforms
import time

# Preprocessing function (standard ImageNet preprocessing for ResNet50)
def rn50_preprocess(img_path="img1.jpg"):
    img = Image.open(img_path)
    preprocess = transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ]
    )
    return preprocess(img).numpy()

transformed_img = rn50_preprocess()
transformed_img = np.expand_dims(transformed_img, axis=0)
# Duplicate the image along the batch dimension until it reaches max_batch_size (4)
transformed_img = np.concatenate((transformed_img, transformed_img), axis=0)
transformed_img = np.concatenate((transformed_img, transformed_img), axis=0)

# Setting up the client
client = httpclient.InferenceServerClient(url="localhost:8000")

inputs = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32")
inputs.set_data_from_numpy(transformed_img, binary_data=True)

outputs = httpclient.InferRequestedOutput(
    "output__0", binary_data=True, class_count=1000
)

# Querying the server
for i in range(1000):
    start = time.time()
    results = client.infer(model_name="resnet50", inputs=[inputs], outputs=[outputs])
    end = time.time()
    print(f"Time taken for inference: {(end - start) * 1000:.2f} ms")

First, I instantiate just ONE (1) instance of the model on my GPU. Then I run my Python script and I receive a response in approximately 11 ms. When I run a second thread while the first one is running, the response time increases, which makes sense since the server is receiving two requests and has just one instance to process them.

Then I repeat the experiment, but instantiate more instances of the model on my GPU. With two instances, I expect the server to route each request to whichever instance is free, so I anticipate a reduction in the response time seen by the two threads. However, the average response time remains the same as with a single instance. I attached a plot of the results, showing the response time for the first thread. The jump in response time corresponds to the start of the second thread. As you can see, the response time is the same regardless of the number of model instances, which does not make sense to me. What could be happening?

(Plot: response time of the first thread; latency jumps when the second thread starts and is unchanged by the number of model instances.)
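
One way to narrow this down is the statistics extension listed in the server information above: the per-model statistics report cumulative queue and compute times, which shows whether requests are piling up in the scheduler queue instead of being spread across instances. A minimal sketch using the tritonclient HTTP API (the exact JSON field names follow the statistics extension and may need adjusting):

    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Cumulative per-model statistics; durations are reported in nanoseconds.
    stats = client.get_inference_statistics(model_name="resnet50")
    for model_stats in stats["model_stats"]:
        agg = model_stats["inference_stats"]
        count = int(agg["success"]["count"])
        if count:
            queue_ms = int(agg["queue"]["ns"]) / count / 1e6
            compute_ms = int(agg["compute_infer"]["ns"]) / count / 1e6
            print(
                f"{model_stats['name']}: avg queue {queue_ms:.2f} ms, "
                f"avg compute {compute_ms:.2f} ms"
            )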

decadance-dance commented 5 months ago

I think we faced a similar issue. #7075