triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

[problem] How to allow multiple models to run on the same GPU at the same time? #48

Closed Firefly-Dance closed 8 months ago

Firefly-Dance commented 9 months ago

I'm using pytriton in my project, but I'm facing some problems.

My core server-side code is as follows:

import numpy as np
import torch
from torch.cuda import Event, synchronize

from pytriton.decorators import batch
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
from pytriton.triton import Triton

# CUDA events used to measure the GPU time of each inference call
start = Event(enable_timing=True)
end = Event(enable_timing=True)

# `model2` (the DenseNet-169 model) and `cuda_num` (the target GPU index) are defined elsewhere


@batch
def Densenet169(tensor, num=-1):
    # logging.info(f'num:{num}')
    tensor = torch.from_numpy(tensor).cuda(cuda_num)
    start.record()
    result = model2(tensor, num)
    end.record()
    synchronize()
    time = start.elapsed_time(end)  # elapsed GPU time in milliseconds
    return {'result': result.cpu().detach().numpy(),
            'time': np.full([1, 1], time, dtype=np.float32)}


batcher = DynamicBatcher(preferred_batch_size=[32])

with Triton() as triton:
    triton.bind(
        model_name="Densenet169",
        infer_func=Densenet169,
        inputs=[
            Tensor(name='tensor', dtype=np.float32, shape=(-1, -1, -1)),
            Tensor(name='num', dtype=np.int32, shape=(1,)),
        ],
        outputs=[
            Tensor(name='result', dtype=np.float32, shape=(-1,)),
            Tensor(name='time', dtype=np.float32, shape=(1,)),
        ],
        config=ModelConfig(batching=True, max_batch_size=128, batcher=batcher),
        # config=ModelConfig(batching=False),
        strict=False,
    )
    triton.serve()

As you can see, there are two inputs and two outputs in my server setup: the output 'result' is computed from the input 'tensor', and the output 'time' is the CUDA running time of the inference.

The 'result' is not that important, but 'time' is what I need. Triton does have metrics, but they are only rough numbers, not as precise as I need. I want the time each individual inference costs (or the time over several inferences, but Triton's metrics already cover that case).

After I modified my infer_func, things went unexpectedly:

  1. I need to maximize the GPU utilization rate, so I tried to increase the batch size by sending requests more frequently from more clients. However, PyTriton merges HTTP requests into one request, which measures the running time of several merged requests at once and returns it as a single running time, while the other merged requests get a void return. This is not what I expected.
  2. So I tried to send more than one sample in one request, which PyTriton prohibits, since an input can only have the shape of a single sample. I have not tried setting 'strict' to False in this case; I doubt it would work, so I haven't tried it yet.
  3. Then I found another package called tritonclient, with which I successfully passed more than one sample in one batch using the following code:

        # pip install tritonclient[http]
        import tritonclient.http as tritonhttpclient

        # y, num, batchsize, model_name and triton_client come from the surrounding setup code
        request = tritonhttpclient.InferInput('tensor', y.shape, 'FP32')
        request_num = tritonhttpclient.InferInput('num', [batchsize, 1], 'INT32')
        request.set_data_from_numpy(y)
        request_num.set_data_from_numpy(num)

        response = triton_client.infer(model_name, inputs=[request, request_num], model_version='1')
  4. With all of the above, I did increase the GPU utilization rate. However, a few problems still bother me:

    a) How can I turn off the automatic merging of HTTP requests without setting 'batching' to False? As I mentioned in 2, PyTriton merges HTTP requests into one request; is there any way to disable that? I tried using 'DynamicBatcher' as in the code above, and it works, but it still feels odd and unnatural. Could there be a more elegant way?

    b) How can I allow multiple models to run on the same GPU at the same time? The GPU utilization rate is not high enough, and I want requests to start running immediately when the server receives them, so that multiple models can run on the same GPU at the same time. What should I do?

    c) An extra question: is there any elegant way to achieve my goal? For scientific purposes, I want to push the GPU utilization rate as high as possible so that the models influence each other.

Firefly-Dance commented 9 months ago

I was using

nvidia-pytriton           0.2.5    

but things broke when I updated nvidia-pytriton to 0.4.2; the output is

ValueError: Received output tensors with different batch sizes: time: (1, 1). Expected batch size: 32. 

where 32 is my batch size, as I defined in 'batcher'.

Things are more complex than I expected. It seems that more shape checks were added in the update.

Does this mean I should look for an older version of nvidia-pytriton and hope things work as I expected?

Firefly-Dance commented 9 months ago

I may have solved the problem myself after a long struggle.

  1. The ValueError is caused by an upgrade in 0.4.0, which added the following change: 'Change: "batch" decorator raises a ValueError if any of the outputs have a different batch size than expected.' This is what causes the ValueError I commented on before. Switching to an older version like 0.3.0 makes it work again, even though it arguably should raise that error in a real-world application. But why not give that control to the user?

  2. Running multiple models on the same GPU at the same time has been possible since version 0.2.5, which includes the following change: 'new: Allow to execute multiple PyTriton instances in the same process and/or host'. Comparing 0.2.4 with 0.2.5, a new example called multi_instance_resnet50_pytorch was added, which suggests that a multi-instance feature could solve my problem (see the sketch below).
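
A minimal sketch of what such multi-instance binding on a single GPU might look like. It only assumes that triton.bind accepts a list of inference callables (one per instance); build_model is a placeholder for however the DenseNet-169 model is actually constructed.

import numpy as np
import torch

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton


def make_instance(device):
    # Each instance keeps its own copy of the model on the chosen device
    model = build_model().to(device).eval()  # build_model is a placeholder

    @batch
    def infer(tensor):
        data = torch.from_numpy(tensor).to(device)
        with torch.no_grad():
            result = model(data)
        return {"result": result.cpu().numpy()}

    return infer


with Triton() as triton:
    triton.bind(
        model_name="Densenet169",
        # Two independent instances placed on the same GPU;
        # Triton schedules incoming requests across them.
        infer_func=[make_instance("cuda:0"), make_instance("cuda:0")],
        inputs=[Tensor(name="tensor", dtype=np.float32, shape=(-1, -1, -1))],
        outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(batching=True, max_batch_size=128),
    )
    triton.serve()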

I will close this issue after I test upon my own project.

piotrm-nvidia commented 9 months ago

Your inquiries touch on several key aspects of using the PyTriton server, especially regarding batch processing, maximizing GPU utilization, and handling multiple models. Let's address your points one by one:

  1. Auto Merge of HTTP Requests and Batching:

    In Triton, merging HTTP requests into a single batched request is standard behavior to optimize throughput and resource utilization. You can influence it by setting a preferred_batch_size in the DynamicBatcher, and you can also set a maximum queue delay (see the sketch after this list).

  2. Multiple Samples in One Batch Prohibition:

    PyTriton's restriction on input shapes aligns with the typical batch-processing use case, where each input tensor in a batch has the same shape. However, if your application requires different shapes within a single batch, you might consider modifying the input processing in your infer_func to handle that variability. ModelClient's infer_batch method requires the batch dimension to be added by the user, while in infer_sample the batch dimension is handled by ModelClient's internal logic.

  3. Running Multiple Models on the Same GPU:

    To run multiple models simultaneously on the same GPU, you can leverage Triton's support for multiple model instances. Also consider Triton's model prioritization and queuing features to manage execution order and resource allocation. Note that creating several instances limits how big a batch can get, because Triton tries to split work among the instances instead of batching as many requests as possible into a single input tensor. Multiple instances help when your model has many layers and you want to utilize the GPU better. PyTriton doesn't control how you use the GPU, so in your model you must push NumPy tensors to the GPU and copy the results back yourself. You can place as many instances of a model on a single GPU as you like, until you hit the memory limit. You can also use multiple GPUs to run multiple models simultaneously. The ResNet50 example targets machines with multiple GPUs, like NVIDIA DGX, but you can adjust it to run several model instances on a single GPU.

  4. Handling the ValueError:

    The ValueError you encountered is due to the stricter output batch-size checking introduced in PyTriton 0.4.0. This check ensures consistent batch sizes across inputs and outputs, which is what most inference scenarios need. However, if your use case requires different batch sizes, you may need to disable strict mode and not use the @batch decorator, which assumes simple batching (see the per-request sketch after this list). Triton also supports decoupled mode, where you receive all requests from users directly, without dynamic batching handled by Triton; you can use DecoupledModelClient for such a use case.

  5. Maximizing GPU Utilization:

    Maximizing GPU utilization involves balancing the number of model instances, the size of the batches, and the complexity of the models. You might need to experiment with these parameters to find the optimal configuration for your specific GPU and models. Additionally, consider using NVIDIA's profiling tools to identify bottlenecks and optimize your setup.
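
As mentioned in point 1, here is a minimal sketch of a DynamicBatcher that trades a little latency for larger batches. It assumes the DynamicBatcher exposes Triton's max_queue_delay_microseconds setting from the dynamic batching configuration; the values are placeholders to tune for your workload.

from pytriton.model_config import DynamicBatcher, ModelConfig

# Wait up to 5 ms for more requests before launching a batch,
# and prefer batches of 32 or 64 samples when possible.
batcher = DynamicBatcher(
    preferred_batch_size=[32, 64],
    max_queue_delay_microseconds=5000,
)

config = ModelConfig(batching=True, max_batch_size=128, batcher=batcher)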

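For point 4, here is a rough sketch of an inference callable without the @batch decorator. It assumes the non-batched callable receives a list of dict-like requests, each mapping input names to NumPy arrays, and returns one response dictionary per request, so every request can get its own 'time' value.

import numpy as np

def densenet169_per_request(requests):
    responses = []
    for request in requests:
        tensor = request["tensor"]  # NumPy array for this single request
        # ... run the model here and measure the CUDA time for this request only ...
        elapsed_ms = 0.0  # placeholder for the measured time
        responses.append({
            "result": np.zeros((tensor.shape[0], 1), np.float32),
            # Replicate the time across the request's batch dimension
            # so output and input batch sizes match.
            "time": np.full((tensor.shape[0], 1), elapsed_ms, dtype=np.float32),
        })
    return responses

# Bound without the @batch decorator, for example:
# triton.bind(model_name="Densenet169", infer_func=densenet169_per_request, ...)
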
Remember, the goal is to strike a balance between the efficient use of GPU resources and the specific needs of your application, and this often involves a combination of configuration adjustments, client-side management, and potentially customizing the server-side code.

I used your code to create a simple example of a model that just returns a tensor of zeros. You can use it as a template to create your own model with working batching and multiple instances. I used PyTriton 0.4.2.

# Import the necessary modules from PyTriton
from pytriton.decorators import batch
from pytriton.triton import Triton
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
import numpy as np

# Define a function that uses the Densenet169 model to process a tensor input
# The @batch decorator handles batch requests to match them with function parameters
@batch
def Densenet169(tensor, num=-1):
    # Create a dummy result array with the same batch size as the input tensor
    result = np.zeros((tensor.shape[0], 1), np.float32)
    return {'result': result}

batcher = DynamicBatcher(preferred_batch_size=[32])

# Specify the number of instances to use for the model
# Multiple instances can improve GPU utilization, but batching may be enough
# You should test different configurations to find the optimal one for your model and GPU
INSTANCES_COUNT = 8

# Create a Triton object
triton = Triton()

# Bind the Triton object to the model name, function, inputs, outputs, and config
triton.bind(
    model_name="Densenet169",
    infer_func=[Densenet169] * INSTANCES_COUNT, # Use the same function for all instances
    inputs=[
        # Define the input tensors with their names, data types, and shapes
        # The shape of a single sample is (-1, -1, -1), meaning any size in each dimension
        Tensor(name='tensor', dtype=np.float32, shape=(-1, -1, -1)),
        # The shape of a single sample is (1,), meaning tensor with fixed size
        Tensor(name='num', dtype=np.int32, shape=(1,)),
    ],
    outputs=[
        # Define the output tensor with its name, data type, and shape
        # The shape of a single sample is (-1,), meaning a vector of any size
        Tensor(name='result', dtype=np.float32, shape=(-1,)),
    ],
    config=ModelConfig(
        batching=True, # Enable batching for the model
        max_batch_size=128, # Set the maximum batch size to 128
        batcher=batcher, # Use the DynamicBatcher object created earlier
    ),
    strict=True, # Enable strict mode to check for errors
)

To start the Triton server in interactive mode and check if it works as expected, you can run the following code:

# Start the Triton server
triton.run()

To use the same Python environment to run client code, you can run the following code:

# Import the ModelClient class from PyTriton
from pytriton.client import ModelClient

# Create a ModelClient object with the server URL and the model name
client = ModelClient("http://localhost", "Densenet169")

To send a single sample to the server, you can use the infer_sample method:

# Create a sample input tensor with shape (1, 4, 4)
tensor_sample = np.zeros((1, 4, 4), np.float32)
# Create a sample input scalar
num_sample = np.zeros(1, np.int32)

# Send the sample to the server and print the shape of the output tensor
print(client.infer_sample(tensor=tensor_sample, num=num_sample)["result"].shape)

Output:

(1,)

The Triton shapes define the shape of a single sample. The batch dimension is added by Triton when you enable batching.

To send multiple samples to the server, you can use the infer_batch method:

# Create a batch of two input tensors with shape (2, 1, 4, 4)
# The first dimension is the batch size
tensor_batch = np.zeros((2, 1, 4, 4), np.float32)
# Create a batch of two input scalars with shape (2, 1)
# The first dimension is the batch size
num_batch = np.zeros((2, 1), np.int32)

# Send the batch to the server and print the shape of the output tensor
print(client.infer_batch(tensor=tensor_batch, num=num_batch)["result"].shape)

Output:

(2, 1)

The infer_batch method requires the batch dimension to be added by the user.

I hope that this answer will help you to solve your problem.

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 8 months ago

This issue was closed because it has been stalled for 7 days with no activity.