triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

Streaming and batching #43

Closed giuseppe915 closed 5 months ago

giuseppe915 commented 8 months ago

Hi, I'm new to PyTriton. I am trying to deploy a model and run inference with a text-generation pipeline, and I managed to get streaming to work as in the example. I would like to know how I can scale the deployment and whether batching is compatible with streaming.

Thank you in advance

giuseppe915 commented 7 months ago

Hi, do you have any news on this?

pziecina-nv commented 7 months ago

Hi @giuseppe915 ,

Firstly, I apologize for the late response to your query.

Here are some scale strategies you can consider:

  1. Dynamic Batching with Triton's Dynamic Batcher: Triton offers a dynamic batching feature that aggregates incoming requests into batches on the server side. To use it effectively, choose max_batch_size with the latency/throughput trade-off in mind (see the sketch after this list).

    Additionally, PyTriton will soon support a decoupled mode, expected by the end of this month. This mode will enable more application-specific batching strategies, such as continuous batch processing, which can be particularly beneficial for large language models (LLMs).

  2. Distributing the Model Across Multiple Devices: To scale your deployment, consider distributing the model across multiple GPUs. This can be done in two ways:

    • Multiple Instances of Inference Callables: Implement data parallelism by running multiple instances of the inference callable across different devices. This approach helps in distributing the workload evenly.
    • Model Parallelism Techniques: Utilize model parallelism strategies, including tensor parallelism and pipeline parallelism. These techniques divide the model itself across multiple GPUs and/or nodes, allowing for concurrent processing of different parts of the model.
  3. Model Optimizations for LLMs: For LLMs, consider using model optimizations to enhance performance. TensorRT-LLM is a viable option that provides state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. We are currently preparing an example to demonstrate the use of TensorRT-LLM with PyTriton, which should be available soon.
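
To make strategies 1 and 2 above more concrete, here is a minimal sketch combining dynamic batching with multiple inference-callable instances in a single bind() call. The model name, batch size, queue delay, and the per-device callable factory are illustrative assumptions rather than a drop-in configuration for your text-generation pipeline:

import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
from pytriton.triton import Triton


def make_infer_fn(device: str):
    # Hypothetical factory: each callable would run a model replica pinned to `device`.
    @batch
    def _infer_fn(input):
        # Placeholder compute; a real callable would run the text-generation model here.
        return {"output": input}

    return _infer_fn


triton = Triton()
triton.bind(
    model_name="TextGen",
    # A list of callables creates one model instance per callable (data parallelism).
    infer_func=[make_infer_fn("cuda:0"), make_infer_fn("cuda:1")],
    inputs=[Tensor(name="input", dtype=np.float64, shape=(-1,))],
    outputs=[Tensor(name="output", dtype=np.float64, shape=(-1,))],
    config=ModelConfig(
        max_batch_size=32,  # illustrative value; tune for the latency/throughput trade-off
        batcher=DynamicBatcher(max_queue_delay_microseconds=100),  # let requests queue briefly so larger batches can form
    ),
)
triton.run()

The DynamicBatcher settings map to Triton's dynamic_batching model-configuration options, so the usual Triton tuning guidance for batch size and queue delay applies.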

Regarding your question about batching compatibility with streaming: the streaming example uses Triton's dynamic batcher (enabled by default) and the @batch decorator. Could you clarify if you are referring to a different type of batching?

giuseppe915 commented 7 months ago

Hi @pziecina-nv, thanks for your reply. Regarding the question about batching compatibility: when I run the examples the requests are batched, but I receive an error saying that the response batch dimensions are not equal to the request batch dimensions. How can I avoid this?

Thanks Giuseppe

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.

giuseppe915 commented 6 months ago

Hi, is there any update on this?

piotrm-nvidia commented 5 months ago

The new version of PyTriton, 0.5.0, contains fixes to correctly handle batch dimensions for decoupled models. Still, you must be careful to set the batch dimensions correctly in the model configuration and in the input and output tensors. I prepared a code example for you that demonstrates how to use the @batch decorator and the decoupled=True setting to handle batch processing correctly.

Understanding Shapes and Batch Dimensions

Before diving into the code, it's essential to understand how data shapes and batch dimensions work. Consider an input tensor for a model. The shape of this tensor usually includes:

  1. Batch Dimension: The size of this dimension represents how many inputs you're processing at once. For instance, a batch dimension of 32 means you're processing 32 inputs simultaneously.
  2. Feature Dimensions: These dimensions represent the features of your input. For an image, this might include the height, width, and the number of color channels.

Let's look at a hypothetical example of a batch of inputs with shape (-1, 3, 224, 224), where -1 represents the batch dimension, 3 the number of color channels, and 224x224 the image resolution. This means you're processing a batch of images, each with 3 color channels and a resolution of 224x224 pixels:

[Batch Size, Feature Dim 1, Feature Dim 2, ...]
    |           |               |
    v           v               v
[***********][***********]  [***********]  ...  [***********]
   Input 1     Input 2         Input 3           Input N

Let's assume that your output tensor of class probabilities has shape (-1, 1000), where -1 represents the batch dimension (the same size as in the input tensor) and 1000 represents the number of classes. This means that for each input in the batch, you get a vector of class probabilities with 1000 elements:

[Batch Size, Num Classes]
    |           |
    v           v
[***********][***********]  ...  [***********]
   Output 1    Output 2           Output N
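
As a quick, self-contained illustration of these shapes (the batch size of 32 and the dummy arrays are just for demonstration):

import numpy as np

# A batch of 32 RGB images, each 224x224 pixels: shape (batch, channels, height, width).
images = np.zeros((32, 3, 224, 224), dtype=np.float32)

# The corresponding batch of class-probability vectors, 1000 classes per input.
probabilities = np.zeros((32, 1000), dtype=np.float32)

# Request and response must agree on the batch dimension (axis 0).
assert images.shape[0] == probabilities.shape[0]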

In the code example below, the declared input shape is (-1,), i.e., each input in the batch is a variable-length 1-D vector. This is the simplest case, but the same principles apply to higher-dimensional inputs.

Using the @batch Decorator

The @batch decorator from PyTriton automatically handles batch processing for your inference function (_infer_fn in the example below). Here's what's happening in this code:

  1. Batch Processing: The @batch decorator allows your inference function to process inputs in batches. This means that input in _infer_fn is a batch of inputs rather than a single input. It also expects the function to yield outputs whose batch dimension matches the request batch size.
  2. Delayed Execution with yield: The function uses yield to create a generator, so it can return partial results one at a time; the per-step processing time is simulated with time.sleep(2.0).

Code

Server Side:

from pytriton.decorators import batch
import time
import numpy as np

# Decorate your model function with `@batch`. This allows Triton to batch multiple requests together.
@batch
def _infer_fn(input):
    for _ in range(3):
        time.sleep(2.0)
        yield {"output": input}

# Create a Triton model configuration and bind it to the model function `_infer_fn`.
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig
triton = Triton()
triton.bind(
    model_name="Test",
    infer_func=_infer_fn,
    inputs=[
        Tensor(name="input", dtype=np.float64, shape=(-1,)),  
        # Shape with a batch dimension (-1) to support variable-sized batches.
    ],
    outputs=[
        Tensor(name="output", dtype=np.float64, shape=(-1,)),  
        # Output shape with a batch dimension (-1).
    ],
    config=ModelConfig(decoupled=True),
)

# Start the Triton server without blocking; you can do this from a notebook.
triton.run()

Client Side:

import time

import numpy as np

from pytriton.client import DecoupledModelClient

# Create a client for the "Test" model running on the local machine using gRPC.
client = DecoupledModelClient("grpc://localhost", "Test")

# Send a batch of inputs in a single call using `infer_batch()`; the decoupled client streams back partial results.
for result in client.infer_batch(np.array([[0.1],[0.2]])):
    print("RESULT", time.time(), result)

Output:

The output shows three partial responses arriving roughly 2 seconds apart (the delay introduced on the server side), each containing results for the full batch of two inputs:

RESULT 1706971257.5856066 {'output': array([[0.1],
       [0.2]])}
RESULT 1706971259.586204 {'output': array([[0.1],
       [0.2]])}
RESULT 1706971261.5903184 {'output': array([[0.1],
       [0.2]])}
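
When you are done experimenting, you can release the resources; this is a minimal cleanup sketch, assuming the client and triton objects from the snippets above are still in scope:

# Close the client connection and shut down the Triton server started with triton.run().
client.close()
triton.stop()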

Conclusion

By ensuring that your input and output tensors are correctly configured, and by using the @batch decorator together with the decoupled=True setting, you instruct Triton to return responses whose batch dimensions match the request batch dimensions, which avoids the mismatch error you were encountering.

If you have any further questions or need additional assistance, feel free to ask!