Closed giuseppe915 closed 5 months ago
Hi do you have any news on this?
Hi @giuseppe915 ,
Firstly, I apologize for the late response to your query.
Here are some scaling strategies you can consider:
Dynamic Batching with Triton's Dynamic Batcher: Triton offers a dynamic batching feature that allows you to aggregate requests into batches dynamically. To effectively use this, choose the max_batch_size considering the trade-off between latency and throughput.
Additionally, PyTriton will soon support a decoupled mode, expected by the end of this month. This mode will enable more application-specific batching strategies, such as continuous batch processing, which can be particularly beneficial for large language models (LLMs).
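To make the latency/throughput trade-off behind dynamic batching concrete, here is a minimal standard-library sketch of the idea. It is a toy illustration, not Triton's implementation: the function name collect_batch and the parameters max_batch_size and max_queue_delay_s are hypothetical names chosen for this example.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_queue_delay_s):
    """Gather up to max_batch_size requests, waiting at most max_queue_delay_s.

    A larger max_batch_size or delay favors throughput (fuller batches);
    smaller values favor latency (requests wait less before being served).
    """
    batch = [requests.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_queue_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # delay budget used up: serve what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived within the delay budget
    return batch


# Five requests are already waiting when the batcher runs.
pending = queue.Queue()
for request_id in range(5):
    pending.put(request_id)

first = collect_batch(pending, max_batch_size=4, max_queue_delay_s=0.05)
second = collect_batch(pending, max_batch_size=4, max_queue_delay_s=0.05)
print(first, second)
```

The first batch is capped by max_batch_size, while the second is released once the delay budget expires with only one request in it. That second case is the latency cost you accept in exchange for larger batches.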
Distributing the Model Across Multiple Devices: To scale your deployment, consider distributing the model across multiple GPUs. This can be done in two ways:
Model Optimizations for LLMs: For LLMs, consider using model optimizations to enhance performance. TensorRT-LLM is a viable option that provides state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. We are currently preparing an example to demonstrate the use of TensorRT-LLM with PyTriton, which should be available soon.
Regarding your question about batching compatibility with streaming: the streaming example uses Triton's dynamic batcher (enabled by default) and the @batch decorator. Could you clarify if you are referring to a different type of batching?
Hi @pziecina-nv, thanks for your reply. Regarding the question about batching compatibility: when I run the examples, the requests are batched, but I receive an error saying that the response batch dimension is not equal to the request batch dimension. How can I avoid this?
Thanks Giuseppe
This issue is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Hi, is there any update on this?
The new version of PyTriton, 0.5.0, contains fixes that correctly handle batch dimensions for decoupled models. You must still be careful to set the batch dimensions correctly in the model configuration and in the input and output tensors. I prepared a code example for you that demonstrates how to use the @batch decorator and the decoupled=True setting to handle batch processing correctly.
Before diving into the code, it's essential to understand how data shapes and batch dimensions work. Consider an input tensor for a model. The shape of this tensor usually includes:
Let's look at a hypothetical example of a batch of inputs with shape (-1, 3, 224, 224), where -1 represents the batch dimension, 3 represents the number of color channels, and 224x224 represents the image dimensions. This means you're processing a batch of images, each with 3 color channels and a resolution of 224x224 pixels:
[Batch Size, Feature Dim 1, Feature Dim 2, ...]
      |            |              |
      v            v              v
[***********][***********] [***********] ... [***********]
   Input 1      Input 2       Input 3          Input N
Let's assume that your output tensor of class probabilities has the same batch dimension as the input tensor, i.e., its shape is (-1, 1000), where -1 represents the batch dimension and 1000 represents the number of classes. This means that for each input in the batch, you get a vector of class probabilities with 1000 elements:
[Batch Size, Num Classes]
      |           |
      v           v
[***********][***********] ... [***********]
  Output 1     Output 2         Output N
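The rule illustrated by the two diagrams (one output row per input row) can be checked with a small helper. The sketch below uses plain nested lists instead of NumPy arrays so it runs anywhere. The names check_batch_axis and classify are hypothetical and are not part of PyTriton, and 4 classes are used instead of 1000 to keep it short.

```python
def check_batch_axis(request, response):
    """Return the shared batch size, or raise if any tensor disagrees.

    Both arguments map tensor names to nested lists whose first axis
    is the batch axis.
    """
    sizes = {name: len(tensor) for name, tensor in {**request, **response}.items()}
    if len(set(sizes.values())) != 1:
        raise ValueError(f"batch sizes differ: {sizes}")
    return next(iter(sizes.values()))


def classify(images, num_classes=4):
    """Hypothetical model: one uniform probability vector per input."""
    return [[1.0 / num_classes] * num_classes for _ in images]


images = [[0.0] * 8 for _ in range(3)]  # batch of 3 flattened toy "images"
probs = classify(images)

# 3 inputs in, 3 probability vectors out: the batch axes match.
n = check_batch_axis({"input": images}, {"output": probs})

# Dropping one output row reproduces the kind of mismatch reported above.
try:
    check_batch_axis({"input": images}, {"output": probs[:2]})
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

A check like this inside your own inference function, run before returning or yielding, can help locate where the batch axis is lost (for example after a squeeze or a reduction over axis 0).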
In the code example below, both tensors are declared with shape (-1,): each input is a 1-D vector of variable length, and the batch axis comes on top of that declared shape. This is the simplest case, but the same principles apply to higher-dimensional inputs.
@batch Decorator
The @batch decorator from pytriton is a powerful tool that automatically handles batch processing for your inference function (_infer_fn in your case). Here's what's happening in your code:
- The @batch decorator allows your inference function to process inputs in batches. This means that input in _infer_fn is a batch of inputs rather than a single input. It also expects every output the function yields to match the batch size of the input.
- yield: Your function uses yield to create a generator. This allows you to return results one at a time, simulating a scenario where each processing step might take some time (simulated with time.sleep(2.0)).
- Define _infer_fn as your inference function and decorate it with @batch. This function processes inputs in batches and yields outputs one at a time.
- Set decoupled=True in your ModelConfig. This is important because it allows the Triton server to send responses back as soon as they're ready, rather than waiting for the entire batch to be processed. This is crucial for your setup since you're introducing an artificial delay (with time.sleep) in your inference function.
from pytriton.decorators import batch
import time
import numpy as np
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# Decorate your model function with `@batch`. This allows Triton to batch multiple requests together.
@batch
def _infer_fn(input):
    # Yield three partial responses, 2 seconds apart; each one keeps the batch axis of `input`.
    for _ in range(3):
        time.sleep(2.0)
        yield {"output": input}

# Create a Triton model configuration and bind it to the model function `_infer_fn`.
triton = Triton()
triton.bind(
    model_name="Test",
    infer_func=_infer_fn,
    inputs=[
        # Shape of one sample; -1 marks a variable-length dimension.
        # With batching enabled (the default), the batch axis comes on top of this shape.
        Tensor(name="input", dtype=np.float64, shape=(-1,)),
    ],
    outputs=[
        # The output is declared the same way, so its batch axis matches the input's.
        Tensor(name="output", dtype=np.float64, shape=(-1,)),
    ],
    config=ModelConfig(decoupled=True),
)
# Start the Triton server without blocking; this also works in a notebook.
triton.run()
- Use DecoupledModelClient from pytriton.client to send requests to your Triton server.
- Use infer_batch to send a batch of inputs ([0.1], [0.2]) to the server.
import time
import numpy as np
from pytriton.client import DecoupledModelClient

# Create a client for the "Test" model running on the local machine using gRPC.
client = DecoupledModelClient("grpc://localhost", "Test")
# Send a batch of two samples in a single call using `infer_batch()`;
# partial responses are printed as they arrive.
for result in client.infer_batch(np.array([[0.1], [0.2]])):
    print("RESULT", time.time(), result)
The output shows three partial responses, each containing the whole batch of two inputs and arriving about 2 seconds after the previous one, which matches the delay introduced in your server-side code:
RESULT 1706971257.5856066 {'output': array([[0.1],
       [0.2]])}
RESULT 1706971259.586204 {'output': array([[0.1],
       [0.2]])}
RESULT 1706971261.5903184 {'output': array([[0.1],
       [0.2]])}
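Each RESULT line above carries both rows because every partial response keeps the batch axis of the request. For a text-generation use case, the same pattern would yield one response per generation step, with one row per prompt. The following standard-library sketch only illustrates that shape contract: generate is a hypothetical stand-in for a model, and no PyTriton API is involved.

```python
def generate(prompts, steps=3):
    """Toy decoupled-style handler.

    Yields `steps` partial responses. Each response has exactly one row
    per prompt, so its batch size always equals the request batch size.
    """
    for step in range(steps):
        # One new "token" per prompt at every step; the outer list is the batch axis.
        yield {"output": [[f"{prompt}:{step}"] for prompt in prompts]}


prompts = ["a", "b"]  # request batch of 2
responses = list(generate(prompts))

for response in responses:
    print(response)
```

Yielding a single row per response (for example only the result for the first prompt) would break this contract and is one way to run into a batch-dimension mismatch error.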
By ensuring that your input and output tensors are correctly configured and by utilizing the @batch decorator and decoupled=True setting, you're effectively instructing Triton to handle batch processing in a way that matches the request batch dimensions with the response batch dimensions, thereby avoiding the mismatch error you were encountering.
If you have any further questions or need additional assistance, feel free to ask!
Hi, I'm new to PyTriton. I am trying to deploy a model and run inference with a text-generation pipeline, and I managed to get streaming to work as per the example. I would like to know how I can scale the deployment and whether batching is compatible with streaming.
Thank you in advance