triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

Example of (or support for) Inference Callable of Triton ensemble definition #31

Closed michaelhagel closed 9 months ago

michaelhagel commented 11 months ago

Is your feature request related to a problem? Please describe. I currently have an automated config generator that creates proper ensemble pbtxt configs over a DAG of backends. I do not see a natural way of defining and/or referencing Triton ensembles with the PyTriton module.

Describe the solution you'd like I would like to either: (a) create the ensemble definition directly with the PyTriton module, or (b) load all ensemble models and the associated (already generated via the method above) ensemble and model configs.

Describe alternatives you've considered Our current solution mentioned in the feature request does the job, but does not utilize PyTriton's friendly interface.

Additional context N/A

michaelhagel commented 11 months ago

I should add that I'm willing to contribute stronger ensemble support as well, given discussion and RFC.

Lzhang-hub commented 10 months ago

@michaelhagel Is there some code for an automated config generator in the NVIDIA repos? I am looking for it. Thanks very much if you can give some advice.

jkosek commented 10 months ago

@Lzhang-hub if you are looking for an automated config generator, you may want to take a look at https://github.com/triton-inference-server/model_navigator. There is functionality to create the Triton model store with a generated config.pbtxt: https://triton-inference-server.github.io/model_navigator/0.7.1/triton/triton_deployment/

However, there is no ensemble support.

michaelhagel commented 10 months ago

@Lzhang-hub sorry, I should have clarified -- that ensemble generator is part of some tooling at the company I currently work for, and is (for the moment, hopefully not for long) closed source.

In the context of the feature request, I also suppose the config generator part is not truly applicable to my question. To clarify: there is currently great support for ensembles within Triton Inference Server, where an ensemble is viewed as a single model from the client's point of view. This abstraction is in many cases preferable to writing a single Python backend with multiple steps, since each "step" (read: model or processing) can be optimized independently with other Triton primitives. In addition, the ensemble definition is then not in Python code and allows for relatively easy config generation, letting engineers abstract the serving process away from the model developers/data scientists.
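
For concreteness, below is a rough sketch of the kind of ensemble config.pbtxt such a generator emits for a two-step pipeline (all model and tensor names are made up for illustration; the fields follow Triton's documented ensemble scheduling schema):

# Illustrative sketch only: writes a minimal two-step ensemble config.pbtxt
# into a Triton model repository. Model and tensor names are placeholders.
from pathlib import Path

ENSEMBLE_CONFIG = """
name: "pipeline"
platform: "ensemble"
max_batch_size: 8
input [ { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] } ]
output [ { name: "SCORES", data_type: TYPE_FP32, dims: [ 1000 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map { key: "INPUT", value: "RAW_IMAGE" }
      output_map { key: "OUTPUT", value: "preprocessed_image" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map { key: "INPUT", value: "preprocessed_image" }
      output_map { key: "OUTPUT", value: "SCORES" }
    }
  ]
}
"""

# The ensemble entry in the model repository needs only the config and an empty version directory.
repo_dir = Path("model_repository/pipeline")
(repo_dir / "1").mkdir(parents=True, exist_ok=True)
(repo_dir / "config.pbtxt").write_text(ENSEMBLE_CONFIG.strip() + "\n")

The client then calls the "pipeline" model as if it were a single model, while Triton schedules the composing models independently.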

Here we can see an example where an ensemble of Model Navigator packages is served with PyTriton without the Triton ensemble scheduler.

Is this the approach preferred by the PyTriton team? I am still undecided as to whether this approach eschews some of the benefits of the native ensemble scheduling approach or is preferable...

piotrm-nvidia commented 10 months ago

In my opinion PyTriton is not designed to be a model management tool. It does not support creating or loading ensemble models directly from the library. PyTriton is designed as a Python environment that lets you serve inference requests through Triton Inference Server using a simple and intuitive interface.

Why do you need ensembles in PyTriton?

Ensembles are useful for creating complex pipelines of models that can run on different backends and devices. For example, you may want to use an ensemble to combine a PyTorch model that runs on a CPU with an ONNX model that runs on a GPU, and pass the output of one model to the input of another. Ensembles can also help you optimize the performance and latency of your models by using features such as dynamic batching or response cache.

Can you tell me more about your use case and why you need to create and load ensemble models directly from PyTriton? I would like to understand your scenario better.

  1. Which of ensemble features are useful in your scenario?
  2. Do you want to use dynamic batching or response cache?
  3. Do you want to use ensemble for combining models from different backends and devices?

Ensemble-like flow in PyTriton

There are two ways you can implement something very similar to ensembles with PyTriton:

Callable similar to Triton Ensemble Definition

PyTriton can be used to build a complex pipeline in Python and run it directly in the inference callable. For example, you can use the following code to run two PyTorch models in sequence:

import numpy as np
from pytriton.decorators import batch
import torch

model_1 = torch.nn.Linear(2, 3).eval()
model_2 = torch.nn.Linear(3, 4).eval()

@batch
def infer_fn(**inputs: np.ndarray):
    (input1_batch,) = inputs.values()
    input1_batch_tensor = torch.from_numpy(input1_batch)
    # Running ensemble-like flow directly in python
    output1_batch_tensor = model_2(model_1(input1_batch_tensor))
    output1_batch = output1_batch_tensor.detach().numpy()
    return [output1_batch]
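
For completeness, a minimal sketch of how such a callable could be bound and served with PyTriton (the model and tensor names here are placeholders; the shapes match the two Linear layers above):

import numpy as np

from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# Bind the callable as a single Triton model; the whole "ensemble" lives inside infer_fn.
with Triton() as triton:
    triton.bind(
        model_name="linear_pipeline",
        infer_func=infer_fn,
        inputs=[Tensor(name="input1", dtype=np.float32, shape=(2,))],
        outputs=[Tensor(name="output1", dtype=np.float32, shape=(4,))],
        config=ModelConfig(max_batch_size=8),
    )
    triton.serve()  # blocks and exposes the HTTP/gRPC endpoints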

Triton Model Navigator inplace optimization

Triton Model Navigator enables zero-copy data passing between heterogeneous model formats, such as PyTorch and ONNX. It provides runners, so you can run heterogeneous models in a single inference request.

The Navigator developers are also testing an alpha feature, called inplace optimization, for complex models that cannot easily be converted as a whole into ONNX or similar formats. It can combine many inference technologies into a single model that can be deployed on the Triton Inference Server. All that is required is to wrap a module with a single line of code:

import model_navigator as nav

pipeline = Pipeline(...)  # an existing Python pipeline object (placeholder)
pipeline.model = nav.Module(pipeline.model)  # wrap the module so Model Navigator can optimize it in place

pipeline(...)  # run the pipeline as before; the wrapped module is handled by Model Navigator

The Navigator developers prepared an example of this feature for Stable Diffusion. It would be very helpful if you could test it and provide feedback for your use case.

Summary

I hope this helps you with your AI projects. Please let me know if you have any other questions or feedback. I'm always happy to hear from users of PyTriton.

Lzhang-hub commented 10 months ago

> @Lzhang-hub if you are looking for an automated config generator, you may want to take a look at https://github.com/triton-inference-server/model_navigator. There is functionality to create the Triton model store with a generated config.pbtxt: https://triton-inference-server.github.io/model_navigator/0.7.1/triton/triton_deployment/
>
> However, there is no ensemble support.

Thank you very much, this is what I'm looking for.

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 9 months ago

This issue was closed because it has been stalled for 7 days with no activity.

jinluyang commented 7 months ago

> • Which of ensemble features are useful in your scenario?
> • Do you want to use dynamic batching or response cache?
> • Do you want to use ensemble for combining models from different backends and devices?

Hello! Though this issue was closed, I think I am looking for the same ensemble feature. To answer the question about which ensemble features are useful in my scenario: I am looking for methods that improve GPU utilization (as reported by nvidia-smi). In my case, on powerful GPUs, the model's pre-processing and post-processing, which run on the CPU, often slow down the application, leaving GPU utilization at about 60%. These pre-processing and post-processing functions, like cv2.resize(), need to be part of the service rather than part of the client, because my application demands it. So my goal is to ensemble these pre-processing and post-processing functions with the PyTorch model. Do you suggest that I use the TIS Python backend?
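
For illustration, a rough sketch of what I mean, keeping the cv2 pre-processing and post-processing inside a PyTriton inference callable (the model, shapes, and tensor names below are placeholders, not my real pipeline):

import cv2
import numpy as np
import torch
from pytriton.decorators import batch

# Placeholder network standing in for the real PyTorch model.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1).eval()

@batch
def infer_fn(image: np.ndarray):
    # CPU pre-processing as part of the service: resize each image in the batch with cv2.
    resized = np.stack([cv2.resize(img, (224, 224)) for img in image])
    tensor = torch.from_numpy(resized).permute(0, 3, 1, 2).float() / 255.0  # NHWC -> NCHW

    with torch.inference_mode():
        logits = model(tensor)

    # CPU post-processing: threshold to a uint8 mask and return NHWC.
    mask = (torch.sigmoid(logits) > 0.5).permute(0, 2, 3, 1).numpy().astype(np.uint8)
    return {"mask": mask}

This keeps the pre/post-processing in the service as required, but it still runs on the CPU, so it would not by itself raise GPU utilization.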

piotrm-nvidia commented 7 months ago

Thank you for your question. I understand that you want to improve the GPU utilization of your application by ensembling the pre-processing and post-processing functions with the PyTorch model. I recommend that you use PyTriton with DALI, which is a library that can perform pre-processing and post-processing on GPU efficiently. You can also use CV2 but it runs on CPU and can slow down the application.

PyTriton is a framework that allows you to define your inference pipeline using Python functions. You can use DALI to create functions that preprocess and postprocess your inputs and outputs on GPU. You can also use PyTorch to create your image processing model.

You can find an example of how to use PyTriton with DALI and PyTorch in the PyTriton repository. The file server.py shows how to define an inference pipeline for video segmentation. The function _infer_fn is the main entry point for the pipeline. It takes the encoded video as input and returns the original and segmented frames as output. The function preprocess uses DALI to decode and resize the video frames on GPU. The function segmentation uses PyTorch to run the segmentation model on GPU. The function postprocess uses DALI to overlay the segmentation mask on the original frames on GPU. You can see how these functions are decorated with @batch to enable dynamic batching and improve performance.


@batch
def _infer_fn(**inputs):
    encoded_video = inputs["video"]

    image, input = preprocess(encoded_video)
    batch_size, frames_num = image.shape[:2]

    input = input.reshape(-1, *input.shape[-3:])  # NFCHW to NCHW (flattening first two dimensions)
    image = image.reshape(-1, *image.shape[-3:])  # NFHWC to NHWC (flattening first two dimensions)

    prob = segmentation(input)
    out = postprocess(image, prob)

    return {
        "original": image.cpu().numpy().reshape(batch_size, frames_num, *image.shape[-3:]),
        "segmented": out.as_cpu().as_array().reshape(batch_size, frames_num, *image.shape[-3:]),
    }

I hope this example helps you to achieve your goal. Please let me know if you have any questions or feedback.

jinluyang commented 7 months ago

> Thank you for your question. I understand that you want to improve the GPU utilization of your application by ensembling the pre-processing and post-processing functions with the PyTorch model. I recommend that you use PyTriton with DALI, which is a library that can perform pre-processing and post-processing on GPU efficiently. You can also use CV2 but it runs on CPU and can slow down the application.
>
> PyTriton is a framework that allows you to define your inference pipeline using Python functions. You can use DALI to create functions that preprocess and postprocess your inputs and outputs on GPU. You can also use PyTorch to create your image processing model.
>
> You can find an example of how to use PyTriton with DALI and PyTorch in the PyTriton repository. The file server.py shows how to define an inference pipeline for video segmentation. The function _infer_fn is the main entry point for the pipeline. It takes the encoded video as input and returns the original and segmented frames as output. The function preprocess uses DALI to decode and resize the video frames on GPU. The function segmentation uses PyTorch to run the segmentation model on GPU. The function postprocess uses DALI to overlay the segmentation mask on the original frames on GPU. You can see how these functions are decorated with @batch to enable dynamic batching and improve performance.
>
> @batch
> def _infer_fn(**inputs):
>     encoded_video = inputs["video"]
>
>     image, input = preprocess(encoded_video)
>     batch_size, frames_num = image.shape[:2]
>
>     input = input.reshape(-1, *input.shape[-3:])  # NFCHW to NCHW (flattening first two dimensions)
>     image = image.reshape(-1, *image.shape[-3:])  # NFHWC to NHWC (flattening first two dimensions)
>
>     prob = segmentation(input)
>     out = postprocess(image, prob)
>
>     return {
>         "original": image.cpu().numpy().reshape(batch_size, frames_num, *image.shape[-3:]),
>         "segmented": out.as_cpu().as_array().reshape(batch_size, frames_num, *image.shape[-3:]),
>     }
>
> I hope this example helps you to achieve your goal. Please let me know if you have any questions or feedback.

Hello~, thanks for your reply. At the moment I won't use DALI for this, because some other operations must run on the CPU, and I'm also concerned that interpolation results may differ between CPU and GPU. I have also recently found that pre-processing/post-processing may not be the only cause of low GPU utilization: the CPU launches kernels more slowly than the GPU executes them, which may also be a factor. In my opinion, there may not be a universal solution that guarantees ~100% GPU utilization.
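
As a rough way to check the launch-overhead hypothesis (the toy model below is only a placeholder, not my real network, and this assumes a CUDA GPU is available): if the time per iteration barely grows with batch size, the CPU launch path rather than GPU compute is the likely limiter.

import time
import torch

# Many small layers -> many kernel launches per forward pass.
model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(20)]).cuda().eval()

@torch.inference_mode()
def time_per_iter(batch_size, iters=50):
    x = torch.randn(batch_size, 1024, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()  # wait for all launched kernels before stopping the clock
    return (time.perf_counter() - start) / iters

for bs in (1, 8, 64):
    print(f"batch={bs}: {time_per_iter(bs) * 1e3:.2f} ms/iter")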