triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Generating patches of image on server + Dynamic Batching #7402

Open j-sheikh opened 1 week ago

j-sheikh commented 1 week ago

Description

I use a model ensemble with 3 models: a pre-processor, an inference model, and a post-processor. I want to send one image to the server, generate n patches of that image in the pre-processor model, and then process the patches with dynamic batching in the inference model. If I understand correctly, Triton expects a 1-to-1 request-response mapping, so I am wondering whether this is even possible.

Triton Information

Triton Server version 2.42.0

Configs:

# config.pbtxt: ensemble
name: "ensemble"
platform: "ensemble"
input [
  {
    name: "input_image"
    data_type: TYPE_FP32
    dims: [ 224, 224, 3 ]  # Image dimensions (height, width, channels)
  }
]
output [
  {
    name: "output_image"
    data_type: TYPE_FP32
    dims: [ 224, 224, 3 ]  # Image dimensions (height, width, channels)
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "pre_processor"
      model_version: -1
      input_map { key: "pre_proc_input" value: "input_image" }
      output_map { key: "patch_output" value: "patch_output" }
      output_map { key: "patch_metadata" value: "patch_metadata" }
    },
    {
      model_name: "simple_model"
      model_version: -1
      input_map { key: "simple_model_input" value: "patch_output" }
      output_map { key: "simple_model_output" value: "simple_model_output" }
    },
    {
      model_name: "post_processor"
      model_version: -1
      input_map { key: "pos_proc_input" value: "simple_model_output" }
      input_map { key: "metadata_input" value: "patch_metadata" }
      output_map { key: "pos_proc_output" value: "output_image" }
    }
  ]
}

# config.pbtxt: pre_processor
name: "pre_processor"
backend: "python"
input [
  {
    name: "pre_proc_input"
    data_type: TYPE_FP32
    dims: [ 224, 224, 3 ]  # Image dimensions (height, width, channels)
  }
]
output [
  {
    name: "patch_output"
    data_type: TYPE_FP32
    dims: [ -1, 112, 112, 3 ]  # Batch of patches (batch_size, patch_height, patch_width, channels)
  },
  {
    name: "patch_metadata"
    data_type: TYPE_STRING
    dims: [ -1 ]  # Metadata as serialized strings
  }
]
instance_group [ { kind: KIND_CPU, count: 1 } ]

# config.pbtxt: simple_model
name: "simple_model"
backend: "python"
max_batch_size: 2
version_policy: { latest { num_versions: 1 } }
input [
  {
    name: "simple_model_input"
    data_type: TYPE_FP32
    dims: [ 112, 112, 3 ]  # One patch (patch_height, patch_width, channels); batch dimension added by max_batch_size
  }
]
output [
  {
    name: "simple_model_output"
    data_type: TYPE_FP32
    dims: [ 112, 112, 3 ]  # One patch (patch_height, patch_width, channels)
  }
]
instance_group [ { kind: KIND_GPU, count: 1 } ]
dynamic_batching {
  preferred_batch_size: [ 1, 2 ]
  max_queue_delay_microseconds: 100
}

# config.pbtxt: post_processor
name: "post_processor"
backend: "python"
version_policy: { latest { num_versions: 1 } }
input [
  {
    name: "pos_proc_input"
    data_type: TYPE_FP32
    dims: [ -1, 112, 112, 3 ]  # Batch of patches (batch_size, patch_height, patch_width, channels)
  },
  {
    name: "metadata_input"
    data_type: TYPE_STRING
    dims: [ -1 ]  # Metadata as serialized strings
  }
]
output [
  {
    name: "pos_proc_output"
    data_type: TYPE_FP32
    dims: [ 224, 224, 3 ]  # Original image dimensions
  }
]
instance_group [ { kind: KIND_CPU, count: 1 } ]  # Use CPU or GPU as needed

Expected behavior

Send one image to the server. In the pre-processor, split it into n patches. Batch the patches together and send them to the inference model; ideally, I would later increase the instance count to process more batches of patches simultaneously. In the post-processor, combine the patches back into one image and return it to the client (or, when using async, return each processed patch as soon as it is done).
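For reference, a minimal sketch of a pre_processor model.py that would match the config above (illustrative only, not the actual implementation: it tiles the 224x224 input into four non-overlapping 112x112 patches and serializes each patch's offsets as metadata):

```python
# Sketch of the "pre_processor" Python backend model: split the 224x224x3 input
# into 112x112x3 patches and emit them as one [n, 112, 112, 3] tensor plus
# per-patch metadata strings. The tensor names come from the config above.
import json
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            image = pb_utils.get_input_tensor_by_name(
                request, "pre_proc_input").as_numpy()  # shape [224, 224, 3]

            patches, metadata = [], []
            for row in range(0, 224, 112):
                for col in range(0, 224, 112):
                    patches.append(image[row:row + 112, col:col + 112, :])
                    metadata.append(json.dumps({"row": row, "col": col}))

            patch_tensor = pb_utils.Tensor(
                "patch_output", np.stack(patches).astype(np.float32))
            meta_tensor = pb_utils.Tensor(
                "patch_metadata", np.array(metadata, dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[patch_tensor, meta_tensor]))
        return responses
```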

j-sheikh commented 1 week ago

I discovered that we can avoid the 1:1 mapping with decoupled models. I was able to integrate this into my pipeline and send the patches from individual threads, but I was not able to get the dynamic batcher of the inference model to kick in, even though I defined it in the config. Do you have any suggestions besides implementing batching inside the model itself?
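Note that the dynamic batcher only merges separate requests; it never splits one large request, so a single [n, 112, 112, 3] tensor handed to simple_model stays one request. A rough sketch of the per-patch fan-out described above, written with the Python backend's async BLS API instead of threads. This assumes an orchestrator-style Python model whose input reuses the name patch_output; the output name processed_patches is only a placeholder for illustration:

```python
# Rough sketch (not a drop-in solution): send each patch to "simple_model" as its
# own BLS request so the dynamic batcher has separate requests it can merge.
import asyncio
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    async def execute(self, requests):
        responses = []
        for request in requests:
            patches = pb_utils.get_input_tensor_by_name(
                request, "patch_output").as_numpy()  # shape [n, 112, 112, 3]

            # One InferenceRequest per patch; keep a batch dimension of 1 so the
            # shape matches simple_model's max_batch_size semantics.
            infer_awaits = []
            for patch in patches:
                bls_request = pb_utils.InferenceRequest(
                    model_name="simple_model",
                    requested_output_names=["simple_model_output"],
                    inputs=[pb_utils.Tensor(
                        "simple_model_input",
                        patch[np.newaxis, ...].astype(np.float32))])
                infer_awaits.append(bls_request.async_exec())

            # All requests are in flight concurrently, giving the dynamic batcher
            # a chance to group them; gather the responses in order.
            bls_responses = await asyncio.gather(*infer_awaits)

            outputs = []
            for bls_response in bls_responses:
                if bls_response.has_error():
                    raise pb_utils.TritonModelException(
                        bls_response.error().message())
                outputs.append(pb_utils.get_output_tensor_by_name(
                    bls_response, "simple_model_output").as_numpy()[0])

            # "processed_patches" is a placeholder output name for this sketch.
            out_tensor = pb_utils.Tensor(
                "processed_patches", np.stack(outputs).astype(np.float32))
            responses.append(pb_utils.InferenceResponse(
                output_tensors=[out_tensor]))
        return responses
```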