triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

PyTorch model with Dictionary[Key,Tensor] output #7765

Closed cesumilo closed 4 days ago

cesumilo commented 2 weeks ago

Hi community, I'm sorry if I'm not posting this in the right place; I couldn't figure out where to ask a question about the Triton Inference Server.

Description

I have a PyTorch model deployed with the following Triton model configuration:

name: "my_super_model"
platform: "pytorch_libtorch"
max_batch_size: 0
input [
  {
    name: "input0"
    data_type: TYPE_FP32
    dims: [ -1, 160, 160, 3 ]
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "output__1"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output__2"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "output__3"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output__4"
    data_type: TYPE_FP32
    dims: [ -1, 1024 ]
  }
]
response_cache {
  enable: true
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

The output of the model is a dictionary mapping string keys to lists of tensors.
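For illustration, here is a minimal sketch of what such a forward() looks like; the layer and key names below are hypothetical, not taken from the actual model:

import torch
import torch.nn as nn

class DictOutputModel(nn.Module):
    # Hypothetical sketch of a model whose forward() returns a dictionary of tensors.
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(160 * 160 * 3, 1024)   # placeholder feature extractor

    def forward(self, x: torch.Tensor):
        features = self.embed(x.flatten(1))           # [N, 1024]
        return {
            "embeddings": features,                   # e.g. output__4: FP32, [-1, 1024]
            "scores": features.mean(dim=1),           # e.g. output__1: FP32, [-1]
        }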

When I try to run an inference with the following script, I get the error below.

import asyncio
import time

import numpy as np
import tritonclient.grpc.aio as grpcclient

URL = "localhost:8001"

async def infer_torch():
    # Create gRPC stub for communicating with the server
    triton_client = grpcclient.InferenceServerClient(
        url=URL, verbose=False
    )

    model_name = "my_super_model"
    print(f"Running {model_name}...")

    # Build a random batch of inputs
    nb_objects = 600
    objects = np.random.rand(nb_objects, 160, 160, 3).astype(np.float32)

    m1_input = grpcclient.InferInput('input0', [ nb_objects, 160, 160, 3 ], "FP32")
    m1_input.set_data_from_numpy(objects)

    # Request only the first output tensor
    output = grpcclient.InferRequestedOutput("output__0")

    t1 = time.time()
    results = await triton_client.infer(
        model_name=model_name,
        inputs=[m1_input],
        outputs=[output],
    )
    print(f"Inference time: {time.time() - t1}s")

    statistics = await triton_client.get_inference_statistics(model_name=model_name)
    print(statistics)

    # Retrieve the requested output from the response
    output = results.get_output("output__0")

    print(f"output: {output}")

async def main():
    await asyncio.gather(infer_torch())

if __name__ == '__main__':
    asyncio.run(main())

Error:

PyTorch execute failure: output must be of type Tensor, List[str] or Tuple containing one of these two types. It should not be a List / Dictionary of Tensors or a Scalar

Triton Information

Image: nvcr.io/nvidia/tritonserver:24.09-py3

Expected behavior

I was expecting to get my output tensors from the dictionary by specifying the dictionary structure in the model configuration. I couldn't figure out from the documentation how to make this work.

Any idea how to make this work? 🙏

cesumilo commented 4 days ago

For anyone facing the same issue: the problem lies in the model's output. The PyTorch backend cannot handle models that return a dictionary of tensors. The solution I found was to wrap the model with the Python backend and convert the dictionary entries into the named output tensors; a sketch is below. Alternatively, changing the model's output format (for example, returning a tuple of tensors instead of a dictionary) would also resolve the issue.
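For reference, here is a minimal sketch of the Python backend wrapper (model.py). It assumes the wrapping model's configuration uses backend: "python" with the same inputs and outputs as above; the TorchScript path and the dictionary keys ("scores", "embeddings", "labels") are hypothetical placeholders to adapt to the real model.

import numpy as np
import torch
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the original TorchScript model that returns a dictionary of tensors.
        # The path is a placeholder; point it at the actual model file.
        self.model = torch.jit.load("model.pt")
        self.model.eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            input0 = pb_utils.get_input_tensor_by_name(request, "input0")
            batch = torch.from_numpy(input0.as_numpy())

            with torch.no_grad():
                out = self.model(batch)  # dict: str -> tensor

            # Flatten the dictionary into the named output tensors declared in config.pbtxt.
            output_tensors = [
                pb_utils.Tensor("output__1", out["scores"].cpu().numpy().astype(np.float32)),
                pb_utils.Tensor("output__4", out["embeddings"].cpu().numpy().astype(np.float32)),
                # TYPE_STRING outputs would be passed as numpy object arrays, e.g.:
                # pb_utils.Tensor("output__0", np.array(out["labels"], dtype=object)),
            ]
            responses.append(pb_utils.InferenceResponse(output_tensors=output_tensors))
        return responses

With a wrapper like this, the client script above should work unchanged, since the model name, inputs, and outputs stay the same.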