triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Error: is an ensemble of tensorrt + python_be + tensorrt supported on Jetson? #7667

Open olivetom opened 1 month ago

olivetom commented 1 month ago

My setup is:

  1. jetson orin 32GB
  2. JetPack 6.0
  3. Triton 2.40 (NGC Container 23.11)
  4. Cuda 12.2, TensorRT 8.6.2
  5. Python Backend API 1.16

When I run the ensemble, I get the following error:

input_0: try to use CUDA copy while GPU is not supported

This is somewhat similar to the following issues, which are still open: #4772 and #5578.

So my question is: do I have to build the Python backend with GPU support? And what is the easiest way to check whether my current Python backend on Jetson supports an ensemble of this kind?

Thanks in advance,

olivetom commented 1 month ago

@oandreeva-nv Hi Olga, should I rephrase this issue to get help?

oandreeva-nv commented 1 month ago

Hi @olivetom , thanks for your question. There are certain limitations for Jetson: https://github.com/triton-inference-server/server/blob/main/docs/user_guide/jetson.md

The Python backend does not support GPU Tensors and Async BLS.
CUDA IPC (shared memory) is not supported. System shared memory however is supported.

I believe you are hitting one of those. Could you please share your model.py for easier debugging?
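
Also, since the Python backend on Jetson cannot receive GPU tensors, it is worth making sure inputs to your Python model are delivered in CPU memory. A minimal sketch of the relevant setting in the Python model's config.pbtxt (this is the FORCE_CPU_ONLY_INPUT_TENSORS parameter documented in the python_backend README; CPU-only placement is the default, so this only matters if it has been switched to "no" somewhere in your model repository):

parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "yes"
  }
}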

oandreeva-nv commented 1 month ago

Could you please also provide some steps on how you are building Triton? Also, circling back to your question:

do I have to build python backend with GPU support?

Yes, I would recommend trying this as well. We also publish NGC iGPU containers. Is this something you would potentially consider?
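
For reference, a rough sketch of building the Python backend from source with GPU support enabled, following the build steps in the python_backend README (the r23.11 tag below is only an example matching your 23.11 container; substitute the branch for your release):

git clone https://github.com/triton-inference-server/python_backend -b r23.11
cd python_backend
mkdir build && cd build
cmake -DTRITON_ENABLE_GPU=ON \
      -DTRITON_BACKEND_REPO_TAG=r23.11 \
      -DTRITON_COMMON_REPO_TAG=r23.11 \
      -DTRITON_CORE_REPO_TAG=r23.11 \
      -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install ..
make install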

olivetom commented 1 month ago

Hi @oandreeva-nv, here is my model.py, which runs perfectly on x86_64 but fails on Jetson Orin. Below I have also shared my ensemble config.

from typing import List, Tuple
import numpy as np
import numpy.typing as npt
import torch
import triton_python_backend_utils as pb_utils
from torch.nn.functional import pad

class TritonPythonModel:
    def initialize(self, args) -> None:
        self.logger = pb_utils.Logger
        self.cuda = torch.cuda.is_available()
        self.logger.log_info(f": initialize: CUDA available: {self.cuda}")
        self.logger.log_info(f": initialize: multiensemble model args: {args}")
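        # NOTE: self.config is used below but is not set up in this snippet;
        # it is assumed to be built elsewhere (e.g. from the model parameters).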
        self.logger.log_info(f": initialize: Config: {self.config}")
        if not pb_utils.is_model_ready(
            model_name="multiemotion", model_version="1"
        ):  # this loads model latest version as well
            # Load the model from the model repository
            pb_utils.load_model(model_name="multiemotion")

    def execute(self, requests):  # type: ignore
        responses: List[pb_utils.InferenceResponse] = []

        for request in requests:
            images = pb_utils.get_input_tensor_by_name(request, "images").as_numpy()
            face_bboxes = pb_utils.get_input_tensor_by_name(
                request, "face_bboxes"
            ).as_numpy()
            face_scores = pb_utils.get_input_tensor_by_name(
                request, "face_scores"
            ).as_numpy()
            face_keypoints = pb_utils.get_input_tensor_by_name(
                request, "face_keypoints"
            ).as_numpy()
            face_classifications = pb_utils.get_input_tensor_by_name(
                request, "face_classifications"
            ).as_numpy()
            request_id: str = request.request_id()

            self.logger.log_info(
                f": execute: 'images' input shape={images.shape}"
            )
            self.logger.log_info(
                f": execute: 'face_bboxes' input shape={face_bboxes.shape}"
            )
            self.logger.log_info(
                f": execute: 'face_scores' input shape={face_scores.shape}"
            )
            self.logger.log_info(
                f": execute: 'face_keypoints' input shape={face_keypoints.shape}"
            )
            self.logger.log_info(
                f": execute: 'face_classifications' input shape={face_classifications.shape}"
            )
            self.logger.log_info(f": execute: REQUEST_ID={request_id}")

            self.batch_size: int = images.shape[0]
            self.logger.log_info(f": execute: Batch size={self.batch_size}")

            image_tensor: torch.Tensor = (
                torch.tensor(images, dtype=torch.float32)
                if self.cuda
                else torch.tensor(images, dtype=torch.float32).cpu()
            )
            bounding_boxes: torch.Tensor = (
                torch.tensor(face_bboxes, dtype=torch.float32)
                if self.cuda
                else torch.tensor(face_bboxes, dtype=torch.float32).cpu()
            )
            scores: torch.Tensor = (
                torch.tensor(face_scores, dtype=torch.float32)
                if self.cuda
                else torch.tensor(face_scores, dtype=torch.float32).cpu()
            )
            keypoints: torch.Tensor = (
                torch.tensor(face_keypoints, dtype=torch.float32)
                if self.cuda
                else torch.tensor(face_keypoints, dtype=torch.float32).cpu()
            )
            classifications: torch.Tensor = (
                torch.tensor(face_classifications, dtype=torch.int32)
                if self.cuda
                else torch.tensor(face_classifications, dtype=torch.int32).cpu()
            )

            (
                filtered_face_bboxes,
                filtered_face_scores,
                filtered_face_keypoints,
                filtered_face_classifications,
                filtered_face_auc,
                filtered_face_bec,
                filtered_face_valence,
                filtered_face_arousal,
            ) = self.process_faces(
                image_tensor=image_tensor,
                bounding_boxes=bounding_boxes,
                scores=scores,
                score_threshold=self.config.multitask_face_score_threshold,
                min_size=self.config.multitask_min_face_size,
                target_size=(self.config.emotions_width, self.config.emotions_height),
                save_dir=None,
                request_id=request_id,
                keypoints=keypoints,
                classifications=classifications,
            )

            self.logger.log_info(
                f": execute: filtered_face_bboxes shape={filtered_face_bboxes.shape}"
            )
            self.logger.log_info(
                f": execute: filtered_face_scores shape={filtered_face_scores.shape}"
            )
            self.logger.log_info(
                f": execute: filtered_face_keypoints shape={filtered_face_keypoints.shape}"
            )
            self.logger.log_info(
                f": execute: filtered_face_classifications shape={filtered_face_classifications.shape}"
            )
            self.logger.log_info(
                f": execute: filtered_face_aucs shape={filtered_face_auc.shape}"
            )
            self.logger.log_info(
                f": execute: filtered_face_becs shape={filtered_face_bec.shape}"
            )
            self.logger.log_info(
                f": execute: filtered_face_valences shape={filtered_face_valence.shape}"
            )
            self.logger.log_info(
                f": execute: filtered_face_arousals shape={filtered_face_arousal.shape}"
            )

            filtered_face_bboxes = pb_utils.Tensor(
                "filtered_face_bboxes", filtered_face_bboxes.numpy()
            )
            filtered_face_scores = pb_utils.Tensor(
                "filtered_face_scores", filtered_face_scores.numpy()
            )
            filtered_face_keypoints = pb_utils.Tensor(
                "filtered_face_keypoints", filtered_face_keypoints.numpy()
            )
            filtered_face_classifications = pb_utils.Tensor(
                "filtered_face_classifications", filtered_face_classifications.numpy()
            )
            filtered_face_aucs = pb_utils.Tensor(
                "filtered_face_aucs", filtered_face_auc.numpy()
            )
            filtered_face_becs = pb_utils.Tensor(
                "filtered_face_becs", filtered_face_bec.numpy()
            )
            filtered_face_valences = pb_utils.Tensor(
                "filtered_face_valences", filtered_face_valence.numpy()
            )
            filtered_face_arousals = pb_utils.Tensor(
                "filtered_face_arousals", filtered_face_arousal.numpy()
            )

            multiemotion_response = pb_utils.InferenceResponse(
                output_tensors=[
                    filtered_face_bboxes,
                    filtered_face_scores,
                    filtered_face_keypoints,
                    filtered_face_classifications,
                    filtered_face_aucs,
                    filtered_face_becs,
                    filtered_face_valences,
                    filtered_face_arousals,
                ]
            )
            responses.append(multiemotion_response)

        self.logger.log_info(f": Number of responses: {len(responses)}.")
        return responses

    def process_faces(
        self,
        image_tensor: torch.Tensor,  # [N, C, H, W] FP32
        bounding_boxes: torch.Tensor,  # [N, num_boxes, 4] in (x1, y1, x2, y2) format
        scores: torch.Tensor,  # [N, num_boxes] FP32
        score_threshold: float = 0.95,
        min_size: int = 75,
        target_size: Tuple[int, int] = (100, 100),
        save_dir: str = "/tmp/saved_faces",  # Directory to save the images
        request_id: str = "unique_id",
        keypoints: torch.Tensor = None,
        classifications: torch.Tensor = None,
    ) -> Tuple[torch.Tensor, ...]:
        N, C, H, W = image_tensor.shape
        all_filtered_bboxes = [[] for _ in range(N)]
        all_filtered_scores = [[] for _ in range(N)]
        all_filtered_keypoints = [[] for _ in range(N)]
        all_filtered_classifications = [[] for _ in range(N)]
        all_filtered_auc = [[] for _ in range(N)]
        all_filtered_bec = [[] for _ in range(N)]
        all_filtered_valence = [[] for _ in range(N)]
        all_filtered_arousal = [[] for _ in range(N)]
        all_cropped_faces = [[] for _ in range(N)]

        self.logger.log_info(
            f": process_faces: image_tensor shape={image_tensor.shape}"
        )
        self.logger.log_info(
            f": process_faces: bounding_bboxes shape={bounding_boxes.shape}"
        )
        self.logger.log_info(f": process_faces: scores shape={scores.shape}")
        self.logger.log_info(
            f": process_faces: keypoints shape={keypoints.shape}"
        )
        self.logger.log_info(
            f": process_faces: classifications shape={classifications.shape}"
        )

        total_cropped_faces = 0
        for i in range(N):
            image = image_tensor[i]  # [C, H, W]
            max_faces = bounding_boxes.shape[1]
            boxes = bounding_boxes[i]  # [num_boxes, 4]
            scores_i = scores[i].squeeze(-1)  # [num_boxes]
            keypoints_i = keypoints[i]  # [num_boxes, 5, 3]
            classifications_i = classifications[i]  # [num_boxes]

            # 1. Filter out bounding boxes based only on score and size
            score_mask = scores_i >= score_threshold
            widths = boxes[:, 2] - boxes[:, 0]
            heights = boxes[:, 3] - boxes[:, 1]
            size_mask = (widths >= min_size) & (heights >= min_size)
            final_mask = score_mask & size_mask

            boxes = boxes[final_mask]
            scores_i = scores_i[final_mask].unsqueeze(-1)  # [num_boxes, 1]
            keypoints_i = keypoints_i[final_mask]
            classifications_i = classifications_i[final_mask]

            # 2. Clip bounding boxes to image boundaries
            boxes[:, [0, 2]] = boxes[:, [0, 2]].clamp(0, W)
            boxes[:, [1, 3]] = boxes[:, [1, 3]].clamp(0, H)

            # 3. Crop the faces from the images based on the bounding boxes
            for box in boxes:
                x1, y1, x2, y2 = box.int()
                face = image[:, y1:y2, x1:x2]  # Crop the face

                # 4. Resize the cropped faces to a target size with padding
                face_resized = self.resize_with_padding(face, target_size)
                all_cropped_faces[i].append(face_resized)
                total_cropped_faces += 1

            # Stack the processed faces for the current image
            if all_cropped_faces[i]:
                self.logger.log_info(
                    f": process_faces: cropped_faces length={len(all_cropped_faces[i])}"
                )
                faces_tensor: torch.Tensor = (
                    torch.stack(all_cropped_faces[i])
                    if self.cuda
                    else torch.stack(all_cropped_faces[i]).cpu()
                )
                self.logger.log_info(
                    f": process_faces: faces_tensor shape={faces_tensor.shape}"
                )

                auc, bec, valence, arousal = self.get_emotion_tensors(
                    faces_tensor=faces_tensor,
                    request_id=request_id,
                    max_batch_size=self.config.emotions_max_batch_size,
                )

                # Append the results to the lists
                all_filtered_bboxes[i].extend(boxes)
                all_filtered_scores[i].extend(scores_i)
                all_filtered_keypoints[i].extend(keypoints_i)
                all_filtered_classifications[i].extend(classifications_i)
                all_filtered_auc[i].extend(auc)
                all_filtered_bec[i].extend(bec)
                all_filtered_valence[i].extend(valence)
                all_filtered_arousal[i].extend(arousal)

            current_faces = len(all_cropped_faces[i])
            current_bboxes = len(all_filtered_bboxes[i])
            assert (
                current_faces == current_bboxes
            ), f"current_faces={current_faces} != current_bboxes={current_bboxes}"
            self.logger.log_info(
                f": process_faces: current_faces={current_faces}"
            )
            padding = max_faces - current_faces
            do_padding = padding < max_faces and padding > 0
            self.logger.log_info(f": process_faces: padding={padding}")
            self.logger.log_info(f": process_faces: do_padding={do_padding}")
            if do_padding:
                self.logger.log_info(
                    f": process_faces: all_filtered_bboxes[i] len= {len(all_filtered_bboxes[i])}"
                )
                self.logger.log_info(
                    f": process_faces: all_filtered_bboxes[i] = {all_filtered_bboxes[i]}"
                )
                all_filtered_bboxes[i] = torch.stack(
                    [_ for _ in all_filtered_bboxes[i]]
                )
                self.logger.log_info(
                    f": process_faces: tensor all_filtered_bboxes[i] size={all_filtered_bboxes[i].size()}"
                )

                self.logger.log_info(
                    f": process_faces: all_filtered_scores[i] = {all_filtered_scores[i]}"
                )
                all_filtered_scores[i] = torch.stack(
                    [_ for _ in all_filtered_scores[i]]
                )
                self.logger.log_info(
                    f": process_faces: tensor all_filtered_scores[i] size={all_filtered_scores[i].size()}"
                )

                self.logger.log_info(
                    f": process_faces: all_filtered_keypoints[i] = {all_filtered_keypoints[i]}"
                )
                all_filtered_keypoints[i] = torch.stack(
                    [_ for _ in all_filtered_keypoints[i]]
                )
                self.logger.log_info(
                    f": process_faces: tensor all_filtered_keypoints[i] size={all_filtered_keypoints[i].size()}"
                )

                self.logger.log_info(
                    f": process_faces: all_filtered_classifications[i] = {all_filtered_classifications[i]}"
                )
                all_filtered_classifications[i] = torch.stack(
                    [_ for _ in all_filtered_classifications[i]]
                )
                self.logger.log_info(
                    f": process_faces: tensor all_filtered_classifications[i] size={all_filtered_classifications[i].size()}"
                )

                all_filtered_auc[i] = torch.tensor(np.stack(all_filtered_auc[i]))
                all_filtered_bec[i] = torch.tensor(np.stack(all_filtered_bec[i]))
                all_filtered_valence[i] = torch.tensor(
                    np.stack(all_filtered_valence[i])
                )
                all_filtered_arousal[i] = torch.tensor(
                    np.stack(all_filtered_arousal[i])
                )

                all_filtered_bboxes[i] = pad(
                    all_filtered_bboxes[i], (0, 0, 0, padding), value=0.0
                )
                all_filtered_scores[i] = pad(
                    all_filtered_scores[i], (0, 0, 0, padding), value=0.0
                )
                all_filtered_keypoints[i] = pad(
                    all_filtered_keypoints[i], (0, 0, 0, 0, 0, padding), value=0.0
                )
                all_filtered_classifications[i] = pad(
                    all_filtered_classifications[i], (0, padding), value=0
                )
                all_filtered_auc[i] = pad(
                    all_filtered_auc[i], (0, 0, 0, padding), value=0.0
                )
                all_filtered_bec[i] = pad(
                    all_filtered_bec[i], (0, 0, 0, padding), value=0.0
                )
                all_filtered_valence[i] = pad(
                    all_filtered_valence[i], (0, 0, 0, padding), value=0.0
                )
                all_filtered_arousal[i] = pad(
                    all_filtered_arousal[i], (0, 0, 0, padding), value=0.0
                )
            else:
                all_filtered_bboxes[i] = torch.zeros(
                    (max_faces, 4), dtype=torch.float32
                )
                all_filtered_scores[i] = torch.zeros(
                    (max_faces, 1), dtype=torch.float32
                )
                all_filtered_keypoints[i] = torch.zeros(
                    (max_faces, 5, 3), dtype=torch.float32
                )
                all_filtered_classifications[i] = torch.zeros(
                    (max_faces), dtype=torch.int32
                )
                all_filtered_auc[i] = torch.zeros((max_faces, 8), dtype=torch.float32)
                all_filtered_bec[i] = torch.zeros((max_faces, 7), dtype=torch.float32)
                all_filtered_valence[i] = torch.zeros(
                    (max_faces, 1), dtype=torch.float32
                )
                all_filtered_arousal[i] = torch.zeros(
                    (max_faces, 1), dtype=torch.float32
                )

        [
            self.logger.log_info(
                f": process_faces: all_filtered_bboxes[{i}] shape={all_filtered_bboxes[i].shape}"
            )
            for i in range(N)
        ]
        [
            self.logger.log_info(
                f": process_faces: all_filtered_scores[{i}] shape={all_filtered_scores[i].shape}"
            )
            for i in range(N)
        ]
        [
            self.logger.log_info(
                f": process_faces: all_filtered_keypoints[{i}] shape={all_filtered_keypoints[i].shape}"
            )
            for i in range(N)
        ]
        [
            self.logger.log_info(
                f": process_faces: all_filtered_classifications[{i}] shape={all_filtered_classifications[i].shape}"
            )
            for i in range(N)
        ]
        [
            self.logger.log_info(
                f": process_faces: all_filtered_auc[{i}] shape={all_filtered_auc[i].shape}"
            )
            for i in range(N)
        ]
        [
            self.logger.log_info(
                f": process_faces: all_filtered_bec[{i}] shape={all_filtered_bec[i].shape}"
            )
            for i in range(N)
        ]
        [
            self.logger.log_info(
                f": process_faces: all_filtered_valence[{i}] shape={all_filtered_valence[i].shape}"
            )
            for i in range(N)
        ]
        [
            self.logger.log_info(
                f": process_faces: all_filtered_arousal[{i}] shape={all_filtered_arousal[i].shape}"
            )
            for i in range(N)
        ]

        padded_bboxes = torch.stack(all_filtered_bboxes)
        padded_scores = torch.stack(all_filtered_scores)
        padded_keypoints = torch.stack(all_filtered_keypoints)
        padded_classifications = torch.stack(all_filtered_classifications)
        padded_auc = torch.stack(all_filtered_auc)
        padded_bec = torch.stack(all_filtered_bec)
        padded_valence = torch.stack(all_filtered_valence)
        padded_arousal = torch.stack(all_filtered_arousal)

        # all_filtered* is a list of lists where each sublist contains the bboxes, etc for a single image in the batch
        # all_filtered_bboxes = [[bboxes11, bboxes12, ...], [bboxes21, bboxes22, bboxes23, ...], ...]
        # all_filtered_scores = [[scores11, scores12, ...], [scores21, scores22, scores23, ...], ...]
        # an so on for all 6 other lists (keypoints, classifications, auc, bec, valence, arousal)
        # we should pad each sublist to the same length to create 8 tensors with shapes
        # for bboxes: (N, 128, 4)
        # for scores: (N, 128, 1)
        # for keypoints: (N, 128, 5, 3)
        # for classifications: (N, 128) of int32
        # for auc: (N, 128, 8)
        # for bec: (N, 128, 7)
        # for valence: (N, 128, 1)
        # for arousal: (N, 128, 1)
        # where N is the batch size and 128 is the maximum number of faces in each image of the batch

        # Ensure the padded shapes match the expected dimensions
        assert padded_bboxes.shape == (
            N,
            max_faces,
            4,
        ), f"padded_bboxes.shape={padded_bboxes.shape} != (N, 128, 4)"
        assert padded_scores.shape == (
            N,
            max_faces,
            1,
        ), f"padded_scores.shape={padded_scores.shape} != (N, 128, 1)"
        assert padded_keypoints.shape == (
            N,
            max_faces,
            5,
            3,
        ), f"padded_keypoints.shape={padded_keypoints.shape} != (N, 128, 5, 3)"
        assert padded_classifications.shape == (
            N,
            max_faces,
        ), f"padded_classifications.shape={padded_classifications.shape} != (N, 128)"
        assert padded_auc.shape == (
            N,
            max_faces,
            8,
        ), f"padded_auc.shape={padded_auc.shape} != (N, 128, 8)"
        assert padded_bec.shape == (
            N,
            max_faces,
            7,
        ), f"padded_bec.shape={padded_bec.shape} != (N, 128, 7)"
        assert padded_valence.shape == (
            N,
            max_faces,
            1,
        ), f"padded_valence.shape={padded_valence.shape} != (N, 128, 1)"
        assert padded_arousal.shape == (
            N,
            max_faces,
            1,
        ), f"padded_arousal.shape={padded_arousal.shape} != (N, 128, 1)"

        return (
            padded_bboxes,
            padded_scores,
            padded_keypoints,
            padded_classifications,
            padded_auc,
            padded_bec,
            padded_valence,
            padded_arousal,
        )

    def get_emotion_tensors(
        self,
        faces_tensor: torch.Tensor,
        max_batch_size: int = 12,
        request_id: str = "unique_id",
    ) -> Tuple[npt.NDArray[np.float32], ...]:
        num_faces = faces_tensor.shape[0]
        self.logger.log_info(
            f": get_emotion_tensors: faces_tensor shape={faces_tensor.shape}"
        )
        self.logger.log_info(
            f": get_emotion_tensors: faces_tensor device={faces_tensor.device}"
        )

        auc_output: npt.NDArray[np.float32] = np.zeros((num_faces, 8), dtype=np.float32)
        bec_output: npt.NDArray[np.float32] = np.zeros((num_faces, 7), dtype=np.float32)
        valence_output: npt.NDArray[np.float32] = np.zeros((num_faces, 1), dtype=np.float32)
        arousal_output: npt.NDArray[np.float32] = np.zeros((num_faces, 1), dtype=np.float32)

        # Process the faces in batches, honoring the max_batch_size
        for i, start_idx in enumerate(range(0, num_faces, max_batch_size)):
            end_idx = min(start_idx + max_batch_size, num_faces)
            inference_request = pb_utils.InferenceRequest(
                request_id=request_id + str(i),
                model_name="multiemotion",
                requested_output_names=[
                    "action_units_classifications",
                    "basic_emotions_classifications",
                    "valence",
                    "arousal",
                ],
                inputs=[
                    pb_utils.Tensor("images", faces_tensor[start_idx:end_idx].numpy())
                ],
                preferred_memory=pb_utils.PreferredMemory(
                    pb_utils.TRITONSERVER_MEMORY_CPU, 0
                ),
            )

            inference_response: pb_utils.InferenceResponse
            inference_response = inference_request.exec()
            # Check if the inference response has an error
            if inference_response.has_error():
                raise pb_utils.TritonModelException(
                    inference_response.error().message()
                )

            # accumulate results for all requested outputs
            auc_output[start_idx:end_idx] = pb_utils.get_output_tensor_by_name(
                inference_response, "action_units_classifications"
            ).as_numpy()
            bec_output[start_idx:end_idx] = pb_utils.get_output_tensor_by_name(
                inference_response, "basic_emotions_classifications"
            ).as_numpy()
            valence_output[start_idx:end_idx] = pb_utils.get_output_tensor_by_name(
                inference_response, "valence"
            ).as_numpy()
            arousal_output[start_idx:end_idx] = pb_utils.get_output_tensor_by_name(
                inference_response, "arousal"
            ).as_numpy()
        self.logger.log_info(
            f": get_emotion_tensors: AUC output shape={auc_output.shape}"
        )
        self.logger.log_info(
            f": get_emotion_tensors: BEC output shape={bec_output.shape}"
        )
        self.logger.log_info(
            f": get_emotion_tensors: Valence output shape={valence_output.shape}"
        )
        self.logger.log_info(
            f": get_emotion_tensors: Arousal output shape={arousal_output.shape}"
        )
        return auc_output, bec_output, valence_output, arousal_output

And here is my ensemble config:

name: "multiensemble"
platform: "ensemble"
max_batch_size: 12
parameters [
  { 
    key: "FORCE_CPU_ONLY_INPUT_TENSORS"
    value: {string_value: "no"}
  }
]
# multiensemble input = multitask input
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [3, 405, 720]
  }
]

output [
  {
      name: "filtered_face_bboxes",
      data_type: TYPE_FP32,
      dims: [128, 4],
  },
  {
      name: "filtered_face_scores",
      data_type: TYPE_FP32,
      dims: [128, 1],
  },
  {
      name: "filtered_face_keypoints",
      data_type: TYPE_FP32,
      dims: [128, 5, 3],
  },
  {
      name: "filtered_face_classifications",
      data_type: TYPE_INT32,
      dims: [128],
  },
  {
      name: "filtered_face_aucs",
      data_type: TYPE_FP32,
      dims: [128, 8],
  },
  {
      name: "filtered_face_becs",
      data_type: TYPE_FP32,
      dims: [128, 7],
  },
  {
      name: "filtered_face_valences",
      data_type: TYPE_FP32,
      dims: [128, 1],
  },
  {
      name: "filtered_face_arousals",
      data_type: TYPE_FP32,
      dims: [128, 1],
  },
  {
      name: "body_bboxes",
      data_type: TYPE_FP32,
      dims: [128, 4],
  },
  {
      name: "body_joints_keypoints",
      data_type: TYPE_FP32,
      dims: [128, 12, 3],
  },
  {
      name: "body_scores",
      data_type: TYPE_FP32,
      dims: [128, 1],
  },
  {
      name: "body_action_scores",
      data_type: TYPE_FP32,
      dims: [128, 5],
  },
  {
      name: "body_fall_scores",
      data_type: TYPE_FP32,
      dims: [128, 2],
  },
  {
      name: "body_classifications",
      data_type: TYPE_INT32,
      dims: [128],
  },
  {
      name: "body_type_scores",
      data_type: TYPE_FP32,
      dims: [128, 3],
  },
  {
      name: "furniture_bboxes",
      data_type: TYPE_FP32,
      dims: [128, 4],
  },
  {
      name: "furniture_keypoints",
      data_type: TYPE_FP32,
      dims: [128, 8, 3],
  },
  {
      name: "furniture_scores",
      data_type: TYPE_FP32,
      dims: [128, 2],
  },
  {
      name: "furniture_classifications",
      data_type: TYPE_INT32,
      dims: [128],
  }
]

ensemble_scheduling {
  step [
  # multitask
    {
      model_name: "multitask"
      model_version: -1
      input_map {
        key: "images"
        value: "images"
      }
      output_map {
        key: "face_bboxes"
        value: "face_bboxes"
      }
      output_map {
        key: "face_scores"
        value: "face_scores"
      }
      output_map {
        key: "face_keypoints"
        value: "face_keypoints"
      }
      output_map {
        key: "face_classifications"
        value: "face_classifications"
      }
      output_map {
        key: "body_bboxes"
        value: "body_bboxes"
      }
      output_map {
        key: "body_joints_keypoints"
        value: "body_joints_keypoints"
      }
      output_map {
        key: "body_scores"
        value: "body_scores"
      }
      output_map {
        key: "body_action_scores"
        value: "body_action_scores"
      }
      output_map {
        key: "body_fall_scores"
        value: "body_fall_scores"
      }
      output_map {
        key: "body_classifications"
        value: "body_classifications"
      }
      output_map {
        key: "body_type_scores"
        value: "body_type_scores"
      }
      output_map {
        key: "furniture_bboxes"
        value: "furniture_bboxes"
      }
      output_map {
        key: "furniture_keypoints"
        value: "furniture_keypoints"
      }
      output_map {
        key: "furniture_scores"
        value: "furniture_scores"
      }
      output_map {
        key: "furniture_classifications"
        value: "furniture_classifications"
      }
    },
  # multiemotion_preprocess
    {
      model_name: "multiemotion_preprocess"
      model_version: -1
      input_map {
        key: "images"
        value: "images"
      }
      input_map {
        key: "face_bboxes"
        value: "face_bboxes"
      }
      input_map {
        key: "face_scores"
        value: "face_scores"
      }
      input_map {
        key: "face_keypoints"
        value: "face_keypoints"
      }
      input_map {
        key: "face_classifications"
        value: "face_classifications"
      }
      #_________ OUTPUTS ___________#
      output_map {
        key: "filtered_face_bboxes"
        value: "filtered_face_bboxes"
      }
      output_map {
        key: "filtered_face_scores"
        value: "filtered_face_scores"
      }
      output_map {
        key: "filtered_face_keypoints"
        value: "filtered_face_keypoints"
      }
      output_map {
        key: "filtered_face_classifications"
        value: "filtered_face_classifications"
      }
      output_map {
        key: "filtered_face_aucs"
        value: "filtered_face_aucs"
      }
      output_map {
        key: "filtered_face_becs"
        value: "filtered_face_becs"
      }
      output_map {
        key: "filtered_face_arousals"
        value: "filtered_face_arousals"
      }
      output_map {
        key: "filtered_face_valences"
        value: "filtered_face_valences"
      }
    }
  ]
}

olivetom commented 1 month ago

Could you please also provide some steps on how you are building Triton? Also, circling back to your question:

do I have to build python backend with GPU support?

Yes, I would recommend trying this as well. We also publish NGC iGPU containers. Is this something you would potentially consider?

I'm using the vanilla DeepStream Triton NGC container nvcr.io/nvidia/deepstream:6.4-triton-multiarch, so I would guess this Triton build should support Jetson out of the box...

oandreeva-nv commented 1 month ago

I'm not sure what the DeepStream container supports, to be honest. Could you try nvcr.io/nvidia/tritonserver:24.09-py3-igpu?

olivetom commented 1 month ago

@oandreeva-nv what if I use the Python backend DLPack support, as described here (https://github.com/triton-inference-server/python_backend/blob/main/README.md#interoperability-and-gpu-support), instead of my current implementation:

images = pb_utils.get_input_tensor_by_name(request, "images").as_numpy()
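
Something like this is what I have in mind (based on the README's interoperability example; the get_torch_input helper name is mine, and I understand the Jetson limitation on GPU tensors would still apply, so this would mainly avoid the extra numpy copy):

import torch
from torch.utils.dlpack import from_dlpack, to_dlpack
import triton_python_backend_utils as pb_utils


def get_torch_input(request, name):
    # View a Triton input tensor as a torch.Tensor via DLPack,
    # without an intermediate numpy copy.
    tensor = pb_utils.get_input_tensor_by_name(request, name)
    return from_dlpack(tensor.to_dlpack())


# Inside execute(), e.g.:
#   images = get_torch_input(request, "images")
# and on the output side:
#   out = pb_utils.Tensor.from_dlpack("filtered_face_bboxes", to_dlpack(result))
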
olivetom commented 1 month ago

I'm unaware about what deepstream container support, to be honest. Could you try nvcr.io/nvidia/tritonserver:24.09-py3-igpu ?

Will try and let you know the results. Thanks

olivetom commented 1 month ago

Hi @oandreeva-nv,

Using nvcr.io/nvidia/tritonserver:24.09-py3-igpu throws the same error: input_0: try to use CUDA copy while GPU is not supported

So, effectively, the Triton Python backend on Jetson doesn't support GPU tensors (which is hard to swallow). But the following link says the contrary, unless it applies only to x86_64: https://github.com/triton-inference-server/python_backend/blob/main/README.md#input-tensor-device-placement

Sadly, it's confusing... at least for me... :cry:

oandreeva-nv commented 1 month ago

Hi @olivetom , apologies for the confusion. We'll do our best to update the documentation. For Jetson we have a dedicated page: https://github.com/triton-inference-server/server/blob/main/docs/user_guide/jetson.md

olivetom commented 1 month ago

@oandreeva-nv

Hi @olivetom , apologies for the confusion. We'll do our best to update the documentation. For Jetson we have a dedicated page: https://github.com/triton-inference-server/server/blob/main/docs/user_guide/jetson.md

@oandreeva-nv these Jetson docs refer to JetPack 5.0 from August 2022... Is there an updated version for JetPack 6.0?

oandreeva-nv commented 1 month ago

I believe everything still stands for JetPack 6.0 as well, but @nv-kmcgill53 may correct me if I'm wrong.