triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Tritonserver hangs on launch with python backend #7268

Closed JamesBowerXanda closed 1 week ago

JamesBowerXanda commented 1 month ago

Description

I am trying to run the Triton server with CPU-only models. The server launches perfectly when the model repository contains only ONNX models, but the moment I include a Python backend model it hangs on launch indefinitely.

I am using an Apple M2 Mac.

It is worth noting that the model runs when I use the SageMaker Triton Server image on a SageMaker multi-model endpoint.

Triton Information

Version? 23.02, although I have also tried 24.04.

Are you using the Triton container or did you build it yourself? Container. Specifically nvcr.io/nvidia/tritonserver:23.02-py3.

To Reproduce

  1. Pull nvcr.io/nvidia/tritonserver:23.02-py3
  2. Create the model_repository with a Python backend model.
  3. Launch the docker container without starting the server using docker run -it -p8000:8000 -p8001:8001 -p8002:8002 -v/Users/jamesbower/Projects/triton-local/model_repository:/models nvcr.io/nvidia/tritonserver:23.02-py3 /bin/bash
  4. Install the Python packages as system packages: pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu and pip install --no-cache-dir numpy.
  5. cd to the directory where the models directory is located.
  6. Run tritonserver --model-repository models/

Output

The following is displayed:

W0524 08:06:50.694589 82 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: no CUDA-capable device is detected
I0524 08:06:50.694772 82 cuda_memory_manager.cc:115] CUDA memory pool disabled
I0524 08:06:50.705575 82 model_lifecycle.cc:459] loading: baai_quant_onnx:1
I0524 08:06:50.706361 82 model_lifecycle.cc:459] loading: forced_alignment:1
I0524 08:06:50.707123 82 model_lifecycle.cc:459] loading: titanet_small_onnx:1
I0524 08:06:50.707594 82 onnxruntime.cc:2459] TRITONBACKEND_Initialize: onnxruntime
I0524 08:06:50.707624 82 onnxruntime.cc:2469] Triton TRITONBACKEND API version: 1.11
I0524 08:06:50.707628 82 onnxruntime.cc:2475] 'onnxruntime' TRITONBACKEND API version: 1.11
I0524 08:06:50.707632 82 onnxruntime.cc:2505] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0524 08:06:50.718814 82 onnxruntime.cc:2563] TRITONBACKEND_ModelInitialize: baai_quant_onnx (version 1)
I0524 08:06:50.719420 82 onnxruntime.cc:666] skipping model configuration auto-complete for 'baai_quant_onnx': inputs and outputs already specified

It just hangs here indefinitely.

Expected behavior

The Triton server launches completely, such that curl -v localhost:8000/v2/health/ready receives a 200 status response.
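
For reference, a readiness poll equivalent to the curl check (a minimal sketch using the tritonclient HTTP API; it assumes the tritonclient[http] package is installed on the host):

import time
import tritonclient.http as httpclient

# Poll the readiness endpoint until the server reports ready or we give up.
client = httpclient.InferenceServerClient(url="localhost:8000")
for _ in range(60):
    try:
        if client.is_server_ready():
            print("server is ready")
            break
    except Exception:
        pass  # server not yet accepting connections
    time.sleep(1)
else:
    print("server never became ready")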

Model Repository Setup

The structure of the model repository is:

model_repository/
|
|--baai_quant_onnx
|  |--1
|  |  |--model.onnx
|  |  |--labels.json
|  |  |--wav2vec2_asr_base_960h.pt
|  |--config.pbtxt
|
|--titanet_small_onnx
|  |--1
|  |  |--model.onnx
|  |--config.pbtxt
|
|--forced_alignment
|  |--1
|  |  |--model.py
|  |--config.pbtxt

I am not using a conda-packed execution environment, since I install the required packages in the container after launching it. I have also tried with a conda-packed environment, though, which is the method I used with the SageMaker Triton Server image.

The model.py file is:

import triton_python_backend_utils as pb_utils
import numpy as np
import json
import torch
import re
import os
from dataclasses import dataclass

class TritonPythonModel:

    def initialize(self, args):
        self.word_output_type = pb_utils.triton_string_to_numpy(
            pb_utils.get_output_config_by_name(json.loads(args["model_config"]), "word")["data_type"]
        )
        self.start_time_output_type = pb_utils.triton_string_to_numpy(
            pb_utils.get_output_config_by_name(json.loads(args["model_config"]), "start_time")["data_type"]
        )
        self.end_time_output_type = pb_utils.triton_string_to_numpy(
            pb_utils.get_output_config_by_name(json.loads(args["model_config"]), "end_time")["data_type"]
        )

        model_repository = args["model_repository"]
        wav2vec2_path = os.path.join(model_repository,"1","wav2vec2_asr_base_960h.pt")
        labels_path = os.path.join(model_repository,"1","labels.json")

        self.wav2vecmodel = torch.jit.load(wav2vec2_path).eval()
        with open(labels_path, "r") as f:
            self.labels = tuple(json.load(f))

    def execute(self, requests):

        responses = []

        for request in requests:
            transcription = pb_utils.get_input_tensor_by_name(request, "transcription").as_numpy().squeeze(1).astype(str)
            audio = pb_utils.get_input_tensor_by_name(request, "audio").as_numpy()            

            # Optional preprocessing code for inputs in standard Python...
            transcription = transcription.tolist()[0]
            transcript = convert_to_transcript(transcription)

            with torch.inference_mode():
                waveform = torch.tensor(audio, dtype=torch.float32)
                emissions = calculate_emissions(self.wav2vecmodel, waveform)

            emissions = emissions[0].cpu().detach().numpy()
            waveform = waveform.cpu().detach().numpy()
            dictionary = {c: i for i, c in enumerate(self.labels)}
            tokens = [dictionary[c] for c in transcript]
            ratio = waveform.shape[1] / emissions.shape[0]

            trellis = get_trellis_numba(emissions, np.array(tokens))

            path = backtrack_numba(trellis, emissions, tokens)
            path = [Point(*p) for p in path]
            segments = merge_repeats(path, transcript, ratio)
            word_segments = merge_words(segments)

            words = []
            start_times = []
            end_times = []

            for segment in word_segments:
                words.append(segment.label)
                start_times.append(segment.start_time)
                end_times.append(segment.end_time)

            words = np.array([words]).astype(self.word_output_type)
            start_times = np.array([start_times]).astype(self.start_time_output_type)
            end_times = np.array([end_times]).astype(self.end_time_output_type)

            output_tensor_words = pb_utils.Tensor("word", words)
            output_tensor_start_times = pb_utils.Tensor("start_time", start_times)
            output_tensor_end_times = pb_utils.Tensor("end_time", end_times)

            response = pb_utils.InferenceResponse(
                output_tensors=[output_tensor_words, output_tensor_start_times, output_tensor_end_times]
            )

            responses.append(response)

        return responses

    # Any cleanup code to be used when the model is unloaded. Not completely sure of the degree to which this is required currently.
    def finalize(self):
        print("Finalizing model...")

def convert_to_transcript(text: str):
    text = text.upper().strip()
    text = re.sub(r"[^A-Z0-9\s]", "", text)
    text = "|" + re.sub(r"\s+", "|", text) + "|"
    return text    

def calculate_emissions(wav2vecmodel, waveform) -> torch.Tensor:
    emissions, _ = wav2vecmodel(waveform)
    emissions = torch.log_softmax(emissions, dim=-1)
    return emissions

def get_trellis_numba(emission, tokens, blank_id=0):
    num_frame = emission.shape[0]
    num_tokens = len(tokens)

    trellis = np.zeros((num_frame, num_tokens))
    trellis[1:,0] = np.cumsum(emission[1:, blank_id])
    trellis[0, 1:] = -np.inf
    trellis[-num_tokens + 1:, 0] = np.inf

    for t in range(num_frame - 1):
        trellis[t + 1, 1:] = np.maximum(
            # Score for staying at the same token
            trellis[t, 1:] + emission[t, blank_id],
            # Score for changing to the next token
            trellis[t, :-1] + emission[t, tokens[1:]],
        )

    return trellis

def backtrack_numba(trellis, emission, tokens, blank_id=0):
    t, j = trellis.shape[0] - 1, trellis.shape[1] - 1

    path = [(j, t, np.exp(emission[t, blank_id]))]
    while j > 0:
        assert t > 0  # Should not happen but just in case

        # 1. Figure out if the current position was stay or change
        # Frame-wise score of stay vs change
        p_stay = emission[t - 1, blank_id]
        p_change = emission[t - 1, tokens[j]]

        # Context-aware score for stay vs change
        stayed = trellis[t - 1, j] + p_stay
        changed = trellis[t - 1, j - 1] + p_change

        # Update position
        t -= 1
        if changed > stayed:
            j -= 1

        # Store the path with frame-wise probability.
        prob = np.exp(p_change if changed > stayed else p_stay)
        path.append((j, t, prob))

    # Now j == 0, which means it reached the SoS (Start of Sequence).
    # Fill up the rest for the sake of visualization
    while t > 0:
        prob = np.exp(emission[t - 1, blank_id])
        path.append((j, t - 1, prob))
        t -= 1

    return path[::-1]

@dataclass
class Point:
    token_index: int
    time_index: int
    score: float

@dataclass
class Segment:
    label: str
    start: int
    end: int
    score: float
    start_time: float = 0.0
    end_time: float = 0.0

    def __repr__(self):
        return f"{self.label}\t({self.score:4.2f}): [{self.start:5d}, {self.end:5d}), {self.start_time:.2f}s"

    @property
    def length(self):
        return self.end - self.start

    def json(self):
        return {
            "label": self.label,
            "start_time": self.start_time,
            "end_time": self.end_time,
        }

def merge_repeats(path, transcript, ratio):
    i1, i2 = 0, 0
    segments = []
    while i1 < len(path):
        while i2 < len(path) and path[i1].token_index == path[i2].token_index:
            i2 += 1
        score = sum(path[k].score for k in range(i1, i2)) / (i2 - i1)
        segments.append(
            Segment(
                transcript[path[i1].token_index],
                path[i1].time_index,
                path[i2 - 1].time_index + 1,
                score,
                ratio * path[i1].time_index / 16_000,
                ratio * (path[i2 - 1].time_index + 1) / 16_000
            )
        )
        i1 = i2
    return segments

def merge_words(segments, separator="|"):
    words = []
    i1, i2 = 0, 0
    while i1 < len(segments):
        if i2 >= len(segments) or segments[i2].label == separator:
            if i1 != i2:
                segs = segments[i1:i2]
                word = "".join([seg.label for seg in segs])
                score = sum(seg.score * seg.length for seg in segs) / sum(seg.length for seg in segs)
                words.append(Segment(word, segments[i1].start, segments[i2 - 1].end, score, segments[i1].start_time,
                                     segments[i2 - 1].end_time))
            i1 = i2 + 1
            i2 = i1
        else:
            i2 += 1
    return words

The config.pbtxt is:

name: "forced_alignment"
backend: "python"
max_batch_size: 1
input [
  {
    name: "transcription"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "audio"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
output [
  {
    name: "word"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "start_time"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "end_time"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
instance_group {
  count: 1
  kind: KIND_CPU
}

The execution environment is not set, as I install the required packages in the container.
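
For reference, a client call matching this config would look roughly like the following sketch (it assumes the tritonclient[http] package and a server reachable on localhost:8000; the input shapes carry the leading batch dimension implied by max_batch_size: 1):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy inputs; shapes include the batch dimension added by max_batch_size: 1.
transcription = np.array([["hello world"]], dtype=object)   # TYPE_STRING, dims [ 1 ]
audio = np.zeros((1, 16_000), dtype=np.float32)             # TYPE_FP32, dims [ -1 ]

inputs = [
    httpclient.InferInput("transcription", list(transcription.shape), "BYTES"),
    httpclient.InferInput("audio", list(audio.shape), "FP32"),
]
inputs[0].set_data_from_numpy(transcription)
inputs[1].set_data_from_numpy(audio)

result = client.infer("forced_alignment", inputs)
print(result.as_numpy("word"), result.as_numpy("start_time"), result.as_numpy("end_time"))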

krishung5 commented 1 month ago

Hi @JamesBowerXanda, Triton doesn't officially support Mac, but I assume it would work if you are only running CPU-only models. I couldn't reproduce the hang on a Linux machine. Since I don't have the wav2vec2_asr_base_960h.pt and labels.json files, I replaced wav2vec2_asr_base_960h.pt with another model.pt and removed the line for labels.json; Triton did not hang on my side. Could you run the server with --log-verbose=1 and see if any errors are reported in the log?

I also noticed that the paths for those two files might be incorrect:

model_repository = args["model_repository"]
wav2vec2_path = os.path.join(model_repository,"1","wav2vec2_asr_base_960h.pt")
labels_path = os.path.join(model_repository,"1","labels.json")

args["model_repository"] will return model_repository/forced_alignment, while the wav2vec2_asr_base_960h.pt and labels.json files are under model_repository/baai_quant_onnx/1/.
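
If the intent is to keep the checkpoint and labels alongside the Python model, one option (a sketch, assuming the two files are moved into model_repository/forced_alignment/1/) is to build the paths inside initialize from the model_version key that the Python backend also passes in args:

import os

# args["model_repository"] is the model's own directory (e.g. /models/forced_alignment)
# and args["model_version"] is the version subdirectory name (e.g. "1").
model_dir = args["model_repository"]
version = args["model_version"]

wav2vec2_path = os.path.join(model_dir, version, "wav2vec2_asr_base_960h.pt")
labels_path = os.path.join(model_dir, version, "labels.json")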

krishung5 commented 1 week ago

Closing due to lack of activity. Please re-open if you would like to follow up on this issue.