Tritonserver hangs on launch with python backend #7268

JamesBowerXanda commented 1 month ago

Description I am trying to run Triton server with CPU-only models. The server launches perfectly with only ONNX models, but the moment I include a Python backend model it hangs on launch indefinitely.

I am using an Apple M2 Mac.

It is worth noting that the model runs when I use the SageMaker Triton Server image on a SageMaker multi-model endpoint.

Triton Information

Version? 23.02, although I have also tried 24.04.

Are you using the Triton container or did you build it yourself? Container. Specifically

To Reproduce

  1. Pull
  2. Create models_repository with a python backend model.
  3. Launch the docker container without starting the server using docker run -it -p8000:8000 -p8001:8001 -p8002:8002 -v/Users/jamesbower/Projects/triton-local/model_repository:/models /bin/bash
  4. Install python packages as system packages pip install --no-cache-dir torch --index-url and pip install --no-cache-dir numpy.
  5. cd to where the models directory is located.
  6. Run tritonserver --model-repository models/


The following is displayed:

W0524 08:06:50.694589 82] Unable to allocate pinned system memory, pinned memory pool will not be available: no CUDA-capable device is detected
I0524 08:06:50.694772 82] CUDA memory pool disabled
I0524 08:06:50.705575 82] loading: baai_quant_onnx:1
I0524 08:06:50.706361 82] loading: forced_alignment:1
I0524 08:06:50.707123 82] loading: titanet_small_onnx:1
I0524 08:06:50.707594 82] TRITONBACKEND_Initialize: onnxruntime
I0524 08:06:50.707624 82] Triton TRITONBACKEND API version: 1.11
I0524 08:06:50.707628 82] 'onnxruntime' TRITONBACKEND API version: 1.11
I0524 08:06:50.707632 82] backend configuration:
I0524 08:06:50.718814 82] TRITONBACKEND_ModelInitialize: baai_quant_onnx (version 1)
I0524 08:06:50.719420 82] skipping model configuration auto-complete for 'baai_quant_onnx': inputs and outputs already specified

It just hangs here eternally.

Expected behavior Triton server launched completely such that curl -v localhost:8000/v2/health/ready receives a status 200 response.
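
As a rough illustration of that readiness check, here is a minimal Python sketch equivalent to the curl call. It assumes the HTTP port 8000 published in the docker run command above; the helper name server_ready is made up for this example and is not part of Triton.

```python
import urllib.error
import urllib.request


def server_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, a timeout, or an HTTP error status all count as "not ready".
        return False


if __name__ == "__main__":
    print("ready" if server_ready() else "not ready")
```

While the server is stuck loading a model, this is expected to keep reporting "not ready".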

Model Repository Setup

The structure of the model repository is:

|  |--1
|  |  |--model.onnx
|  |  |--labels.json
|  |  |
|  |--config.pbtxt
|  |--1
|  |  |-model.onnx
|  |--config.pbtxt
|  |--1
|  |  |
|  |--config.pbtxt

I am not using a conda-packed execution environment, since I install the required packages in the container after launching it. I have also tried a conda-packed environment, though, which is the method I used with the SageMaker Triton Server image.
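
One way to sanity-check the pip-installed packages is to ask the container's Python interpreter whether it can find them. The sketch below is only an illustration: missing_modules is a hypothetical helper, and it assumes the Python backend uses the same interpreter that pip installed into.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Third-party modules imported by the Python model in this issue.
    print(missing_modules(["torch", "numpy"]))
```

An empty list only shows that the modules can be located, not that importing them succeeds.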

The file is

import triton_python_backend_utils as pb_utils
import numpy as np
import json
import torch
import re
import os
from dataclasses import dataclass

class TritonPythonModel:

    def initialize(self, args):
        self.word_output_type = pb_utils.triton_string_to_numpy(
            pb_utils.get_output_config_by_name(json.loads(args["model_config"]), "word")["data_type"]
        )
        self.start_time_output_type = pb_utils.triton_string_to_numpy(
            pb_utils.get_output_config_by_name(json.loads(args["model_config"]), "start_time")["data_type"]
        )
        self.end_time_output_type = pb_utils.triton_string_to_numpy(
            pb_utils.get_output_config_by_name(json.loads(args["model_config"]), "end_time")["data_type"]
        )

        model_repository = args["model_repository"]
        wav2vec2_path = os.path.join(model_repository,"1","")
        labels_path = os.path.join(model_repository,"1","labels.json")

        self.wav2vecmodel = torch.jit.load(wav2vec2_path).eval()
        with open(labels_path, "r") as f:
            self.labels = tuple(json.load(f))

    def execute(self, requests):

        responses = []

        for request in requests:
            transcription = pb_utils.get_input_tensor_by_name(request, "transcription").as_numpy().squeeze(1).astype(str)
            audio = pb_utils.get_input_tensor_by_name(request, "audio").as_numpy()            

            # Optional preprocessing code for inputs in standard Python...
            transcription = transcription.tolist()[0]
            transcript = convert_to_transcript(transcription)

            with torch.inference_mode():
                waveform = torch.tensor(audio, dtype=torch.float32)
                emissions = calculate_emissions(self.wav2vecmodel, waveform)

            emissions = emissions[0].cpu().detach().numpy()
            waveform = waveform.cpu().detach().numpy()
            dictionary = {c: i for i, c in enumerate(self.labels)}
            tokens = [dictionary[c] for c in transcript]
            ratio = waveform.shape[1] / emissions.shape[0]

            trellis = get_trellis_numba(emissions, np.array(tokens))

            path = backtrack_numba(trellis, emissions, tokens)
            path = [Point(*p) for p in path]
            segments = merge_repeats(path, transcript, ratio)
            word_segments = merge_words(segments)

            words = []
            start_times = []
            end_times = []

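            for segment in word_segments:
                words.append(segment.label)
                start_times.append(segment.start_time)
                end_times.append(segment.end_time)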

            words = np.array([words]).astype(self.word_output_type)
            start_times = np.array([start_times]).astype(self.start_time_output_type)
            end_times = np.array([end_times]).astype(self.end_time_output_type)

            output_tensor_words = pb_utils.Tensor("word", words)
            output_tensor_start_times = pb_utils.Tensor("start_time", start_times)
            output_tensor_end_times = pb_utils.Tensor("end_time", end_times)

            response = pb_utils.InferenceResponse(
                output_tensors=[output_tensor_words, output_tensor_start_times, output_tensor_end_times]
            )
            responses.append(response)

        return responses

    # Any cleanup code to be used when the model is unloaded. Not completely sure of the degree to which this is required currently.
    def finalize(self):
        print("Finalizing model...")

def convert_to_transcript(text: str):
    text = text.upper().strip()
    text = re.sub(r"[^A-Z0-9\s]", "", text)
    text = "|" + re.sub(r"\s+", "|", text) + "|"
    return text    

def calculate_emissions(wav2vecmodel, waveform) -> torch.Tensor:
    emissions, _ = wav2vecmodel(waveform)
    emissions = torch.log_softmax(emissions, dim=-1)
    return emissions

def get_trellis_numba(emission, tokens, blank_id=0):
    num_frame = emission.shape[0]
    num_tokens = len(tokens)

    trellis = np.zeros((num_frame, num_tokens))
    trellis[1:,0] = np.cumsum(emission[1:, blank_id])
    trellis[0, 1:] = -np.inf
    trellis[-num_tokens + 1:, 0] = np.inf

    for t in range(num_frame - 1):
        trellis[t + 1, 1:] = np.maximum(
            # Score for staying at the same token
            trellis[t, 1:] + emission[t, blank_id],
            # Score for changing to the next token
            trellis[t, :-1] + emission[t, tokens[1:]],
        )

    return trellis

def backtrack_numba(trellis, emission, tokens, blank_id=0):
    t, j = trellis.shape[0] - 1, trellis.shape[1] - 1

    path = [(j, t, np.exp(emission[t, blank_id]))]
    while j > 0:
        assert t > 0  # Should not happen but just in case

        # 1. Figure out if the current position was stay or change
        # Frame-wise score of stay vs change
        p_stay = emission[t - 1, blank_id]
        p_change = emission[t - 1, tokens[j]]

        # Context-aware score for stay vs change
        stayed = trellis[t - 1, j] + p_stay
        changed = trellis[t - 1, j - 1] + p_change

        # Update position
        t -= 1
        if changed > stayed:
            j -= 1

        # Store the path with frame-wise probability.
        prob = np.exp(p_change if changed > stayed else p_stay)
        path.append((j, t, prob))

    # Now j == 0, which means it reached the SoS (Start of Sequence).
    # Fill up the rest for the sake of visualization
    while t > 0:
        prob = np.exp(emission[t - 1, blank_id])
        path.append((j, t - 1, prob))
        t -= 1

    return path[::-1]

@dataclass
class Point:
    token_index: int
    time_index: int
    score: float

@dataclass
class Segment:
    label: str
    start: int
    end: int
    score: float
    start_time: float = 0.0
    end_time: float = 0.0

    def __repr__(self):
        return f"{self.label}\t({self.score:4.2f}): [{self.start:5d}, {self.end:5d}), {self.start_time:.2f}s"

    @property
    def length(self):
        return self.end - self.start

    def json(self):
        return {
            "label": self.label,
            "start_time": self.start_time,
            "end_time": self.end_time,
        }

def merge_repeats(path, transcript, ratio):
    i1, i2 = 0, 0
    segments = []
    while i1 < len(path):
        while i2 < len(path) and path[i1].token_index == path[i2].token_index:
            i2 += 1
        score = sum(path[k].score for k in range(i1, i2)) / (i2 - i1)
        segments.append(
            Segment(
                transcript[path[i1].token_index],
                path[i1].time_index,
                path[i2 - 1].time_index + 1,
                score,
                ratio * path[i1].time_index / 16_000,
                (ratio * path[i2 - 1].time_index + 1) / 16_000,
            )
        )
        i1 = i2
    return segments

def merge_words(segments, separator="|"):
    words = []
    i1, i2 = 0, 0
    while i1 < len(segments):
        if i2 >= len(segments) or segments[i2].label == separator:
            if i1 != i2:
                segs = segments[i1:i2]
                word = "".join([seg.label for seg in segs])
                score = sum(seg.score * seg.length for seg in segs) / sum(seg.length for seg in segs)
                words.append(Segment(word, segments[i1].start, segments[i2 - 1].end, score, segments[i1].start_time,
                                     segments[i2 - 1].end_time))
            i1 = i2 + 1
            i2 = i1
        else:
            i2 += 1
    return words

The config.pbtxt is:

name: "forced_alignment"
backend: "python"
max_batch_size: 1
input [
  {
    name: "transcription"
    data_type: TYPE_STRING
    dims: [ 1 ]
  },
  {
    name: "audio"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
output [
  {
    name: "word"
    data_type: TYPE_STRING
    dims: [ -1 ]
  },
  {
    name: "start_time"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "end_time"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
instance_group {
  count: 1
  kind: KIND_CPU
}

Execution env is not set as I install the required packages in the container.

krishung5 commented 1 month ago

Hi @JamesBowerXanda, Triton doesn't officially support Mac, but I assume it would work if you are only running CPU-only models. I couldn't reproduce the hang on a Linux machine. Since I don't have the and labels.json files, I replaced the with some and removed the line for labels.json, and Triton is not hanging on my side. Could you run the server with --log-verbose=1 and see if there's any error reported in the log?

I also noticed that the paths for those two files might be incorrect:

model_repository = args["model_repository"]
wav2vec2_path = os.path.join(model_repository,"1","")
labels_path = os.path.join(model_repository,"1","labels.json")

The args["model_repository"] will return model_repository/forced_alignment, while the and labels.json files are under model_repository/baai_quant_onnx/1/.
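
To make a path mismatch like this fail loudly instead of silently, the model could check that each file exists before loading it. This is a minimal sketch, not the author's code: resolve_artifact is a hypothetical helper, and it assumes the files live under the version directory of the Python model itself, using the "model_repository" and "model_version" entries of args.

```python
import os


def resolve_artifact(args: dict, filename: str) -> str:
    """Build <model_repository>/<model_version>/<filename> and fail fast if it is missing."""
    path = os.path.join(args["model_repository"], args["model_version"], filename)
    if not os.path.isfile(path):
        # A clear error in initialize() is easier to diagnose than a failure deeper in loading.
        raise FileNotFoundError(f"expected model artifact at {path}")
    return path
```

Inside initialize, a call such as resolve_artifact(args, "labels.json") would then either return a usable path or raise an error that names the missing file.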

krishung5 commented 1 week ago

Closing due to lack of activity. Please re-open the issue if you would like to follow up with this issue.