triton-inference-server / pytriton

PyTriton is a Flask/FastAPI-like interface that simplifies Triton's deployment in Python environments.
https://triton-inference-server.github.io/pytriton/
Apache License 2.0

Model is not initialized to GPU. #77


jaehyeong-bespin commented 2 weeks ago

Description

I am running PyTriton on a server with a Tesla T4 GPU. Even though the Triton server recognizes the GPU, every model is allocated to the CPU during model initialization.

Below is the log from starting the Triton server (screenshot taken 2024-06-24, 4:46 PM).

The log confirms that the Triton server's metrics detected the GPU (GPU 0: Tesla T4). However, all models are allocated to the CPU (CPU device 0):

I0624 07:34:17.037166 1320065 python_be.cc:1977] TRITONBACKEND_ModelInstanceInitialize: bb8-embedder-nlu_0 (CPU device 0)
I0624 07:34:17.083562 1320065 python_be.cc:1977] TRITONBACKEND_ModelInstanceInitialize: bb8-embedder-assist-biencoder-query_0 (CPU device 0)
I0624 07:34:17.096001 1320065 python_be.cc:1977] TRITONBACKEND_ModelInstanceInitialize: bb8-embedder-assist-crossencoder_0 (CPU device 0)
I0624 07:34:17.117102 1320065 python_be.cc:1977] TRITONBACKEND_ModelInstanceInitialize: bb8-embedder-assist-biencoder-passage_0 (CPU device 0)

To reproduce

Below is the reproduction code for one model.

from pathlib import Path
import numpy as np
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor, DeviceKind
from pytriton.model_config.triton_model_config import TritonModelConfig
from pytriton.model_config.parser import ModelConfigParser
from pytriton.triton import Triton, TritonConfig

# Select the device explicitly; the original snippet used `device`
# without defining it, so it is defined here to make the code runnable.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load SentenceTransformer model
nlu_embedder = SentenceTransformer('bespin-global/klue-sroberta-base-continue-learning-by-mnr', device=device)

@batch
def _infer_fn_nlu(sequence: np.ndarray):
    sequence = np.char.decode(sequence.astype("bytes"), "utf-8")  # need to convert dtype=object to bytes first
    sequence = sum(sequence.tolist(), [])

    embed_vectors = nlu_embedder.encode(sequence, device=device)

    return {'embed_vectors': embed_vectors}

with Triton(config=TritonConfig(allow_gpu_metrics=True)) as triton:
    triton.bind(
        model_name="bb8-embedder-nlu",
        infer_func=_infer_fn_nlu,
        inputs=[
            Tensor(name="sequence", dtype=bytes, shape=(1,)),
        ],
        outputs=[
            Tensor(name="embed_vectors", dtype=bytes, shape=(-1,)),
        ],
        # config=ModelConfig(max_batch_size=args.max_batch_size),  # first attempt, see below
        config=ModelConfigParser.from_file(config_path=Path('./model_config/bb8-embedder-nlu.pbtxt')),
        strict=True,
    )
    triton.serve()  # block and serve; without this the models are never loaded

Initially, ModelConfig() was passed to the config argument of triton.bind(), but because of the problem described above I switched to a config.pbtxt file and loaded the model config through ModelConfigParser().
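For reference, a minimal sketch of that first attempt (max_batch_size=32 is assumed here to match the pbtxt below; the original code read it from args.max_batch_size):

from pytriton.model_config import ModelConfig

# First attempt: let PyTriton generate the model config itself.
triton.bind(
    model_name="bb8-embedder-nlu",
    infer_func=_infer_fn_nlu,
    inputs=[Tensor(name="sequence", dtype=bytes, shape=(1,))],
    outputs=[Tensor(name="embed_vectors", dtype=bytes, shape=(-1,))],
    config=ModelConfig(max_batch_size=32),  # assumed value; original used args.max_batch_size
    strict=True,
)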

Below is the contents of the config.pbtxt file.

name: "bb8-embedder-nlu"
max_batch_size: 32
input {
  name: "sequence"
  data_type: TYPE_STRING
  dims: 1
}
output {
  name: "embed_vectors"
  data_type: TYPE_STRING
  dims: -1
}
instance_group {
  count: 1
  kind: KIND_GPU
  gpus: [ 0 ]
}
dynamic_batching {
}
backend: "python"

As you can see in the pbtxt file, even though the instance_group kind entry specifies KIND_GPU, the model is not assigned to the GPU.

The expected outcome is that the models are assigned to GPU device 0. How can I solve this?

Environment

Additional context

The pbtxt generated under ~/.cache/pytriton/workspace/model_store when the Triton server runs (screenshot taken 2024-06-24, 5:03 PM):

Even though KIND_GPU is written directly in the model config file, it is changed to KIND_CPU in the generated config and the server runs that way.
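For what it's worth, the instance-group kind logged by the Python backend may not reflect where the PyTorch weights actually live, since the model runs inside the user's Python process. A minimal sketch to check the model's real device (assumes the nlu_embedder from the reproduction code is in scope):

import torch

# Report whether CUDA is visible and where the model weights reside,
# independently of the KIND_CPU/KIND_GPU value in Triton's log.
print("CUDA available:", torch.cuda.is_available())
print("Model device:", next(nlu_embedder.parameters()).device)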