triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Deploy multiple instances #6628

Open thucth-qt opened 11 months ago

thucth-qt commented 11 months ago

Description: An error is raised when more than one request is sent at the same time. The pipeline is Stable Diffusion 2.1. Requests were issued with perf_analyzer.

Triton Information: I deploy with the Triton container nvcr.io/nvidia/tritonserver:23.09-py3.

To Reproduce: config.pbtxt - success

backend: "python"
instance_group [
  { 
    kind: KIND_GPU
    gpus: [0]
  }
 ]

max_batch_size: 1

input [
  {
    name: "PROMPT"
    data_type: TYPE_STRING
    dims: [1]
  },
  {
    name: "NEGATIVE_PROMPT"
    data_type: TYPE_STRING
    dims: [1]
  },
  {
    name: "HEIGHT"
    data_type: TYPE_INT32
    dims: [1]
  },
  {
    name: "WIDTH"
    data_type: TYPE_INT32
    dims: [1]
  },
  {
    name: "SEED"
    data_type: TYPE_INT32
    dims: [1]
  },
  {
    name: "NUM_INFERENCE_STEPS"
    data_type: TYPE_INT32
    dims: [1]
  },
  {
    name: "GUIDANCE_SCALE"
    data_type: TYPE_FP32
    dims: [1]
  }
]

output [ 
  {
    name: "IMAGES"
    data_type: TYPE_UINT8
    dims: [-1, -1, -1, -1]
  }
]

If we modify the above configuration as follows, an error occurs. config.pbtxt - error

instance_group [
  { 
    kind: KIND_GPU
    gpus: [1]
  }
 ]

or config.pbtxt - error

instance_group [
  { 
    kind: KIND_GPU
    gpus: [0]
    count: 2
  }
 ]

We implemented the models following https://github.com/NVIDIA/TensorRT/tree/release/8.6/demo/Diffusion and converted them following https://github.com/NVIDIA/TensorRT/blob/release/8.6/demo/Diffusion/demo_txt2img.py#L87C73-L87C73.
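
For context, a minimal sketch of how the Python backend model might wrap the demo pipeline; the self.pipeline attribute and its infer() signature are assumptions based on the demo code, not taken verbatim from this issue:

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the inputs declared in config.pbtxt above.
            prompt = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()
            steps = pb_utils.get_input_tensor_by_name(request, "NUM_INFERENCE_STEPS").as_numpy()
            # self.pipeline is assumed to be built in initialize() from the demo's Txt2ImgPipeline.
            images = self.pipeline.infer(prompt, num_inference_steps=int(steps.flat[0]))
            out = pb_utils.Tensor("IMAGES", np.asarray(images, dtype=np.uint8))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses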

Expected behavior: The pipeline runs successfully with multiple instances on different GPU devices.

kthui commented 11 months ago

Hi @thucth-qt, can you share the Triton server log from when the above two errors occur? The server prints a more detailed log if you add --log-verbose=2 to the command line when starting the server.
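
For example (the model repository path here is just a placeholder):

tritonserver --model-repository=/models --log-verbose=2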

thucth-qt commented 11 months ago

> Hi @thucth-qt, can you share the Triton server log from when the above two errors occur? The server prints a more detailed log if you add --log-verbose=2 to the command line when starting the server.

Hi @kthui, let's fix one error at a time. For the config with gpus: [1], here are the Triton logs:

Loading TensorRT engine: /raw_weights/pretrained_pipes/engine_2.1/clip.plan
[W] 'colored' module is not installed, will not use colors when logging. To enable colors, please install the 'colored' module: python3 -m pip install colored
[I] Loading bytes from /raw_weights/pretrained_pipes/engine_2.1/clip.plan
Loading TensorRT engine: /raw_weights/pretrained_pipes/engine_2.1/unet.plan
[I] Loading bytes from /raw_weights/pretrained_pipes/engine_2.1/unet.plan
Loading TensorRT engine: /raw_weights/pretrained_pipes/engine_2.1/vae.plan
[I] Loading bytes from /raw_weights/pretrained_pipes/engine_2.1/vae.plan
[I] Load TensorRT engines and pytorch modules takes  4.807667993940413
[I] Load resources takes  0.16484481398947537
[I] Warming up ..
[E] 1: [runner.cpp::executeMyelinGraph::715] Error Code 1: Myelin ([exec] Platform (Cuda) error)
[E] 1: [checkMacros.cpp::catchCudaError::203] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
I1124 14:03:14.278660 588 pb_stub.cc:323] Failed to initialize Python stub: ValueError: ERROR: inference failed.

At:
  /trt_server/artian/utils/tensorrt/utilities.py(268): infer
  /trt_server/artian/utils/tensorrt/stable_diffusion_pipeline.py(328): runEngine
  /trt_server/artian/utils/tensorrt/stable_diffusion_pipeline.py(375): encode_prompt
  /trt_server/artian/utils/tensorrt/txt2img_pipeline.py(100): infer
  /models/v21/1/model.py(97): initialize

[E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin ([exec] Platform (Cuda) error)
[E] 1: [defaultAllocator.cpp::deallocate::61] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[E] 1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
...

Here is the line that raises the error: [screenshot in the original issue]

Here is the way I get the device name: [screenshot in the original issue]

Here is the config; whenever the device is GPU 1, the error is raised:

backend: "python"
instance_group [
  { 
    kind: KIND_GPU
    gpus: [1]
  }
 ]
 ...

kthui commented 11 months ago

Since it is on the Python backend, I think the tensors will be on GPU 1 when setting gpus: [1]. Can you double check whether the TensorRT engine is reading the tensors from GPU 0? If so, that might explain the illegal memory access.
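
One quick way to check that inside the Python model is to log the device of every tensor handed to the TensorRT engine. A sketch, assuming the tensors are PyTorch tensors collected in a feed_dict as in the demo's runEngine():

import torch

def log_feed_devices(feed_dict):
    # Print which CUDA device each tensor actually lives on before
    # it is passed to the TensorRT engine.
    for name, tensor in feed_dict.items():
        if isinstance(tensor, torch.Tensor):
            print(f"{name} -> {tensor.device}")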

thucth-qt commented 11 months ago

> Since it is on the Python backend, I think the tensors will be on GPU 1 when setting gpus: [1]. Can you double check whether the TensorRT engine is reading the tensors from GPU 0? If so, that might explain the illegal memory access.

kthui commented 11 months ago

> We have to read the GPU device from the config.pbtxt file using args['model_instance_device_id'] and allocate the GPU ourselves (or at least I do not see it automatically allocating the proper device just by declaring it in the config.pbtxt file).

Would you be able to give Input Tensor Device Placement a try? I think it defaults to "yes", so the input tensors are placed on the CPU.
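
For reference, this corresponds to the Python backend's FORCE_CPU_ONLY_INPUT_TENSORS model parameter; a sketch of how setting it to "no" would look in config.pbtxt:

parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: { string_value: "no" }
}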

> The only way to allocate a device for the Pipeline is by specifying the parameter device=self.device ... the data is indeed accessed from GPU 0 instead of GPU 1.

I think the device str parameter expects a PyTorch device string, and I am not sure what is contained in self.device. Would you be able to try device="cuda:1"? You can find some examples of how PyTorch formats the device string here.
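
A minimal sketch of what that could look like in the model's initialize(); the pipeline class and its constructor arguments are assumptions based on the TensorRT demo, not something confirmed in this thread:

import torch

class TritonPythonModel:
    def initialize(self, args):
        # Triton passes the device id assigned to this model instance
        # (from instance_group in config.pbtxt) as a string.
        device_id = int(args["model_instance_device_id"])
        self.device = f"cuda:{device_id}"
        # Make subsequent CUDA allocations default to this device.
        torch.cuda.set_device(device_id)
        # Hypothetical: hand the PyTorch-style device string to the demo pipeline.
        # self.pipeline = Txt2ImgPipeline(..., device=self.device)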

monk-after-90s commented 11 months ago

@thucth-qt hello, have you solved this problem? I have exactly the same problem 😭

monk-after-90s commented 10 months ago

I solved this problem with the following steps (a rough sketch is shown after the list):

  1. convert the PyTorch model to ONNX
  2. convert the ONNX model to a TensorRT engine with the "trtexec" command, using parameters such as "--minShapes", "--optShapes", "--maxShapes" and "--fp16"
  3. deploy the TensorRT engine to Triton
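
A minimal sketch of steps 1 and 2; the tiny stand-in module, input names, and shapes are made up for illustration and should be replaced with the actual model:

import torch
import torch.nn as nn

# Step 1: export a PyTorch module to ONNX (replace TinyNet with the real UNet/CLIP/VAE).
class TinyNet(nn.Module):
    def forward(self, x):
        return x * 2.0

model = TinyNet().eval()
dummy = torch.randn(1, 4, 64, 64)
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["sample"], output_names=["out"],
    dynamic_axes={"sample": {0: "batch"}},
)

# Step 2 (shell): build a TensorRT engine with explicit dynamic shapes, e.g.
#   trtexec --onnx=model.onnx --saveEngine=model.plan --fp16 \
#       --minShapes=sample:1x4x64x64 --optShapes=sample:1x4x64x64 --maxShapes=sample:4x4x64x64
# Step 3: place model.plan in the Triton model repository with a config.pbtxt
#   that uses the TensorRT backend (platform: "tensorrt_plan").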

thucth-qt commented 10 months ago

Hi @kthui

I tried all of the methods you suggested, but it still does not work. I think some part of the model in StableDiffusionPipeline (https://github.com/NVIDIA/TensorRT/blob/3aaa97b91ee1dd61ea46f78683d9a3438f26192e/demo/Diffusion/stable_diffusion_pipeline.py#L30C7-L30C30) is always loaded onto device 0, regardless of which device I specify via the device parameter.

Btw, could you tell me the differences between running models as in @monk-after-90s's answer above (https://github.com/triton-inference-server/server/issues/6628#issuecomment-1859084640) and running models using polygraphy.backend.trt (https://github.com/NVIDIA/TensorRT/blob/3aaa97b91ee1dd61ea46f78683d9a3438f26192e/demo/Diffusion/stable_diffusion_pipeline.py#L30C7-L30C30)? Which is the better practice: deploying a converted model or deploying an Engine?