triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Deploy multiple instances #6628

Open thucth-qt opened 11 months ago

thucth-qt commented 11 months ago

Description: An error is raised when more than one request is sent at the same time. The pipeline is Stable Diffusion 2.1. Requests were issued with perf_analyzer.

Triton Information: I deploy with the Triton container nvcr.io/nvidia/tritonserver:23.09-py3.

To Reproduce: config.pbtxt - success

backend: "python"
instance_group [
  { 
    kind: KIND_GPU
    gpus: [0]
  }
 ]

max_batch_size: 1

input [
  {
    name: "PROMPT"
    data_type: TYPE_STRING
    dims: [1]
  },
  {
    name: "NEGATIVE_PROMPT"
    data_type: TYPE_STRING
    dims: [1]
  },
  {
    name: "HEIGHT"
    data_type: TYPE_INT32
    dims: [1]
  },
  {
    name: "WIDTH"
    data_type: TYPE_INT32
    dims: [1]
  },
  {
    name: "SEED"
    data_type: TYPE_INT32
    dims: [1]
  },
  {
    name: "NUM_INFERENCE_STEPS"
    data_type: TYPE_INT32
    dims: [1]
  },
  {
    name: "GUIDANCE_SCALE"
    data_type: TYPE_FP32
    dims: [1]
  }
]

output [ 
  {
    name: "IMAGES"
    data_type: TYPE_UINT8
    dims: [-1, -1, -1, -1]
  }
]

If we modify the above configuration as follows, an error occurs. config.pbtxt - error

instance_group [
  { 
    kind: KIND_GPU
    gpus: [1]
  }
 ]

or config.pbtxt - error

instance_group [
  { 
    kind: KIND_GPU
    gpus: [0]
    count: 2
  }
 ]

We implemented the models following https://github.com/NVIDIA/TensorRT/tree/release/8.6/demo/Diffusion and converted them following https://github.com/NVIDIA/TensorRT/blob/release/8.6/demo/Diffusion/demo_txt2img.py#L87C73-L87C73.
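
For context, a minimal sketch of how the Python backend model might wrap the demo pipeline; the self.pipeline attribute and its infer() signature are assumptions based on the demo code, not taken verbatim from this issue:

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the inputs declared in config.pbtxt above.
            prompt = pb_utils.get_input_tensor_by_name(request, "PROMPT").as_numpy()
            steps = pb_utils.get_input_tensor_by_name(request, "NUM_INFERENCE_STEPS").as_numpy()
            # self.pipeline is assumed to be built in initialize() from the demo's Txt2ImgPipeline.
            images = self.pipeline.infer(prompt, num_inference_steps=int(steps.flat[0]))
            out = pb_utils.Tensor("IMAGES", np.asarray(images, dtype=np.uint8))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses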

Expected behavior: The pipeline runs successfully with multiple instances on different GPU devices.

kthui commented 11 months ago

Hi @thucth-qt, can you share the Triton server log from when the above two errors occur? The server prints a more detailed log if you add --log-verbose=2 to the command line when starting the server.
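
For example (the model repository path here is just a placeholder):

tritonserver --model-repository=/models --log-verbose=2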

thucth-qt commented 11 months ago

> Hi @thucth-qt, can you share the Triton server log from when the above two errors occur? The server prints a more detailed log if you add --log-verbose=2 to the command line when starting the server.

Hi @kthui, let's fix one error at a time. For the config with gpus: [1], here are the Triton logs:

Loading TensorRT engine: /raw_weights/pretrained_pipes/engine_2.1/clip.plan
[W] 'colored' module is not installed, will not use colors when logging. To enable colors, please install the 'colored' module: python3 -m pip install colored
[I] Loading bytes from /raw_weights/pretrained_pipes/engine_2.1/clip.plan
Loading TensorRT engine: /raw_weights/pretrained_pipes/engine_2.1/unet.plan
[I] Loading bytes from /raw_weights/pretrained_pipes/engine_2.1/unet.plan
Loading TensorRT engine: /raw_weights/pretrained_pipes/engine_2.1/vae.plan
[I] Loading bytes from /raw_weights/pretrained_pipes/engine_2.1/vae.plan
[I] Load TensorRT engines and pytorch modules takes  4.807667993940413
[I] Load resources takes  0.16484481398947537
[I] Warming up ..
[E] 1: [runner.cpp::executeMyelinGraph::715] Error Code 1: Myelin ([exec] Platform (Cuda) error)
[E] 1: [checkMacros.cpp::catchCudaError::203] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
I1124 14:03:14.278660 588 pb_stub.cc:323] Failed to initialize Python stub: ValueError: ERROR: inference failed.

At:
  /trt_server/artian/utils/tensorrt/utilities.py(268): infer
  /trt_server/artian/utils/tensorrt/stable_diffusion_pipeline.py(328): runEngine
  /trt_server/artian/utils/tensorrt/stable_diffusion_pipeline.py(375): encode_prompt
  /trt_server/artian/utils/tensorrt/txt2img_pipeline.py(100): infer
  /models/v21/1/model.py(97): initialize

[E] 1: [graphContext.h::~MyelinGraphContext::55] Error Code 1: Myelin ([exec] Platform (Cuda) error)
[E] 1: [defaultAllocator.cpp::deallocate::61] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[E] 1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
[E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
...

Here is the line that raises the error: [screenshot in the original issue]

Here is the way I get the device name: [screenshot in the original issue]

Here is the config; whenever the device is GPU 1, the error is raised:

backend: "python"
instance_group [
  { 
    kind: KIND_GPU
    gpus: [1]
  }
 ]
 ...

kthui commented 11 months ago

Since it is on the Python backend, I think the tensors will be on GPU 1 when setting gpus: [1]. Can you double check whether the TensorRT engine is reading the tensors from GPU 0? If so, that might explain the illegal memory access.
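
One quick way to check that inside the Python model is to log the device of every tensor handed to the TensorRT engine. A sketch, assuming the tensors are PyTorch tensors collected in a feed_dict as in the demo's runEngine():

import torch

def log_feed_devices(feed_dict):
    # Print which CUDA device each tensor actually lives on before
    # it is passed to the TensorRT engine.
    for name, tensor in feed_dict.items():
        if isinstance(tensor, torch.Tensor):
            print(f"{name} -> {tensor.device}")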

thucth-qt commented 11 months ago

> Since it is on the Python backend, I think the tensors will be on GPU 1 when setting gpus: [1]. Can you double check whether the TensorRT engine is reading the tensors from GPU 0? If so, that might explain the illegal memory access.

kthui commented 11 months ago

> We have to read the GPU device from the config.pbtxt file using args['model_instance_device_id'] and allocate the GPU ourselves (or at least I do not see it automatically allocating the proper device just by declaring it in the config.pbtxt file).

Would you be able to give Input Tensor Device Placement a try? I think it defaults to "yes", so the input tensors are placed on the CPU.
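
For reference, this corresponds to the Python backend's FORCE_CPU_ONLY_INPUT_TENSORS model parameter; a sketch of how setting it to "no" would look in config.pbtxt:

parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: { string_value: "no" }
}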

> The only way to allocate a device for the Pipeline is by specifying the parameter device=self.device ... the data is indeed accessed from GPU 0 instead of GPU 1.

I think the device str parameter expects a PyTorch device string, and I am not sure what is contained in self.device. Would you be able to try device="cuda:1"? You can find some examples of how PyTorch formats the device string here.
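
A minimal sketch of what that could look like in the model's initialize(); the pipeline class and its constructor arguments are assumptions based on the TensorRT demo, not something confirmed in this thread:

import torch

class TritonPythonModel:
    def initialize(self, args):
        # Triton passes the device id assigned to this model instance
        # (from instance_group in config.pbtxt) as a string.
        device_id = int(args["model_instance_device_id"])
        self.device = f"cuda:{device_id}"
        # Make subsequent CUDA allocations default to this device.
        torch.cuda.set_device(device_id)
        # Hypothetical: hand the PyTorch-style device string to the demo pipeline.
        # self.pipeline = Txt2ImgPipeline(..., device=self.device)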

monk-after-90s commented 11 months ago

@thucth-qt hello, have you solved this problem? I have exactly the same problem 😭

monk-after-90s commented 10 months ago

I solved this problem with the following steps (a rough sketch is shown after the list):

  1. convert the PyTorch model to ONNX
  2. convert the ONNX model to a TensorRT engine with the "trtexec" command, using parameters such as "--minShapes", "--optShapes", "--maxShapes" and "--fp16"
  3. deploy the TensorRT engine to Triton
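
A minimal sketch of steps 1 and 2; the tiny stand-in module, input names, and shapes are made up for illustration and should be replaced with the actual model:

import torch
import torch.nn as nn

# Step 1: export a PyTorch module to ONNX (replace TinyNet with the real UNet/CLIP/VAE).
class TinyNet(nn.Module):
    def forward(self, x):
        return x * 2.0

model = TinyNet().eval()
dummy = torch.randn(1, 4, 64, 64)
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["sample"], output_names=["out"],
    dynamic_axes={"sample": {0: "batch"}},
)

# Step 2 (shell): build a TensorRT engine with explicit dynamic shapes, e.g.
#   trtexec --onnx=model.onnx --saveEngine=model.plan --fp16 \
#       --minShapes=sample:1x4x64x64 --optShapes=sample:1x4x64x64 --maxShapes=sample:4x4x64x64
# Step 3: place model.plan in the Triton model repository with a config.pbtxt
#   that uses the TensorRT backend (platform: "tensorrt_plan").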

thucth-qt commented 10 months ago

Hi @kthui

I tried all of the methods you suggested, but it still does not work. I think some part of the model in StableDiffusionPipeline (https://github.com/NVIDIA/TensorRT/blob/3aaa97b91ee1dd61ea46f78683d9a3438f26192e/demo/Diffusion/stable_diffusion_pipeline.py#L30C7-L30C30) is always loaded onto device 0, regardless of which device I specify via the device parameter.

Btw, could you tell me the differences between running models as in @monk-after-90s's answer above (https://github.com/triton-inference-server/server/issues/6628#issuecomment-1859084640) and running models using polygraphy.backend.trt (https://github.com/NVIDIA/TensorRT/blob/3aaa97b91ee1dd61ea46f78683d9a3438f26192e/demo/Diffusion/stable_diffusion_pipeline.py#L30C7-L30C30)? Which is the better practice: deploying a converted model or deploying an Engine?