triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Triton Launches Model on Incorrect GPU #481

Closed: ethan-digi closed this issue 1 month ago

ethan-digi commented 1 month ago

Description

With the following instance_group in config.pbtxt:

```
instance_group [
    {
      count: 1
      kind: KIND_GPU
      gpus: [ 1 ]
    }
  ]
```

When I run `python3 tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm`, the model is loaded on the wrong GPU (GPU 0 instead of GPU 1, as the nvidia-smi output below shows):

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:00:06.0 Off |                    0 |
| N/A   46C    P0              79W / 300W |  73875MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          On  | 00000000:00:09.0 Off |                    0 |
| N/A   47C    P0              76W / 300W |    491MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

Triton Information

What version of Triton are you using?

24.04

Are you using the Triton container or did you build it yourself?

Triton container, using OpenMPI 5.0.3.

To Reproduce

Steps to reproduce the behavior:

Running

```
git clone https://huggingface.co/Xwin-LM/Xwin-LM-13B-V0.2
python3 quantize.py --model_dir /workspace/Kunoichi-7B/ --dtype float16 --qformat int4_awq --awq_block_size 128 --output_dir /workspace/Xwin-LM-13B-V0.2 --calib_size 32 --batch_size 16
trtllm-build --checkpoint_dir /workspace/xwin_13b_quantized_fp16_kvcache/ --gemm_plugin float16 --gpt_attention_plugin float16 --output_dir /workspace/eng_xwin_13B_quantized_fp16_kvcache --paged_kv_cache enable --max_input_len 32256 --use_paged_context_fmha enable --context_fmha enable
python3 tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm
```

with

```
instance_group [
    {
      count: 1
      kind: KIND_GPU
      gpus: [ 1 ]
    }
  ]
```

in config.pbtxt

Full config.pbtxt:

```
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 2
model_transaction_policy {
  decoupled: false
}
dynamic_batching {
  preferred_batch_size: [ 2 ]
  max_queue_delay_microseconds: 10000
}
input [
  { name: "input_ids" data_type: TYPE_INT32 dims: [ -1 ] allow_ragged_batch: true },
  { name: "input_lengths" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } },
  { name: "request_output_len" data_type: TYPE_INT32 dims: [ 1 ] },
  { name: "draft_input_ids" data_type: TYPE_INT32 dims: [ -1 ] optional: true allow_ragged_batch: true },
  { name: "draft_logits" data_type: TYPE_FP32 dims: [ -1, -1 ] optional: true allow_ragged_batch: true },
  { name: "end_id" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "pad_id" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "stop_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true allow_ragged_batch: true },
  { name: "bad_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true allow_ragged_batch: true },
  { name: "embedding_bias" data_type: TYPE_FP32 dims: [ -1 ] optional: true allow_ragged_batch: true },
  { name: "beam_width" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "runtime_top_k" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "runtime_top_p" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "len_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "min_length" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "presence_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "frequency_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "return_log_probs" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "return_context_logits" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "return_generation_logits" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "stop" data_type: TYPE_BOOL dims: [ 1 ] optional: true },
  { name: "streaming" data_type: TYPE_BOOL dims: [ 1 ] optional: true },
  { name: "prompt_embedding_table" data_type: TYPE_FP16 dims: [ -1, -1 ] optional: true allow_ragged_batch: true },
  { name: "prompt_vocab_size" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  # the unique task ID for the given LoRA.
  # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
  # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
  # If the cache is full the oldest LoRA will be evicted to make space for new ones. An error is returned if `lora_task_id` is not cached.
  { name: "lora_task_id" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors are first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  { name: "lora_weights" data_type: TYPE_FP16 dims: [ -1, -1 ] optional: true allow_ragged_batch: true },
  # module identifier (same size a first dimension of lora_weights)
  # See LoraModule::ModuleType for model id mapping
  #
  #   "attn_qkv": 0     # compbined qkv adapter
  #   "attn_q": 1       # q adapter
  #   "attn_k": 2       # k adapter
  #   "attn_v": 3       # v adapter
  #   "attn_dense": 4   # adapter for the dense layer in attention
  #   "mlp_h_to_4h": 5  # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  #   "mlp_4h_to_h": 6  # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  #   "mlp_gate": 7     # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  { name: "lora_config" data_type: TYPE_INT32 dims: [ -1, 3 ] optional: true allow_ragged_batch: true }
]
output [
  { name: "output_ids" data_type: TYPE_INT32 dims: [ -1, -1 ] },
  { name: "sequence_length" data_type: TYPE_INT32 dims: [ -1 ] },
  { name: "cum_log_probs" data_type: TYPE_FP32 dims: [ -1 ] },
  { name: "output_log_probs" data_type: TYPE_FP32 dims: [ -1, -1 ] },
  { name: "context_logits" data_type: TYPE_FP32 dims: [ -1, -1 ] },
  { name: "generation_logits" data_type: TYPE_FP32 dims: [ -1, -1, -1 ] }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]
parameters: { key: "max_beam_width" value: { string_value: "${max_beam_width}" } }
parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: { string_value: "no" } }
parameters: { key: "gpt_model_type" value: { string_value: "inflight_fused_batching" } }
parameters: { key: "gpt_model_path" value: { string_value: "/opt/tritonserver/inflight_batcher_llm/tensorrt_llm/1/" } }
parameters: { key: "max_tokens_in_paged_kv_cache" value: { string_value: "${max_tokens_in_paged_kv_cache}" } }
parameters: { key: "max_attention_window_size" value: { string_value: "${max_attention_window_size}" } }
parameters: { key: "batch_scheduler_policy" value: { string_value: "${batch_scheduler_policy}" } }
parameters: { key: "kv_cache_free_gpu_mem_fraction" value: { string_value: "0.9" } }
parameters: { key: "enable_trt_overlap" value: { string_value: "${enable_trt_overlap}" } }
parameters: { key: "exclude_input_in_output" value: { string_value: "true" } }
parameters: { key: "enable_kv_cache_reuse" value: { string_value: "true" } }
parameters: { key: "normalize_log_probs" value: { string_value: "${normalize_log_probs}" } }
parameters: { key: "enable_chunked_context" value: { string_value: "${enable_chunked_context}" } }
parameters: { key: "gpu_device_ids" value: { string_value: "${gpu_device_ids}" } }
parameters: { key: "lora_cache_optimal_adapter_size" value: { string_value: "${lora_cache_optimal_adapter_size}" } }
parameters: { key: "lora_cache_max_adapter_size" value: { string_value: "${lora_cache_max_adapter_size}" } }
parameters: { key: "lora_cache_gpu_memory_fraction" value: { string_value: "0" } }
parameters: { key: "lora_cache_host_memory_bytes" value: { string_value: "${lora_cache_host_memory_bytes}" } }
parameters: { key: "decoding_mode" value: { string_value: "${decoding_mode}" } }
parameters: { key: "worker_path" value: { string_value: "/opt/tritonserver/backends/tensorrtllm/triton_tensorrtllm_worker" } }
parameters: { key: "medusa_choices" value: { string_value: "${medusa_choices}" } }
```

Expected behavior

TensorRT-LLM launches normally on GPU 1, as specified.

rmccorm4 commented 1 month ago

Hi @ethan-digi, I believe the TensorRT-LLM backend may not currently be accounting for the TRITONBACKEND_ModelInstanceDeviceId that Triton assigns to it from the instance group settings you're referencing. I'm going to move this issue to https://github.com/triton-inference-server/tensorrtllm_backend/issues to see if they can help comment.

As a workaround, you may be able to isolate the desired GPU via `CUDA_VISIBLE_DEVICES` within the container, or by limiting which GPUs are exposed to the container with `docker run --gpus ...`. For example:
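(A minimal sketch of those two options, assuming GPU 1 is the device to keep and reusing the paths from the reproduction command above.)

```
# Option 1: restrict the server process to GPU 1 only. Note that the remaining
# device is then enumerated as device 0 inside the process, so instance_group /
# gpu_device_ids would refer to it as 0.
CUDA_VISIBLE_DEVICES=1 python3 tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm

# Option 2: expose only GPU 1 to the container when starting it.
docker run --gpus '"device=1"' ...
```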

rmccorm4 commented 1 month ago

CC @pcastonguay

ethan-digi commented 1 month ago

@rmccorm4 thank you very much for your response and the proper categorization. I agree that `CUDA_VISIBLE_DEVICES` or `docker run --gpus` would work, but the real issue for me is that I'm trying to get the Triton server to run two copies of the model, and it overloads the first GPU. With:

```
instance_group [
    {
      count: 1
      kind: KIND_GPU
      gpus: [ 0, 1 ]
    }
  ]
```

This results in two model loads landing on the first GPU and a consequent VRAM OOM. The present issue is a sub-bug I found while attempting to debug that. Here's the issue regarding that larger problem.

pcastonguay commented 1 month ago

Hi @ethan-digi, we are aware of this issue with the TRT-LLM backend not respecting the instance_group parameters. Currently, the GPU device ID to use for a particular model instance should be specified in config.pbtxt here:
https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L468-L473

So for a single-GPU model, you would set:

string_value: "0"

for example, while for a multi-GPU model:

string_value: "0, 1, 2, 3"
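Concretely, a sketch of what that parameter block in config.pbtxt would look like, assuming GPU 1 is the target as in the instance_group above:

```
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "1"
  }
}
```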

In order to have two instances running on two different GPUs, you currently would need to make multiple copies of the tensorrt_llm folder, each with its own gpu_device_ids parameter. For single-GPU models, we could probably allow the Triton instance_group parameters to override the gpu_device_ids parameter, although that's not implemented yet. I have created a ticket to support that.

ethan-digi commented 1 month ago

@pcastonguay thank you very much for the explanation, and I'm glad my issue has inspired an improvement to the service.

One question - when you say

> you currently would need to make multiple copies of the tensorrt_llm folder

which aspects of the folder do you mean? You can't make multiple directories named tensorrt_llm within inflight_batcher_llm, and you can't make multiple files named config.pbtxt within tensorrt_llm. So I'm still not exactly sure how to replicate that file structure, though I grasp the general concept of duplicating the config while changing the device specification. Thank you very much; knowing how to achieve multiple model instances in one server instance will be really useful for load balancing.

ethan-digi commented 1 month ago

Edit: Figured it out. I had to read up on how the server loads models, but I have both copies running via the `--multi-model` parameter. Thank you both very much @rmccorm4 @pcastonguay.
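For reference, a sketch of what that setup can look like. The folder names below are illustrative only; note that if a copy is renamed, the `name` field in its config.pbtxt must match the folder name (or be omitted so it defaults to the folder name), and each copy sets its own gpu_device_ids:

```
inflight_batcher_llm/
├── tensorrt_llm_gpu0/   # copy of tensorrt_llm; config.pbtxt sets gpu_device_ids "0"
│   ├── 1/
│   └── config.pbtxt
├── tensorrt_llm_gpu1/   # copy of tensorrt_llm; config.pbtxt sets gpu_device_ids "1"
│   ├── 1/
│   └── config.pbtxt
└── ...                  # remaining model folders (preprocessing, postprocessing, ensemble, ...)
```

The server is then launched over the whole repository with the `--multi-model` flag, keeping the same flags as in the original command:

```
python3 tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 \
    --model_repo=/opt/tritonserver/inflight_batcher_llm --multi-model
```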