Closed ethan-digi closed 1 month ago
Hi @ethan-digi, I believe the TensorRT-LLM backend may not currently be accounting for the `TRITONBACKEND_ModelInstanceDeviceId` that Triton assigns to it from the instance group settings you're referencing. I'm going to move this issue to https://github.com/triton-inference-server/tensorrtllm_backend/issues to see if they can help comment.
As a workaround, you may be able to isolate the desired GPU via `CUDA_VISIBLE_DEVICES` within the container, or by limiting which GPUs are exposed to the container with `docker run --gpus ...`
CC @pcastonguay
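For anyone wanting to try the `CUDA_VISIBLE_DEVICES` workaround from a script, here is a minimal sketch. The helper name `env_for_gpus` is hypothetical, and it assumes the launcher passes its environment through to the server process. Note that CUDA renumbers the visible devices, so with `CUDA_VISIBLE_DEVICES=1` the process sees that GPU as device 0.

```python
import os
from typing import Dict, Mapping, Optional


def env_for_gpus(gpu_ids: str, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment that exposes only the given GPUs to CUDA.

    gpu_ids is a comma-separated string such as "1" or "0,1".
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = gpu_ids
    return env


# Usage (not executed here; the command is the one from this issue):
#   import subprocess
#   subprocess.run(
#       ["python3", "tensorrtllm_backend/scripts/launch_triton_server.py",
#        "--world_size=1", "--model_repo=/opt/tritonserver/inflight_batcher_llm"],
#       env=env_for_gpus("1"),
#       check=True,
#   )
```

This only hides GPUs from one server process, so it does not help when a single server should place instances on different GPUs.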
@rmccorm4 thank you very much for your response + proper categorization. I agree that `CUDA_VISIBLE_DEVICES` / `docker run` with the `--gpus` param would work, but the real issue for me is that I'm trying to get the Triton server to run two copies of the model, and it overpopulates the first GPU.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
This results in two loads happening on the first gpu, and consequent VRAM OOM. The present issue is kind of a sub-bug I found while attempting to debug that. Here's the issue regarding that larger problem
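To make the expected placement concrete: as I understand Triton's documented `instance_group` semantics, `count` applies to each listed GPU, so the config above should give one instance on GPU 0 and one on GPU 1. A small sketch of that rule follows. The function `expected_instances` is hypothetical and only models the documented behavior, not what the TensorRT-LLM backend actually does.

```python
from typing import Dict, List, Tuple


def expected_instances(instance_groups: List[Dict]) -> List[Tuple[int, int]]:
    """Return (gpu_id, instance_index) pairs, assuming `count` instances per listed GPU."""
    placements = []
    for group in instance_groups:
        for gpu in group["gpus"]:
            for idx in range(group.get("count", 1)):
                placements.append((gpu, idx))
    return placements


# The instance_group from this comment, written as a Python dict.
groups = [{"count": 1, "kind": "KIND_GPU", "gpus": [0, 1]}]
placements = expected_instances(groups)
```

What I observe instead is both loads landing on GPU 0.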
Hi @ethan-digi, we are aware of this issue with the trt-llm backend not respecting the `instance_group` parameters. Currently, the GPU device id to use for a particular model instance should be specified in the `config.pbtxt` here: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L468-L473
So for a single GPU model, you would set:
string_value: "0"
for example, while for a multi-GPU model:
string_value: "0, 1, 2, 3"
In order to have two instances running on two different GPUs, you currently would need to make multiple copies of the `tensorrt_llm` folder, each with its own `gpu_device_ids` parameter. For single GPU models, we could probably allow the Triton `instance_group` parameters to override the `gpu_device_ids` parameter, although that's not implemented yet. I have created a ticket to support that.
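A rough sketch of the "copy the folder, change `gpu_device_ids`" approach described above, in case it helps others. The helper names and the target directory names (for example `tensorrt_llm_gpu1`) are hypothetical. It assumes the copies sit next to each other in the model repository and that `gpt_model_path` can stay as it is; it also rewrites the top-level `name` field, since Triton expects it to match the directory name when present.

```python
import re
import shutil
from pathlib import Path


def set_gpu_device_ids(config_text: str, model_name: str, gpu_ids: str) -> str:
    """Return config.pbtxt text with a new top-level name and gpu_device_ids value."""
    # Only the unindented, top-level `name:` is touched; tensor names are indented.
    config_text = re.sub(
        r'^name:\s*"[^"]*"',
        lambda m: f'name: "{model_name}"',
        config_text,
        count=1,
        flags=re.MULTILINE,
    )
    # Replace the string_value that follows the gpu_device_ids key.
    config_text, replaced = re.subn(
        r'(key:\s*"gpu_device_ids"\s*value:\s*\{\s*string_value:\s*)"[^"]*"',
        lambda m: f'{m.group(1)}"{gpu_ids}"',
        config_text,
        count=1,
    )
    if replaced != 1:
        raise ValueError("gpu_device_ids parameter not found in config.pbtxt")
    return config_text


def clone_model_for_gpu(src_dir: Path, dst_dir: Path, gpu_ids: str) -> Path:
    """Copy a model folder and point the copy at the given GPU id(s)."""
    shutil.copytree(src_dir, dst_dir)
    config_path = dst_dir / "config.pbtxt"
    config_path.write_text(
        set_gpu_device_ids(config_path.read_text(), dst_dir.name, gpu_ids)
    )
    return config_path


# Usage (hypothetical paths, not executed here):
#   repo = Path("/opt/tritonserver/inflight_batcher_llm")
#   clone_model_for_gpu(repo / "tensorrt_llm", repo / "tensorrt_llm_gpu1", "1")
```

The original folder is left untouched, so its own `gpu_device_ids` still has to be set separately (for example to `"0"`).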
@pcastonguay thank you very much for the explanation, and I'm glad my issue has inspired an improvement to the service.
One question - when you say

> you currently would need to make multiple copies of the `tensorrt_llm` folder

which aspects of the folder do you mean? You can't make multiple directories named `tensorrt_llm` within `inflight_batcher_llm`, and you can't make multiple files named `config.pbtxt` within `tensorrt_llm`. So I'm still not sure how to replicate that file structure, though I grasp the general concept of duplicating the config and changing the device specification. Thank you very much, knowing how to achieve multiple model instances in one server instance will be really useful for load balancing.
Edit: Figured it out. I had to read up on how the server loads models, but I have both running via the --multi-model parameter. Thank you both very much @rmccorm4 @pcastonguay
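For reference, a minimal sketch of how the launch command from this issue could be assembled with the `--multi-model` flag. The helper `build_launch_command` is hypothetical; the script path and the `--world_size` / `--model_repo` flags are the ones used elsewhere in this thread.

```python
from typing import List


def build_launch_command(model_repo: str, world_size: int = 1, multi_model: bool = False) -> List[str]:
    """Assemble the launch_triton_server.py command line as an argument list."""
    cmd = [
        "python3",
        "tensorrtllm_backend/scripts/launch_triton_server.py",
        f"--world_size={world_size}",
        f"--model_repo={model_repo}",
    ]
    if multi_model:
        # Flag mentioned above for serving several TRT-LLM model folders from one repo.
        cmd.append("--multi-model")
    return cmd


cmd = build_launch_command("/opt/tritonserver/inflight_batcher_llm", multi_model=True)
# Pass `cmd` to subprocess.run(cmd, check=True) to actually start the server.
```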
Description
With:
When I run `python3 tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm`, the process is launched on the incorrect GPU:
Triton Information
What version of Triton are you using?
24.04
Are you using the Triton container or did you build it yourself?
Triton container. Using openmpi 5.0.3.
To Reproduce
Steps to reproduce the behavior.
Running
git clone https://huggingface.co/Xwin-LM/Xwin-LM-13B-V0.2
python3 quantize.py --model_dir /workspace/Kunoichi-7B/ --dtype float16 --qformat int4_awq --awq_block_size 128 --output_dir /workspace/Xwin-LM-13B-V0.2 --calib_size 32 --batch_size 16
trtllm-build --checkpoint_dir /workspace/xwin_13b_quantized_fp16_kvcache/ --gemm_plugin float16 --gpt_attention_plugin float16 --output_dir /workspace/eng_xwin_13B_quantized_fp16_kvcache --paged_kv_cache enable --max_input_len 32256 --use_paged_context_fmha enable --context_fmha enable
python3 tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm
with
in config.pbtxt
full config.pbtxt
```
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 2

model_transaction_policy {
  decoupled: false
}

dynamic_batching {
  preferred_batch_size: [ 2 ]
  max_queue_delay_microseconds: 10000
}

input [
  { name: "input_ids" data_type: TYPE_INT32 dims: [ -1 ] allow_ragged_batch: true },
  { name: "input_lengths" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } },
  { name: "request_output_len" data_type: TYPE_INT32 dims: [ 1 ] },
  { name: "draft_input_ids" data_type: TYPE_INT32 dims: [ -1 ] optional: true allow_ragged_batch: true },
  { name: "draft_logits" data_type: TYPE_FP32 dims: [ -1, -1 ] optional: true allow_ragged_batch: true },
  { name: "end_id" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "pad_id" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "stop_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true allow_ragged_batch: true },
  { name: "bad_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true allow_ragged_batch: true },
  { name: "embedding_bias" data_type: TYPE_FP32 dims: [ -1 ] optional: true allow_ragged_batch: true },
  { name: "beam_width" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "runtime_top_k" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "runtime_top_p" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "len_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "min_length" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "presence_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "frequency_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "return_log_probs" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "return_context_logits" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "return_generation_logits" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "stop" data_type: TYPE_BOOL dims: [ 1 ] optional: true },
  { name: "streaming" data_type: TYPE_BOOL dims: [ 1 ] optional: true },
  { name: "prompt_embedding_table" data_type: TYPE_FP16 dims: [ -1, -1 ] optional: true allow_ragged_batch: true },
  { name: "prompt_vocab_size" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  # the unique task ID for the given LoRA.
  # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
  # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
  # If the cache is full the oldest LoRA will be evicted to make space for new ones. An error is returned if `lora_task_id` is not cached.
  { name: "lora_task_id" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors are first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  { name: "lora_weights" data_type: TYPE_FP16 dims: [ -1, -1 ] optional: true allow_ragged_batch: true },
  # module identifier (same size a first dimension of lora_weights)
  # See LoraModule::ModuleType for model id mapping
  #
  # "attn_qkv": 0 # compbined qkv adapter
  # "attn_q": 1 # q adapter
  # "attn_k": 2 # k adapter
  # "attn_v": 3 # v adapter
  # "attn_dense": 4 # adapter for the dense layer in attention
  # "mlp_h_to_4h": 5 # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  # "mlp_4h_to_h": 6 # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  # "mlp_gate": 7 # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  { name: "lora_config" data_type: TYPE_INT32 dims: [ -1, 3 ] optional: true allow_ragged_batch: true }
]

output [
  { name: "output_ids" data_type: TYPE_INT32 dims: [ -1, -1 ] },
  { name: "sequence_length" data_type: TYPE_INT32 dims: [ -1 ] },
  { name: "cum_log_probs" data_type: TYPE_FP32 dims: [ -1 ] },
  { name: "output_log_probs" data_type: TYPE_FP32 dims: [ -1, -1 ] },
  { name: "context_logits" data_type: TYPE_FP32 dims: [ -1, -1 ] },
  { name: "generation_logits" data_type: TYPE_FP32 dims: [ -1, -1, -1 ] }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]

parameters: { key: "max_beam_width" value: { string_value: "${max_beam_width}" } }
parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: { string_value: "no" } }
parameters: { key: "gpt_model_type" value: { string_value: "inflight_fused_batching" } }
parameters: { key: "gpt_model_path" value: { string_value: "/opt/tritonserver/inflight_batcher_llm/tensorrt_llm/1/" } }
parameters: { key: "max_tokens_in_paged_kv_cache" value: { string_value: "${max_tokens_in_paged_kv_cache}" } }
parameters: { key: "max_attention_window_size" value: { string_value: "${max_attention_window_size}" } }
parameters: { key: "batch_scheduler_policy" value: { string_value: "${batch_scheduler_policy}" } }
parameters: { key: "kv_cache_free_gpu_mem_fraction" value: { string_value: "0.9" } }
parameters: { key: "enable_trt_overlap" value: { string_value: "${enable_trt_overlap}" } }
parameters: { key: "exclude_input_in_output" value: { string_value: "true" } }
parameters: { key: "enable_kv_cache_reuse" value: { string_value: "true" } }
parameters: { key: "normalize_log_probs" value: { string_value: "${normalize_log_probs}" } }
parameters: { key: "enable_chunked_context" value: { string_value: "${enable_chunked_context}" } }
parameters: { key: "gpu_device_ids" value: { string_value: "${gpu_device_ids}" } }
parameters: { key: "lora_cache_optimal_adapter_size" value: { string_value: "${lora_cache_optimal_adapter_size}" } }
parameters: { key: "lora_cache_max_adapter_size" value: { string_value: "${lora_cache_max_adapter_size}" } }
parameters: { key: "lora_cache_gpu_memory_fraction" value: { string_value: "0" } }
parameters: { key: "lora_cache_host_memory_bytes" value: { string_value: "${lora_cache_host_memory_bytes}" } }
parameters: { key: "decoding_mode" value: { string_value: "${decoding_mode}" } }
parameters: { key: "worker_path" value: { string_value: "/opt/tritonserver/backends/tensorrtllm/triton_tensorrtllm_worker" } }
parameters: { key: "medusa_choices" value: { string_value: "${medusa_choices}" } }
```
Expected behavior
TensorRT launches normally on GPU 1, as specified