triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Verify if inflight batching is running #378

Closed: bprus closed this issue 1 month ago

bprus commented 3 months ago

Reproduction

I followed the official example for the Llama model: https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0/examples/llama. I'm able to set everything up, and everything runs smoothly. Inflight batching is turned on in the model.

However, when I run benchmark_core_model.py:

python3 benchmark_core_model.py -i grpc --max-input-len 1024 --num-requests 200 token-norm-dist --input-mean 128 --input-stdev 5 --output-mean 500 --output-stdev 20

I get:

[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 200 prompts.
[INFO] Total Latency: 92768.536 ms
[INFO] Total request latencies: 11858259.309999997 ms
+----------------------------+----------+
|            Stat            |  Value   |
+----------------------------+----------+
|        Requests/Sec        |   2.16   |
|       OP tokens/sec        |  804.54  |
|     Avg. latency (ms)      | 59291.30 |
|      P99 latency (ms)      | 92361.86 |
|      P90 latency (ms)      | 86114.03 |
| Avg. IP tokens per request |  128.19  |
| Avg. OP tokens per request |  373.18  |
|   Avg. InFlight requests   |   0.00   |
|     Total latency (ms)     | 92768.17 |
|       Total requests       |  200.00  |
+----------------------------+----------+

I wonder why Avg. InFlight requests is 0.0. Do I need to set anything to use inflight batching?

I built the model with:

python3 convert_checkpoint.py --model_dir meta-llama/Llama-2-13b-chat-hf \
                              --output_dir /models/rt/Llama-2-13b-chat-hf_4gpu_fp16 \
                              --dtype float16 \
                              --tp_size 4 \
                              --workers 4

trtllm-build --checkpoint_dir /models/rt/Llama-2-13b-chat-hf_4gpu_fp16 \
             --output_dir /models/engines/Llama-2-13b-chat-hf_4gpu_fp16_pc \
             --gemm_plugin float16 \
             --workers 4 \
             --use_custom_all_reduce disable \
             --remove_input_padding enable \
             --use_paged_context_fmha enable \
             --max_batch_size 64

Triton Server logs:

{"Active Request Count":64,"Context Requests":0,"Free KV cache blocks":325,"Generation Requests":64,"Iteration Counter":1278,"Max KV cache blocks":581,"Max Request Count":64,"MicroBatch ID":0,"Runtime CPU Memory Usage":1956,"Runtime GPU Memory Usage":41488128,"Runtime Pinned Memory Usage":1073742848,"Scheduled Requests":64,"Terminated Requests":0,"Timestamp":"03-18-2024 13:56:15","Tokens per KV cache block":128,"Total Context Tokens":0,"Used KV cache blocks":256}

The logs suggest that some kind of batching is happening ("Active Request Count":64 and "Scheduled Requests":64).
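
For reference, here is a minimal sketch of how such per-iteration log lines could be checked programmatically. It assumes the stats are captured as single-line JSON objects in a log file; the file path is a placeholder and the field names are taken from the line above.

import json

# Placeholder path: point this at wherever the Triton server log was captured.
LOG_FILE = "triton_server.log"

mixed = 0
total = 0
with open(LOG_FILE) as f:
    for line in f:
        line = line.strip()
        # Per-iteration stats are assumed to be emitted as single-line JSON objects; skip the rest.
        if not line.startswith("{"):
            continue
        try:
            stats = json.loads(line)
        except json.JSONDecodeError:
            continue
        if "Context Requests" not in stats or "Generation Requests" not in stats:
            continue
        total += 1
        # An iteration that schedules context and generation requests together
        # shows that new and already-running requests are being batched in flight.
        if stats["Context Requests"] > 0 and stats["Generation Requests"] > 0:
            mixed += 1

print(f"{mixed}/{total} iterations contained both context and generation requests")

If at least some iterations report both counts greater than zero, requests are being interleaved in flight.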

Please recommend how I can correctly verify that inflight batching is enabled and working as expected.

Expected behavior

Working inflight batching.

Actual behavior

I'm not sure if inflight batching is working as expected.

Additional notes

My model config:

name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 64

model_transaction_policy {
  decoupled: False
}

dynamic_batching {
    max_queue_delay_microseconds: 1000000
    preferred_batch_size: [ 64 ]
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "draft_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "embedding_bias"
    data_type: TYPE_FP32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "streaming"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors is first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  {
    name: "lora_weights"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  # module identifier (same size as the first dimension of lora_weights)
  # See LoraModule::ModuleType for model id mapping
  #
  # "attn_qkv": 0     # compbined qkv adapter
  # "attn_q": 1       # q adapter
  # "attn_k": 2       # k adapter
  # "attn_v": 3       # v adapter
  # "attn_dense": 4   # adapter for the dense layer in attention
  # "mlp_h_to_4h": 5  # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  # "mlp_4h_to_h": 6  # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  # "mlp_gate": 7     # for llama2 adapter for gated mlp later after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  {
    name: "lora_config"
    data_type: TYPE_INT32
    dims: [ -1, 3 ]
    optional: true
    allow_ragged_batch: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters: {
  key: "max_beam_width"
  value: {
    string_value: "1"
  }
}
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/models/engines/Llama-2-13b-chat-hf_4gpu_fp16_wq4_pc"
  }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "${max_tokens_in_paged_kv_cache}"
  }
}
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "2560"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "max_utilization"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}
parameters: {
  key: "enable_trt_overlap"
  value: {
    string_value: "${enable_trt_overlap}"
  }
}
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "True"
  }
}
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "False"
  }
}
parameters: {
  key: "normalize_log_probs"
  value: {
    string_value: "${normalize_log_probs}"
  }
}
parameters: {
  key: "enable_chunked_context"
  value: {
    string_value: "${enable_chunked_context}"
  }
}
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "${gpu_device_ids}"
  }
}
XiaobingSuper commented 2 months ago

I also have the same issue. Any update?

XiaobingSuper commented 2 months ago

In the code at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/tools/utils/utils.py#L384, the value of avg_in_flight_requests is never updated.
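
As a client-side workaround, the benchmark could sample the number of outstanding requests itself. Below is only an illustrative sketch, assuming the benchmark sends requests asynchronously and runs a completion callback per request; the class and method names are not part of the repository.

import threading
import time

class InFlightTracker:
    """Illustrative helper: periodically samples how many requests are currently
    in flight so a client-side average can be reported (the stock benchmark
    script hard-codes this value to 0)."""

    def __init__(self, sample_interval_s: float = 0.1):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._samples = []
        self._interval = sample_interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._sample_loop, daemon=True)

    def start(self):
        self._thread.start()

    def request_started(self):
        # Call right before a request is issued.
        with self._lock:
            self._in_flight += 1

    def request_finished(self):
        # Call from the completion callback of each request.
        with self._lock:
            self._in_flight -= 1

    def stop(self) -> float:
        # Stop sampling and return the average observed in-flight count.
        self._stop.set()
        self._thread.join()
        return sum(self._samples) / len(self._samples) if self._samples else 0.0

    def _sample_loop(self):
        while not self._stop.is_set():
            with self._lock:
                self._samples.append(self._in_flight)
            time.sleep(self._interval)

The benchmark would call request_started() just before issuing each request and request_finished() in its completion callback; stop() then returns the sampled average that is currently reported as 0.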

schetlur-nv commented 2 months ago

Hey, that value is not implemented in the code today; it is hard-coded to 0. This does not mean IFB is not active. We'll try to fix this. I would suggest removing dynamic_batching and preferred_batch_size from the Triton config. If you'd like, you can inspect per-iteration statistics (https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#triton-metrics is probably the easiest way), which will tell you how many prompt and generation requests are in each iteration. Having > 0 of both in any iteration is conclusive evidence of IFB working.
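
For example, a minimal way to poll the metrics endpoint and filter for the backend's counters might look like the sketch below. It assumes Triton was started with metrics enabled on the default port 8002; the "trt_llm" substring filter is illustrative, and the exact metric names are documented in the README section linked above.

import urllib.request

# Default Triton metrics endpoint; adjust host/port if your deployment differs.
METRICS_URL = "http://localhost:8002/metrics"

with urllib.request.urlopen(METRICS_URL) as resp:
    body = resp.read().decode("utf-8")

# Print only the backend's metric lines, skipping Prometheus comment lines.
for line in body.splitlines():
    if "trt_llm" in line and not line.startswith("#"):
        print(line)

Polling this while the benchmark is running should show the per-iteration context and generation request counters moving together when IFB is active.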