triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Expected batch dimension to be 1 for each request for input_ids #319

Open calvinh99 opened 9 months ago

calvinh99 commented 9 months ago

System Info

Hardware:

Libraries:

Who can help?

@juney-nvidia @byshiue

Information

Tasks

Reproduction

The problem is encountered when running the Triton Inference Server inside the docker container.

  1. Build the docker container
git clone -b main  https://github.com/triton-inference-server/tensorrtllm_backend.git

cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
  2. Run the docker container
docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /.../tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash
  3. Install libraries (inside docker container)
(cd tensorrt_llm &&
    bash docker/common/install_cmake.sh &&
    export PATH=/usr/local/cmake/bin:$PATH &&
    python3 ./scripts/build_wheel.py --trt_root="/usr/local/tensorrt" &&
    pip3 install ./build/tensorrt_llm*.whl)
  4. Build the TensorRT engine using tensorrtllm_backend/tensorrt_llm/examples/llama/build.py (used Mistral 7B Instruct weights)
python build.py \
--model_dir ./mistral/merged-mistral-ckpt-10/ \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--use_inflight_batching \
--enable_context_fmha \
--use_gemm_plugin float16 \
--paged_kv_cache \
--max_input_len 3096 \
--max_output_len 512 \
--world_size 1 \
--tp_size 1 \
--max_batch_size 64 \
--output_dir ./mistral/ckpt10/engines/fp16/1-gpu/
  5. Created the triton model repo and copied my engine files over; also set the config.pbtxt for the tensorrt_llm model
cd tensorrtllm_backend
mkdir triton_model_repo
cp -r all_models/inflight_batcher_llm/* triton_model_repo/

# edited the config.pbtxt for the models, will paste the config.pbtxt below

cp /tensorrtllm_backend/tensorrt_llm/examples/llama/mistral/ckpt10/engines/fp16/1-gpu/* triton_model_repo/tensorrt_llm/1

The triton_model_repo/tensorrt_llm/1/config.pbtxt:

name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 1024

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
  }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "4096"
  }
}
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "4096"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "max_utilization"
  }
}
parameters: {
  key: "max_num_sequences"
  value: {
    string_value: "64"
  }
}
parameters: {
  key: "enable_trt_overlap"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
  6. Run the Triton Inference Server (ports are different because 8000 was in use)
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo --grpc_port=8006 --http_port=8007 --metrics_port=8008
  7. My python code for inference
    
    # From the inflight_batcher_llm_client.py script
    def prepare_tensor(name, input):
        t = httpclient.InferInput(name, input.shape,
                                  np_to_triton_dtype(input.dtype))
        t.set_data_from_numpy(input)
        return t

My script:

def test_batched_request():
    max_input_len = 3096
    max_output_len = 512

# prepare text inputs
batch_text_input = ... # is a list of strings
batch_size = len(batch_text_input)

# Batched requests aren't working due to some error if batch dimension is not 1
batch_input_ids = tokenize_inputs(batch_text_input) # NOT padded to same sequence lengths yet
batch_input_lengths = np.array([len(x) for x in batch_input_ids], dtype=np.int32).reshape(-1, 1)
request_output_len = np.array([max_output_len] * batch_size, dtype=np.int32).reshape(-1, 1)
end_id = np.array([2] * batch_size, dtype=np.int32).reshape(-1, 1)

# Must pad or numpy throws error
padded_batch_ids = np.full((batch_size, max_input_len), 2, dtype=np.int32)
for i, input_ids in enumerate(batch_input_ids):
    padded_batch_ids[i, :len(input_ids)] = input_ids
padded_batch_ids = padded_batch_ids.astype(np.int32)

assert padded_batch_ids.shape == (batch_size, max_input_len)
assert batch_input_lengths.shape == (batch_size, 1)
assert request_output_len.shape == (batch_size, 1)
assert end_id.shape == (batch_size, 1)

# Create inference request:
model_name = "tensorrt_llm"

def prepare_tensor(name, input):
    t = httpclient.InferInput(
        name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t

inputs = [
    prepare_tensor("input_ids", padded_batch_ids),
    prepare_tensor("input_lengths", batch_input_lengths),
    prepare_tensor("request_output_len", request_output_len),
    prepare_tensor("end_id", end_id),
]

with httpclient.InferenceServerClient(url="localhost:8007", verbose=False) as client:
    infer_future = client.async_infer(model_name, inputs)
    result = infer_future.get_result()
    output = result.as_numpy('output_ids')

8. THE ERROR
```sh
Traceback (most recent call last):
  File "/.../triton_api.py", line 310, in <module>
    test_batched_request();
  File "/.../triton_api.py", line 223, in test_batched_request
    result = infer_future.get_result()
  File "/.../myenv/lib/python3.9/site-packages/tritonclient/http/_client.py", line 96, in get_result
    _raise_if_error(response)
  File "/.../myenv/lib/python3.9/site-packages/tritonclient/http/_utils.py", line 69, in _raise_if_error
    raise error
tritonclient.utils.InferenceServerException: [400] Encountered error for requestId 1714636916: Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: Expected batch dimension to be 1 for each request for input_ids (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:234)
1       0x7f49e8ad66fd /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x176fd) [0x7f49e8ad66fd]
2       0x7f49e8ad9914 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1a914) [0x7f49e8ad9914]
3       0x7f49e8b2d928 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6e928) [0x7f49e8b2d928]
4       0x7f49e8b2fea8 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x70ea8) [0x7f49e8b2fea8]
5       0x7f49e8b31b08 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72b08) [0x7f49e8b31b08]
6       0x7f4a19664253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f4a19664253]
7       0x7f4a193f4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f4a193f4ac3]
8       0x7f4a19485bf4 clone + 68
```

I've searched again and again, but couldn't find any information anywhere on this error: `[TensorRT-LLM][ERROR] Assertion failed: Expected batch dimension to be 1 for each request for input_ids (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:234)`

Expected behavior

I expect it to accept inputs with a batch_size greater than 1 (it was 6 in this case) and perform batched inference.

When I tested batched inference inside the docker container with tensorrtllm_backend/tensorrt_llm/examples/run.py, using the same TensorRT engine, it worked. So why doesn't it work through the Triton Inference Server?

For reference, this is how I ran it inside the docker container (not via the Triton Inference Server), from the tensorrtllm_backend/tensorrt_llm/examples directory:

python run.py \
--max_input_len=3096 \
--max_output_len=512 \
--tokenizer_dir ./llama/mistral/merged-mistral-ckpt-10/ \
--engine_dir=./llama/mistral/ckpt10/engines/fp16/1-gpu/ \
--max_attention_window_size=4096 \
--input_file=./llama/inputs_1.csv

It worked fine here.

I'm really not sure why this error happens; I couldn't find anything about it anywhere.

Actual behavior

The Triton server throws an error expecting the batch dimension to be 1.

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 23.10 (build 72127154)
Triton Server Version 2.39.0

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

root@sn4622118035:/app# python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo --grpc_port=8006 --http_port=8007 --metrics_port=8008
root@sn4622118035:/app# I0128 17:39:49.681249 111 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7efc9a000000' with size 268435456
I0128 17:39:49.686833 111 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0128 17:39:49.686841 111 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0128 17:39:49.807320 111 model_lifecycle.cc:461] loading: tensorrt_llm:1
[TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.9 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
[TensorRT-LLM][WARNING] Parameter version cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'version' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][WARNING] The value of maxAttentionWindow cannot exceed mMaxSeqLen. Therefore, it has been adjusted to match the value of mMaxSeqLen.
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 64
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 3608
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] Loaded engine size: 13815 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13863, GPU 14445 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 13864, GPU 14455 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +13812, now: CPU 0, GPU 13812 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +8, now: CPU 13893, GPU 14869 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 13893, GPU 14877 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13812 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13913, GPU 14897 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 13913, GPU 14907 (MiB)
[TensorRT-LLM][WARNING] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.4
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 13812 (MiB)
[TensorRT-LLM][INFO] Allocate 536870912 bytes for k/v cache. 
[TensorRT-LLM][INFO] Using 4096 total tokens in paged KV cache, and 29 blocks per sequence
[TensorRT-LLM][WARNING] max_num_sequences is smaller than  2 times the engine max_batch_size. Batches smaller than max_batch_size will be executed.
I0128 17:39:56.304525 111 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
I0128 17:39:56.304688 111 server.cc:592] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0128 17:39:56.304764 111 server.cc:619] 
+-------------+-----------------------------------+-----------------------------------+
| Backend     | Path                              | Config                            |
+-------------+-----------------------------------+-----------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensor | {"cmdline":{"auto-complete-config |
|             | rtllm/libtriton_tensorrtllm.so    | ":"false","backend-directory":"/o |
|             |                                   | pt/tritonserver/backends","min-co |
|             |                                   | mpute-capability":"6.000000","def |
|             |                                   | ault-max-batch-size":"4"}}        |
|             |                                   |                                   |
+-------------+-----------------------------------+-----------------------------------+

I0128 17:39:56.304799 111 server.cc:662] 
+--------------+---------+--------+
| Model        | Version | Status |
+--------------+---------+--------+
| tensorrt_llm | 1       | READY  |
+--------------+---------+--------+

I0128 17:39:56.373242 111 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A100 80GB PCIe
I0128 17:39:56.373267 111 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA A100 80GB PCIe
I0128 17:39:56.373495 111 metrics.cc:710] Collecting CPU metrics
I0128 17:39:56.373595 111 tritonserver.cc:2458] 
+----------------------------------+----------------------------------------------------+
| Option                           | Value                                              |
+----------------------------------+----------------------------------------------------+
| server_id                        | triton                                             |
| server_version                   | 2.39.0                                             |
| server_extensions                | classification sequence model_repository model_rep |
|                                  | ository(unload_dependents) schedule_policy model_c |
|                                  | onfiguration system_shared_memory cuda_shared_memo |
|                                  | ry binary_tensor_data parameters statistics trace  |
|                                  | logging                                            |
| model_repository_path[0]         | /tensorrtllm_backend/triton_model_repo             |
| model_control_mode               | MODE_NONE                                          |
| strict_model_config              | 1                                                  |
| rate_limit                       | OFF                                                |
| pinned_memory_pool_byte_size     | 268435456                                          |
| cuda_memory_pool_byte_size{0}    | 67108864                                           |
| cuda_memory_pool_byte_size{1}    | 67108864                                           |
| min_supported_compute_capability | 6.0                                                |
| strict_readiness                 | 1                                                  |
| exit_timeout                     | 30                                                 |
| cache_enabled                    | 0                                                  |
+----------------------------------+----------------------------------------------------+

I0128 17:39:56.375119 111 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8006
I0128 17:39:56.375286 111 http_server.cc:4497] Started HTTPService at 0.0.0.0:8007
I0128 17:39:56.434757 111 http_server.cc:270] Started Metrics Service at 0.0.0.0:8008

root@sn4622118035:/app# [TensorRT-LLM][ERROR] Cannot process new request: [TensorRT-LLM][ERROR] Assertion failed: Expected batch dimension to be 1 for each request for input_ids (/home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/llm/cpp/tensorrt_llm/batch_manager/GptManager.cpp:234)
1       0x7f49e8ad66fd /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x176fd) [0x7f49e8ad66fd]
2       0x7f49e8ad9914 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1a914) [0x7f49e8ad9914]
3       0x7f49e8b2d928 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6e928) [0x7f49e8b2d928]
4       0x7f49e8b2fea8 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x70ea8) [0x7f49e8b2fea8]
5       0x7f49e8b31b08 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x72b08) [0x7f49e8b31b08]
6       0x7f4a19664253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f4a19664253]
7       0x7f4a193f4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f4a193f4ac3]
8       0x7f4a19485bf4 clone + 68

Additional notes

Some additional issues occurred that I'm not sure about (I fixed them by just changing the config.pbtxt, but I don't really understand why they happen).

When I set the "max_tokens_in_paged_kv_cache" parameter in my config to 8192, the server started treating my inputs as incorrectly shaped (even though I made no other changes). When I changed it back to 4096, everything worked fine again.

parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "4096" # If this gets changed to 8192, server starts thinking my inputs have wrong shape
  }
}

Also, the documentation for batched requests points to the example script tensorrtllm_backend/inflight_batcher_llm/client/inflight_batcher_llm_client.py.

But the default code seems to force the batch size to 1, check here.

The other parts of this example script were very helpful; I just couldn't find any clues about inference beyond a batch size of 1.
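
For anyone comparing with my code above: the client example builds every request with a leading batch dimension of 1, i.e. one sequence per request. A minimal sketch of the shapes it expects (the token ids below are placeholders, not real tokenizer output):

```python
import numpy as np

# One request = one sequence, with a leading batch dimension of 1.
input_ids = np.array([[1, 4321, 8765, 2]], dtype=np.int32)        # shape (1, seq_len)
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.int32)  # shape (1, 1)
request_output_len = np.array([[512]], dtype=np.int32)            # shape (1, 1)
end_id = np.array([[2]], dtype=np.int32)                          # shape (1, 1)
```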

caronzh03 commented 9 months ago

I'm running into the exact same issue. Any timeline for a fix?

pcastonguay commented 8 months ago

Currently, the C++ Triton backend only accepts batch size 1 requests. We use in-flight batching to create larger batches from those batch size 1 requests. We don't have a timeline for supporting batch size > 1 requests with in-flight batching.

ZihanLiao commented 8 months ago

> any timeline for the fix?

So you mean I cannot split a paragraph of text into a batch of sentences and send it as one batched request? That kind of request would fail, right?

pcastonguay commented 8 months ago

You could just send multiple requests, each request containing a single sentence.
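
If it helps future readers, here is a minimal sketch of that suggestion, assuming the same tritonclient HTTP API and port used in the issue: each sentence becomes its own request with a leading batch dimension of 1, and the requests are submitted concurrently so the backend can group them via in-flight batching. The dummy token ids stand in for real tokenizer output.

```python
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

def prepare_tensor(name, input):
    t = httpclient.InferInput(name, input.shape, np_to_triton_dtype(input.dtype))
    t.set_data_from_numpy(input)
    return t

def build_inputs(token_ids, max_output_len=512, end_id=2):
    # One request = batch dimension 1: input_ids is (1, seq_len), the rest are (1, 1).
    ids = np.array([token_ids], dtype=np.int32)
    return [
        prepare_tensor("input_ids", ids),
        prepare_tensor("input_lengths", np.array([[ids.shape[1]]], dtype=np.int32)),
        prepare_tensor("request_output_len", np.array([[max_output_len]], dtype=np.int32)),
        prepare_tensor("end_id", np.array([[end_id]], dtype=np.int32)),
    ]

# Dummy token ids standing in for tokenizer output, one list per sentence.
sentences_token_ids = [[1, 101, 102, 2], [1, 201, 202, 203, 2], [1, 301, 2]]

with httpclient.InferenceServerClient(url="localhost:8007", concurrency=8) as client:
    # Fire all requests without waiting; the server batches them in flight.
    futures = [client.async_infer("tensorrt_llm", build_inputs(t)) for t in sentences_token_ids]
    outputs = [f.get_result().as_numpy("output_ids") for f in futures]
```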