triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Input tensor 'host_sink_token_length' not found when launching llama2-7b. #285

Open xxyux opened 6 months ago

xxyux commented 6 months ago

I installed tensorrtllm_backend in the following way:

  1. docker pull nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3
  2. docker run -v /data2/share/:/data/ -v /mnt/sdb/benchmark/xiangrui:/root -it -d --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN --security-opt seccomp=unconfined --gpus=all --shm-size=16g --privileged --ulimit memlock=-1 --name=develop nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3 bash
  3. git clone git@github.com:triton-inference-server/tensorrtllm_backend.git --recursive
  4. apt-get update && apt-get -y install git git-lfs
     pip install cmake
     cd tensorrtllm_backend
     git submodule update --init --recursive
     git lfs install
     git lfs pull
  5. cd tensorrt_llm
     python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt
     pip install tensorrt_llm-0.7.1-cp310-cp310-linux_x86_64.whl
  6. Now tensorrt_llm is installed:
    >>> import tensorrt
    >>> import tensorrt_llm
    >>> tensorrt.__version__
    '9.2.0.post12.dev5'
    >>> tensorrt_llm.__version__
    '0.7.1'
  7. follow this doc https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md
  8. At the "Launch server" step, I met this issue https://github.com/NVIDIA/TensorRT-LLM/issues/656: Assertion failed: d == a + length (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:326). Then I ran cp tensorrt_llm/build/lib/tensorrt_llm/libs/* /opt/tritonserver/backends/tensorrtllm/, which solved that problem.
  9. After that, I launched the server successfully with python3 scripts/launch_triton_server.py --world_size 1 --model_repo=triton_model_repo/. The messages are shown below:
    
    [TensorRT-LLM][INFO] Allocate 1342177280 bytes for k/v cache.
    [TensorRT-LLM][INFO] Using 2560 total tokens in paged KV cache, and 20 blocks per sequence
    I0107 12:18:03.563733 4223 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
    I0107 12:18:03.565248 4223 model_lifecycle.cc:461] loading: ensemble:1
    I0107 12:18:03.565653 4223 model_lifecycle.cc:818] successfully loaded 'ensemble'
    I0107 12:18:03.565748 4223 server.cc:606] 
    +------------------+------+
    | Repository Agent | Path |
    +------------------+------+
    +------------------+------+
I0107 12:18:03.565813 4223 server.cc:633]
+-------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                          |
+-------------+-----------------------------------------------------------------+-----------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+-----------------------------------------------------------------+

I0107 12:18:03.565860 4223 server.cc:676]
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| ensemble         | 1       | READY  |
| postprocessing   | 1       | READY  |
| preprocessing    | 1       | READY  |
| tensorrt_llm     | 1       | READY  |
| tensorrt_llm_bls | 1       | READY  |
+------------------+---------+--------+

I0107 12:18:03.712204 4223 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA RTX A6000
I0107 12:18:03.726368 4223 metrics.cc:710] Collecting CPU metrics
I0107 12:18:03.726577 4223 tritonserver.cc:2483]
+----------------------------------+------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                |
+----------------------------------+------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                               |
| server_version                   | 2.41.0                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_c |
|                                  | onfiguration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace  |
|                                  | logging                                                                                              |
| model_repository_path[0]         | triton_model_repo/                                                                                   |
| model_control_mode               | MODE_NONE                                                                                            |
| strict_model_config              | 1                                                                                                    |
| rate_limit                       | OFF                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                             |
| min_supported_compute_capability | 6.0                                                                                                  |
| strict_readiness                 | 1                                                                                                    |
| exit_timeout                     | 30                                                                                                   |
| cache_enabled                    | 0                                                                                                    |
+----------------------------------+------------------------------------------------------------------------------------------------------+

I0107 12:18:03.744520 4223 grpc_server.cc:2495] Started GRPCInferenceService at 0.0.0.0:8001
I0107 12:18:03.744823 4223 http_server.cc:4619] Started HTTPService at 0.0.0.0:8000
I0107 12:18:03.804746 4223 http_server.cc:282] Started Metrics Service at 0.0.0.0:8002

11. Test: when I use the command `curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'`, the following errors happen:

root@d0b11d0dea8b:/tensorrtllm_backend# curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
[TensorRT-LLM][ERROR] Encountered an error in forward function: Input tensor 'host_sink_token_length' not found; expected shape: (1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:124)
1 0x7f830f5793e3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1273e3) [0x7f830f5793e3]
2 0x7f830f4cbeb1 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x79eb1) [0x7f830f4cbeb1]
3 0x7f830f4ccfa6 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7afa6) [0x7f830f4ccfa6]
4 0x7f830f4d0f0d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7ef0d) [0x7f830f4d0f0d]
5 0x7f830f4bba28 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x69a28) [0x7f830f4bba28]
6 0x7f830f4bffb5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6dfb5) [0x7f830f4bffb5]
7 0x7f838604f253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f838604f253]
8 0x7f8385ddfac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8385ddfac3]
9 0x7f8385e71660 /lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7f8385e71660]
[TensorRT-LLM][ERROR] Encountered error for requestId 1804289384: Encountered an error in forward function: Input tensor 'host_sink_token_length' not found; expected shape: (1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:124)
1 0x7f830f5793e3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1273e3) [0x7f830f5793e3]
2 0x7f830f4cbeb1 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x79eb1) [0x7f830f4cbeb1]
3 0x7f830f4ccfa6 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7afa6) [0x7f830f4ccfa6]
4 0x7f830f4d0f0d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7ef0d) [0x7f830f4d0f0d]
5 0x7f830f4bba28 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x69a28) [0x7f830f4bba28]
6 0x7f830f4bffb5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6dfb5) [0x7f830f4bffb5]
7 0x7f838604f253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f838604f253]
8 0x7f8385ddfac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8385ddfac3]
9 0x7f8385e71660 /lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7f8385e71660]
[TensorRT-LLM][WARNING] Step function failed, continuing.
{"error":"in ensemble 'ensemble', Encountered error for requestId 1804289384: Encountered an error in forward function: Input tensor 'host_sink_token_length' not found; expected shape: (1) (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:124)\n1 0x7f830f5793e3 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x1273e3) [0x7f830f5793e3]\n2 0x7f830f4cbeb1 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x79eb1) [0x7f830f4cbeb1]\n3 0x7f830f4ccfa6 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7afa6) [0x7f830f4ccfa6]\n4 0x7f830f4d0f0d /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x7ef0d) [0x7f830f4d0f0d]\n5 0x7f830f4bba28 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x69a28) [0x7f830f4bba28]\n6 0x7f830f4bffb5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x6dfb5) [0x7f830f4bffb5]\n7 0x7f838604f253 /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f838604f253]\n8 0x7f8385ddfac3 /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f8385ddfac3]\n9 0x7f8385e71660 /lib/x86_64-linux-gnu/libc.so.6(+0x126660) [0x7f8385e71660]"}root@d0b11d0dea8b:/tensorrtllm_backend# root@d0b11d0dea8b:/tensorrtllm_backend# tmux a



Can anyone help me? Please!
xxyux commented 6 months ago

It seems like I should not have run the command cp tensorrt_llm/build/lib/tensorrt_llm/libs/* /opt/tritonserver/backends/tensorrtllm/, because the error happens in /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so. Am I right?

xxyux commented 6 months ago

@byshiue pls.

xxyux commented 6 months ago

Both git branches are on main:

root@d0b11d0dea8b:/tensorrtllm_backend# git branch -av
* main                                   6e6e34e Update TensorRT-LLM backend (#272)
  remotes/origin/HEAD                    -> origin/main
  remotes/origin/fpetrini-cli-dev        fda8635 Don't uninstall trt_llm
  remotes/origin/fpetrini-triton-metrics 226c3c0 Updated gen script
  remotes/origin/kaiyu/update-rel        9edd83a Update version.txt
  remotes/origin/krish-fix-test          922b0e1 Fix test
  remotes/origin/krish-trtllm-size       98c0a5f Fix up
  remotes/origin/main                    6e6e34e Update TensorRT-LLM backend (#272)
  remotes/origin/r23.12                  9aedcf3 Update TensorRT-LLM backend (#241)
  remotes/origin/rel                     4344654 Update TensorRT-LLM backend release branch (#260)
  remotes/origin/release/0.5.0           47b609b Update doc (#78)
root@d0b11d0dea8b:/tensorrtllm_backend# cd tensorrt_llm/
root@d0b11d0dea8b:/tensorrtllm_backend/tensorrt_llm# git branch -av
* main                         6cc5e17 Update issue templates
  remotes/origin/HEAD          -> origin/main
  remotes/origin/gh-pages      0a75cdb Update gh-pages (#750)
  remotes/origin/main          6cc5e17 Update issue templates
  remotes/origin/rel           2f169d1 Add batch manager static lib for Windows (#814)
  remotes/origin/release/0.5.0 a21e2f8 Fix an issue of 
byshiue commented 6 months ago

The TensorRT-LLM version in nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3 is v0.7.0, so you will encounter this issue when you build the engine with v0.7.1. I suggest using the Dockerfile to build the Docker image again to make sure your tritonserver container also installs v0.7.1.
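
A minimal sketch of this version check, run inside the Triton container; the expected version string "0.7.1" comes from step 6 of the original post, and the script itself is illustrative, not part of the repo:

    # Minimal sketch: confirm the TensorRT-LLM runtime inside the Triton
    # container matches the version that built the engine (0.7.1 per step 6).
    import tensorrt_llm

    expected = "0.7.1"  # version used to build the engine
    installed = tensorrt_llm.__version__
    if installed != expected:
        raise RuntimeError(
            f"Engine built with {expected} but container runs {installed}; "
            "rebuild the Triton image so the versions match."
        )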

xxyux commented 6 months ago

The TensorRT-LLM version in nvcr.io/nvidia/tritonserver:23.12-trtllm-python-py3 is v0.7.0, so you will encounter this issue when you build the engine with v0.7.1. I suggest using the Dockerfile to build the Docker image again to make sure your tritonserver container also installs v0.7.1.

THX sir!!!

So, I should use this command to build the image, whose tensorrt_llm version is v0.7.1:

# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

and then follow the remaining steps (from step 5 [build tensorrt_llm] onward) in my issue?

@byshiue pls

byshiue commented 5 months ago

Yes.

xxyux commented 5 months ago

Thanks! I installed tensorrtllm_backend successfully using the image built with the command DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend . After launching the server, I tested it in the following ways, as described in this doc.

  1. Send request
    # Ask:
    curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
    # Answer:
    {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\nMachine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate"}
  2. Send request by inflight_batcher_llm_client.py
    root@ps:/tensorrtllm_backend# export HF_LLAMA_MODEL=/data/llama/Llama-2-7b-hf/
    root@ps:/tensorrtllm_backend# python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer-dir ${HF_LLAMA_MODEL}
    =========
    Input sequence:  [1, 19298, 297, 6641, 29899, 23027, 3444, 29892, 1105, 7598, 16370, 408, 263]
    Got completed request
    Input: Born in north-east France, Soyer trained as a
    Output beam 0: . He was a member of the Société des Artistes Français and exhibited at the Paris Salon from 1861. He was also a member of the Société des Artistes Indépendants.
    Soyer was a painter of genre scenes, portraits and landscapes. He was also a lithographer and etcher.
    Soyer was a friend of the composer Hector Berlioz and the writer Victor Hugo.
    Soyer died in Paris in 1907.
    The artist's works can be found in the collections of the Musée d'Orsay in Paris, the Musée des Beaux-Arts in Nancy, the Musée des Beaux-Arts in Rouen, the Musée des Beaux-Arts in Reims, the Musée des Beaux-Arts in Lille, the Musée des Beaux-Arts in Le
    Output sequence:  [23187, 472, 278, 3067, 10936, 553, 1522, 2993, 29899, 1433, 1372, 297, 3681, 29889, 940, 471, 263, 4509, 310, 278, 21903, 553, 3012, 9230, 1352, 6899, 322, 10371, 1573, 472, 278, 3681, 3956, 265, 515, 29871, 29896, 29947, 29953, 29896, 29889, 940, 471, 884, 263, 4509, 310, 278, 21903, 553, 3012, 9230, 1894, 6430, 355, 1934, 29889, 13, 6295, 7598, 471, 263, 23187, 310, 16151, 20407, 29892, 2011, 336, 1169, 322, 2982, 1557, 11603, 29889, 940, 471, 884, 263, 301, 389, 1946, 261, 322, 634, 4630, 29889, 13, 6295, 7598, 471, 263, 5121, 310, 278, 18422, 379, 3019, 2292, 492, 2112, 322, 278, 9227, 12684, 20650, 29889, 13, 6295, 7598, 6423, 297, 3681, 297, 29871, 29896, 29929, 29900, 29955, 29889, 13, 1576, 7664, 29915, 29879, 1736, 508, 367, 1476, 297, 278, 16250, 310, 278, 26273, 270, 29915, 29949, 2288, 388, 297, 3681, 29892, 278, 26273, 553, 1522, 2993, 29899, 1433, 1372, 297, 24190, 29892, 278, 26273, 553, 1522, 2993, 29899, 1433, 1372, 297, 15915, 264, 29892, 278, 26273, 553, 1522, 2993, 29899, 1433, 1372, 297, 830, 9893, 29892, 278, 26273, 553, 1522, 2993, 29899, 1433, 1372, 297, 365, 1924, 29892, 278, 26273, 553, 1522, 2993, 29899, 1433, 1372, 297, 951]
    Exception ignored in: <function InferenceServerClient.__del__ at 0x7fd52f563370>
    Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/tritonclient/grpc/_client.py", line 257, in __del__
    File "/usr/local/lib/python3.10/dist-packages/tritonclient/grpc/_client.py", line 265, in close
    File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 2101, in close
    File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 2082, in _close
    AttributeError: 'NoneType' object has no attribute 'StatusCode'

    but there has been an AttributeError: 'NoneType' object has no attribute 'StatusCode'. What causes this error? How can I solve it? @byshiue please

kaiyux commented 5 months ago

but there has been an AttributeError: 'NoneType' object has no attribute 'StatusCode'.

@xxyux It's very likely that the issue is in one of the dependencies of the TensorRT-LLM backend. I tried pip3 install -r requirements.txt to update the dependencies, and the issue is gone. Could you please try that as well?
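
A small hypothetical helper for verifying that the reinstall actually changed something; the package names listed are illustrative guesses, not taken from this thread:

    # Hypothetical helper: print installed versions of client-side packages so
    # they can be compared against requirements.txt before and after reinstall.
    from importlib.metadata import PackageNotFoundError, version

    for pkg in ("tritonclient", "grpcio", "numpy", "transformers"):
        try:
            print(pkg, version(pkg))
        except PackageNotFoundError:
            print(pkg, "not installed")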

lyc728 commented 5 months ago

but there has been an AttributeError: 'NoneType' object has no attribute 'StatusCode'.

@xxyux It's very likely that the issue is in one of the dependencies of the TensorRT-LLM backend. I tried pip3 install -r requirements.txt to update the dependencies, and the issue is gone. Could you please try that as well?

Hello, I get: ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. tensorrt-llm 0.7.0 requires transformers==4.33.1, but you have transformers 4.31.0 which is incompatible. And the error AttributeError: 'NoneType' object has no attribute 'StatusCode' still happens.

zhouxiao999 commented 3 months ago

I got the same error ('NoneType' object has no attribute 'StatusCode') when I tested inflight_batcher_llm_client.py. I tried "pip3 install -r requirements.txt" but it did not help. Versions: tensorrt 9.2.0.post12.dev5, tensorrt-llm 0.8.0, torch 2.1.2, transformers 4.36.1, triton 2.1.0.

root@acce067401db:/home/zy/data8tb/zx/tensorrtllm_backend# python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer-dir ${HF_LLAMA_MODEL}
=========
Input sequence:  [1, 19298, 297, 6641, 29899, 23027, 3444, 29892, 1105, 7598, 16370, 408, 263]
Got completed request
Input: Born in north-east France, Soyer trained as a
Output beam 0: . He was a member of the Société des Artistes Français and exhibited at the Paris Salon from 1861. He was also a member of the Société des Artistes Indépendants.
Soyer was a painter of genre scenes, portraits and landscapes. He was also a lithographer and etcher.
Soyer was a friend of the composer Hector Berlioz and the writer Victor Hugo.
Soyer died in Paris in 1907.
The artist's works can be found in the collections of the Musée d'Orsay in Paris, the Musée des Beaux-Arts in Nancy, the Musée des Beaux-Arts in Rouen, the Musée des Beaux-Arts in Reims, the Musée des Beaux-Arts in Lille, the Musée des Beaux-Arts in Le
Output sequence:  [23187, 472, 278, 3067, 10936, 553, 1522, 2993, 29899, 1433, 1372, 297, 3681, 29889, 940, 471, 263, 4509, 310, 278, 21903, 553, 3012, 9230, 1352, 6899, 322, 10371, 1573, 472, 278, 3681, 3956, 265, 515, 29871, 29896, 29947, 29953, 29896, 29889, 940, 471, 884, 263, 4509, 310, 278, 21903, 553, 3012, 9230, 1894, 6430, 355, 1934, 29889, 13, 6295, 7598, 471, 263, 23187, 310, 16151, 20407, 29892, 2011, 336, 1169, 322, 2982, 1557, 11603, 29889, 940, 471, 884, 263, 301, 389, 1946, 261, 322, 634, 4630, 29889, 13, 6295, 7598, 471, 263, 5121, 310, 278, 18422, 379, 3019, 2292, 492, 2112, 322, 278, 9227, 12684, 20650, 29889, 13, 6295, 7598, 6423, 297, 3681, 297, 29871, 29896, 29929, 29900, 29955, 29889, 13, 1576, 7664, 29915, 29879, 1736, 508, 367, 1476, 297, 278, 16250, 310, 278, 26273, 270, 29915, 29949, 2288, 388, 297, 3681, 29892, 278, 26273, 553, 1522, 2993, 29899, 1433, 1372, 297, 24190, 29892, 278, 26273, 553, 1522, 2993, 29899, 1433, 1372, 297, 15915, 264, 29892, 278, 26273, 553, 1522, 2993, 29899, 1433, 1372, 297, 830, 9893, 29892, 278, 26273, 553, 1522, 2993, 29899, 1433, 1372, 297, 365, 1924, 29892, 278, 26273, 553, 1522, 2993, 29899, 1433, 1372, 297, 951]
Exception ignored in: <function InferenceServerClient.__del__ at 0x7f40ac90b370>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tritonclient/grpc/_client.py", line 257, in __del__
  File "/usr/local/lib/python3.10/dist-packages/tritonclient/grpc/_client.py", line 265, in close
  File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 2181, in close
  File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 2162, in _close
AttributeError: 'NoneType' object has no attribute 'StatusCode'

plt12138 commented 3 months ago

Same error. Triton: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3, tensorrtllm_backend: v0.8.0, model: Mixtral-8x7b

dtlzhuangz commented 3 months ago

same error +1

shiqingzhangCSU commented 3 months ago

same error +1 Tensorrtllm_backend: v0.8.0 model: llama7b

XiaobingSuper commented 2 months ago

same error +1 Tensorrtllm_backend: v0.8.0 model: llama7b

hscspring commented 1 month ago

Do not just init the client and leave cleanup to garbage collection; use `with client` instead. See the sketch below.
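
A minimal sketch of that suggestion, assuming the tritonclient gRPC client used by inflight_batcher_llm_client.py; close() is the method visible in the tracebacks above, and the url value is illustrative:

    # Minimal sketch: manage the client's lifetime explicitly instead of
    # relying on InferenceServerClient.__del__, which can run during
    # interpreter shutdown after grpc internals are torn down (the
    # AttributeError seen above).
    import tritonclient.grpc as grpcclient

    with grpcclient.InferenceServerClient(url="localhost:8001") as client:
        # issue requests here; is_server_ready() stands in for real calls
        print(client.is_server_ready())
    # the channel is closed on exit, so nothing is left for __del__ to do

If the installed tritonclient version does not support the context-manager protocol, calling client.close() explicitly in a finally block achieves the same effect.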