jiangshining opened this issue 1 year ago
@jiangshining Thank you for the report. Could you share end-to-end reproduction steps to help us reproduce the issue you encountered? That helps us avoid spending too much time aligning the test settings.
I've got the same issue (Baichuan2 13B, max batch size 2, concurrency 4, Triton 23.08).
However, it seems to be resolved now that I've tried the Triton TRT-LLM 23.10 NGC image today; you could try upgrading.
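For reference, that image can be pulled from NGC (tag assumed here from the standard Triton TRT-LLM release naming):

docker pull nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3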
@byshiue I ran into the same issue with the following steps (on a 3090 GPU). Engine build args:
"args": [
"--max_batch_size", "4",
"--max_input_len", "4096",
"--max_output_len", "258",
"--model_dir", "./llama-2-7b-chat-hf",
"--dtype", "float16",
"--remove_input_padding",
"--use_gpt_attention_plugin", "float16",
"--enable_context_fmha",
"--use_gemm_plugin", "float16",
"--use_inflight_batching",
"--paged_kv_cache",
"--use_weight_only",
"--weight_only_precision", "int4",
"--output_dir", "./tmp/llama/7B/trt_engines/int4_inflightb_btsz4/1-gpu/"
]
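For anyone reproducing: those args correspond to the TensorRT-LLM llama example build. Assuming examples/llama/build.py from the matching TensorRT-LLM release, the equivalent invocation would be roughly:

python examples/llama/build.py \
    --model_dir ./llama-2-7b-chat-hf \
    --dtype float16 \
    --max_batch_size 4 \
    --max_input_len 4096 \
    --max_output_len 258 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_inflight_batching \
    --paged_kv_cache \
    --use_weight_only \
    --weight_only_precision int4 \
    --output_dir ./tmp/llama/7B/trt_engines/int4_inflightb_btsz4/1-gpu/

The tensorrt_llm model config.pbtxt was: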
max_batch_size: 4
model_transaction_policy {
decoupled: False
}
parameters: {
key: "gpt_model_type"
value: {
string_value: "inflight_fused_batching"
}
}
parameters: {
key: "gpt_model_path"
value: {
string_value: "/workspace/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
}
}
parameters: {
key: "batch_scheduler_policy"
value: {
string_value: "guaranteed_no_evict"
}
}
parameters: {
key: "kv_cache_free_gpu_mem_fraction"
value: {
string_value: "0.85"
}
}
parameters: {
key: "enable_trt_overlap"
value: {
string_value: "True"
}
}
ab -c 4 -n 4 -p data.json localhost:8000/v2/models/ensemble/generate
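Here data.json is the POST body for the ensemble generate endpoint; the actual file is linked later in this thread. A minimal example of its shape (field names per the tensorrtllm_backend generate frontend; the prompt and token budget here are illustrative) would be:

{
  "text_input": "What is machine learning?",
  "max_tokens": 258,
  "bad_words": "",
  "stop_words": ""
}

With 4 concurrent requests against the max_batch_size=4 engine, the server then fails with: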
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1 0x7f155e81e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f155e81e045]
2 0x7f155e87fa8a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x97a8a) [0x7f155e87fa8a]
3 0x7f155e84c821 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64821) [0x7f155e84c821]
4 0x7f155e8515c7 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x695c7) [0x7f155e8515c7]
5 0x7f155e83b241 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x53241) [0x7f155e83b241]
6 0x7f155e83c38a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5438a) [0x7f155e83c38a]
7 0x7f16df264253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f16df264253]
8 0x7f16deff4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f16deff4ac3]
9 0x7f16df085bf4 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 1681692778: Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1 0x7f155e81e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f155e81e045]
2 0x7f155e87fa8a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x97a8a) [0x7f155e87fa8a]
3 0x7f155e84c821 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64821) [0x7f155e84c821]
4 0x7f155e8515c7 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x695c7) [0x7f155e8515c7]
5 0x7f155e83b241 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x53241) [0x7f155e83b241]
6 0x7f155e83c38a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5438a) [0x7f155e83c38a]
7 0x7f16df264253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f16df264253]
8 0x7f16deff4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f16deff4ac3]
9 0x7f16df085bf4 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 1714636916: Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
...
@byshiue It seems to succeed if I switch off inflight batching (build without --use_inflight_batching and configure Triton accordingly).
Hi @byshiue, have you been able to replicate it following col-in-coding's steps?
I tried to reproduce this issue with your config; here are the results I get:
bhsueh@75653932fc7c:/home/scratch.bhsueh_sw_1/workspace/TensorRT-LLM/tllm_backend_nvbug$
I1121 07:22:22.905217 1241 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f1c48000000' with size 268435456
I1121 07:22:22.908173 1241 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1121 07:22:22.971625 1241 model_lifecycle.cc:461] loading: preprocessing:1
I1121 07:22:22.972690 1241 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1121 07:22:22.973734 1241 model_lifecycle.cc:461] loading: postprocessing:1
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 4
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 4
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 3595 MiB
I1121 07:22:24.568350 1241 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1121 07:22:24.617528 1241 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +64, now: CPU 3886, GPU 4319 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +4, GPU +72, now: CPU 3890, GPU 4391 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +3591, now: CPU 0, GPU 3591 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +64, now: CPU 3918, GPU 5551 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +64, now: CPU 3919, GPU 5615 (MiB)
I1121 07:22:25.232960 1241 model_lifecycle.cc:818] successfully loaded 'preprocessing'
I1121 07:22:25.247506 1241 model_lifecycle.cc:818] successfully loaded 'postprocessing'
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3591 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +64, now: CPU 3953, GPU 5717 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +70, now: CPU 3953, GPU 5787 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3591 (MiB)
[TensorRT-LLM][INFO] Using 8192 tokens in paged KV cache.
[TensorRT-LLM][WARNING] max_num_sequences is smaller than 2 times the engine max_batch_size. Batches smaller than max_batch_size will be executed.
I1121 07:22:25.718881 1241 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
I1121 07:22:25.720711 1241 model_lifecycle.cc:461] loading: ensemble:1
I1121 07:22:25.721117 1241 model_lifecycle.cc:818] successfully loaded 'ensemble'
I1121 07:22:25.721209 1241 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I1121 07:22:25.721291 1241 server.cc:619]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max |
| | | -batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max |
| | | -batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------+
I1121 07:22:25.721355 1241 server.cc:662]
+----------------+---------+--------+
| Model | Version | Status |
+----------------+---------+--------+
| ensemble | 1 | READY |
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | READY |
+----------------+---------+--------+
I1121 07:22:25.760708 1241 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA H100 NVL
I1121 07:22:25.760858 1241 metrics.cc:710] Collecting CPU metrics
I1121 07:22:25.760946 1241 tritonserver.cc:2458]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters s |
| | tatistics trace logging |
| model_repository_path[0] | all_models/inflight_batcher_llm/ |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I1121 07:22:25.764392 1241 grpc_server.cc:2513] Started GRPCInferenceService at 0.0.0.0:8001
I1121 07:22:25.764549 1241 http_server.cc:4497] Started HTTPService at 0.0.0.0:8000
I1121 07:22:25.805388 1241 http_server.cc:270] Started Metrics Service at 0.0.0.0:8002
bhsueh@75653932fc7c:/home/scratch.bhsueh_sw_1/workspace/TensorRT-LLM/tllm_backend_nvbug$ wget https://github.com/triton-inference-server/tensorrtllm_backend/files/13236039/data.json
--2023-11-21 07:22:39-- https://github.com/triton-inference-server/tensorrtllm_backend/files/13236039/data.json
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-repository-file-5c1aeb/690665848/13236039?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231121%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231121T072239Z&X-Amz-Expires=300&X-Amz-Signature=b2524a4618993a56a867efaf05d7cb9331c2393f4e3aa3986c45e613eb19a717&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=690665848&response-content-disposition=attachment%3Bfilename%3Ddata.json&response-content-type=application%2Fjson [following]
--2023-11-21 07:22:40-- https://objects.githubusercontent.com/github-production-repository-file-5c1aeb/690665848/13236039?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231121%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231121T072239Z&X-Amz-Expires=300&X-Amz-Signature=b2524a4618993a56a867efaf05d7cb9331c2393f4e3aa3986c45e613eb19a717&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=690665848&response-content-disposition=attachment%3Bfilename%3Ddata.json&response-content-type=application%2Fjson
Resolving objects.githubusercontent.com (objects.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to objects.githubusercontent.com (objects.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8347 (8.2K) [application/json]
Saving to: 'data.json'
data.json 100%[==============================================================================================================================>] 8.15K --.-KB/s in 0.004s
2023-11-21 07:22:40 (1.97 MB/s) - 'data.json' saved [8347/8347]
bhsueh@75653932fc7c:/home/scratch.bhsueh_sw_1/workspace/TensorRT-LLM/tllm_backend_nvbug$ sudo apt-get install apache2-utils
[sudo] password for bhsueh:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libapr1 libaprutil1
The following NEW packages will be installed:
apache2-utils libapr1 libaprutil1
0 upgraded, 3 newly installed, 0 to remove and 33 not upgraded.
Need to get 290 kB of archives.
After this operation, 992 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 libapr1 amd64 1.7.0-8ubuntu0.22.04.1 [108 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 libaprutil1 amd64 1.6.1-5ubuntu4.22.04.2 [92.8 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 apache2-utils amd64 2.4.52-1ubuntu4.6 [89.1 kB]
Fetched 290 kB in 1s (263 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 3.)
debconf: falling back to frontend: Readline
Selecting previously unselected package libapr1:amd64.
(Reading database ... 25870 files and directories currently installed.)
Preparing to unpack .../libapr1_1.7.0-8ubuntu0.22.04.1_amd64.deb ...
Unpacking libapr1:amd64 (1.7.0-8ubuntu0.22.04.1) ...
Selecting previously unselected package libaprutil1:amd64.
Preparing to unpack .../libaprutil1_1.6.1-5ubuntu4.22.04.2_amd64.deb ...
Unpacking libaprutil1:amd64 (1.6.1-5ubuntu4.22.04.2) ...
Selecting previously unselected package apache2-utils.
Preparing to unpack .../apache2-utils_2.4.52-1ubuntu4.6_amd64.deb ...
Unpacking apache2-utils (2.4.52-1ubuntu4.6) ...
Setting up libapr1:amd64 (1.7.0-8ubuntu0.22.04.1) ...
Setting up libaprutil1:amd64 (1.6.1-5ubuntu4.22.04.2) ...
Setting up apache2-utils (2.4.52-1ubuntu4.6) ...
Processing triggers for libc-bin (2.35-0ubuntu3.3) ...
bhsueh@75653932fc7c:/home/scratch.bhsueh_sw_1/workspace/TensorRT-LLM/tllm_backend_nvbug$ ab -c 4 -n 4 -p data.json localhost:8000/v2/models/ensemble/generate
This is ApacheBench, Version 2.3 <$Revision: 1879490 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient).....done
Server Software:
Server Hostname: localhost
Server Port: 8000
Document Path: /v2/models/ensemble/generate
Document Length: 9828 bytes
Concurrency Level: 4
Time taken for tests: 5.787 seconds
Complete requests: 4
Failed requests: 2
(Connect: 0, Receive: 0, Length: 2, Exceptions: 0)
Total transferred: 39790 bytes
Total body sent: 34020
HTML transferred: 39346 bytes
Requests per second: 0.69 [#/sec] (mean)
Time per request: 5786.643 [ms] (mean)
Time per request: 1446.661 [ms] (mean, across all concurrent requests)
Transfer rate: 6.72 [Kbytes/sec] received
5.74 kb/s sent
12.46 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 1772 2800 923.1 2775 4014
Waiting: 1772 2800 923.2 2774 4014
Total: 1772 2800 923.1 2775 4014
Percentage of the requests served within a certain time (ms)
50% 2775
66% 2775
75% 4014
80% 4014
90% 4014
95% 4014
98% 4014
99% 4014
100% 4014 (longest request)
Will continue investigating when I have bandwidth.
Same issue here. Serving Baichuan2-7B with world_size=2, configured as follows:
max_batch_size: 16
parameters: {
key: "gpt_model_type"
value: {
string_value: "inflight_fused_batching"
}
}
parameters: {
key: "batch_scheduler_policy"
value: {
string_value: "guaranteed_no_evict"
}
}
parameters: {
key: "max_num_sequences"
value: {
string_value: "32"
}
}
parameters: {
key: "enable_trt_overlap"
value: {
string_value: "True"
}
}
Any updates? Has the problem been solved?
Hi @zhaocc1106, changing enable_trt_overlap to False seems to work for me. You can give it a try :)
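Concretely, that means flipping this block in the tensorrt_llm model's config.pbtxt (the same block shown in the configs above, with the value changed):

parameters: {
  key: "enable_trt_overlap"
  value: {
    string_value: "False"
  }
}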
Thanks very much!!!
This just fixed the same issue for me! Thank you!
What is the intuition on why this worked?
Inference failed when using ab with concurrency greater than 3, but was OK with concurrency 1 or 2. Using an A10G GPU with driver version 545.23.06, CUDA 12.3, TensorRT 9.1, and Vicuna-13B-v1.5-16K. Is there any workaround?
[TensorRT-LLM][WARNING] Step function failed, continuing.
2023-11-01 10:20:21,882 INFO totally input 7 tokens
[TensorRT-LLM][ERROR] Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1 0x7f2d2e81e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f2d2e81e045]
2 0x7f2d2e87fa8a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x97a8a) [0x7f2d2e87fa8a]
3 0x7f2d2e84c821 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64821) [0x7f2d2e84c821]
4 0x7f2d2e8515c7 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x695c7) [0x7f2d2e8515c7]
5 0x7f2d2e83b241 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x53241) [0x7f2d2e83b241]
6 0x7f2d2e83c38a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5438a) [0x7f2d2e83c38a]
7 0x7f2d92e64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f2d92e64253]
8 0x7f2d92bf4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f2d92bf4ac3]
9 0x7f2d92c85bf4 clone + 68
[TensorRT-LLM][ERROR] Encountered error for requestId 760313751: Encountered an error in forward function: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaEventSynchronize(get()): an illegal memory access was encountered (/app/tensorrt_llm/cpp/include/tensorrt_llm/runtime/cudaEvent.h:66)
1 0x7f2d2e81e045 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x36045) [0x7f2d2e81e045]
2 0x7f2d2e87fa8a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x97a8a) [0x7f2d2e87fa8a]
3 0x7f2d2e84c821 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x64821) [0x7f2d2e84c821]
4 0x7f2d2e8515c7 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x695c7) [0x7f2d2e8515c7]
5 0x7f2d2e83b241 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x53241) [0x7f2d2e83b241]
6 0x7f2d2e83c38a /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x5438a) [0x7f2d2e83c38a]
7 0x7f2d92e64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f2d92e64253]
8 0x7f2d92bf4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f2d92bf4ac3]
9 0x7f2d92c85bf4 clone + 68
[TensorRT-LLM][WARNING] Step function failed, continuing.