The problem: with "dynamic_batching" enabled, Triton inference server sometimes doesn't respond properly and logging "response is nullptr" several times, and sometimes crash.
The model is a pretty standard BERT model, downloaded from here and then converted with the huggingface_bert_convert.py script. The model is deployed with Triton Inference Server with the FasterTransformer backend (ft_backend) enabled.
In config.pbtxt, the input sequence length is fixed at 384 and the hidden state dimension at 768 (per bert_model.config).
perf_analyzer is used for benchmarking with custom generated data. The issue occurs when "dynamic_batching {}" is added to config.pbtxt and concurrency >= 3 (with some randomness).
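For reference, a minimal config.pbtxt sketch of the setup described above. Only the 384x768 dims, dynamic_batching {}, the is_remove_padding parameter, and the input names (taken from the test-data file names below) come from this report; the model name, max_batch_size, data types, and output name are assumptions for illustration.

# Hypothetical config.pbtxt; values marked "assumed" are not from the report.
name: "bert"                     # assumed model name
backend: "fastertransformer"
max_batch_size: 8                # assumed
input [
  {
    name: "input_hidden_state"
    data_type: TYPE_FP16         # crash trace suggests an FP16 (__half) model
    dims: [ 384, 768 ]
  },
  {
    name: "sequence_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
output [
  {
    name: "output_hidden_state"  # assumed output name
    data_type: TYPE_FP16
    dims: [ 384, 768 ]
  }
]
dynamic_batching {}              # enabling this triggers the issue
parameters {
  key: "is_remove_padding"
  value: { string_value: "1" }   # 1 may crash the server; 0 only logs "response is nullptr"
}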
Another observation: the server may crash if is_remove_padding equals 1, with the following output:
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/triton_backend/bert/BertTritonModelInstance.cc:106
Signal (6) received.
0# 0x000055D1DC9BF459 in /opt/tritonserver/bin/tritonserver
1# 0x00007F816A83A090 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
4# 0x00007F816ABF3911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
5# 0x00007F816ABFF38C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007F816ABFF3F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
7# 0x00007F816ABFF6A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
8# void fastertransformer::check<cudaError>(cudaError, char const*, char const*, int) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
9# BertTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
10# 0x00007F8160043BA2 in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
11# 0x00007F816AC2BDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
12# 0x00007F816BFA3609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
If is_remove_padding equals 0, the server doesn't crash but only outputs multiple lines of "response is nullptr", and perf_analyzer logs the following before it quits:
Request concurrency: 5
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [1] had error: pinned buffer: failed to perform CUDA copy: invalid argument
Thread [2] had error: pinned buffer: failed to perform CUDA copy: invalid argument
Thread [4] had error: pinned buffer: failed to perform CUDA copy: invalid argument
This Python script was used to generate the random test data:
import numpy as np

# One request's hidden-state input: shape [seq_len, hidden_dim] = [384, 768].
# The dtype must match the data_type declared in config.pbtxt
# (use np.float16 for an FP16 model).
data = np.random.random([384, 768]).astype(np.float32)  # or np.float16
data.tofile('input_hidden_state')

# The matching sequence length for this request.
L = np.asarray([384]).astype(np.int32)
L.tofile('sequence_lengths')
The generated files were put in the mockdata folder, and then perf_analyzer was run (see the command under Reproduced Steps below).
System info
GPU: GeForce RTX 2080Ti
GPU Driver: 525.60.11
Docker: 19.03.13
nvidia-container-runtime:
nvidia-container-cli:
branch: main
commit: 69a397e2b0d9c74d841bd00d6ccdb6410da6316f
docker base image: nvcr.io/nvidia/tritonserver:22.12-py3
Reproduced Steps
To prepare the model, download the BERT model and convert it with the following command.
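A sketch of the conversion step, assuming a local Hugging Face checkpoint directory and the converter's usual location in the FasterTransformer repo; the paths, tensor-parallel size, and weight dtype here are assumptions, so verify the flags against your checkout.

# Hypothetical paths; adjust to your layout.
python3 FasterTransformer/examples/pytorch/bert/utils/huggingface_bert_convert.py \
    -in_file ./bert-base-uncased/ \
    -saved_dir ./models/bert/1/ \
    -infer_tensor_para_size 1 \
    -weight_data_type fp16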
Build the Docker image:
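A sketch of the build, assuming the fastertransformer_backend repo's docker/Dockerfile and the 22.12 base image listed under System info; the image tag is an assumption.

# Run from the fastertransformer_backend checkout (branch main, commit above).
docker build --rm \
    --build-arg TRITON_VERSION=22.12 \
    -t triton_with_ft:22.12 \
    -f docker/Dockerfile .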
Run the server:
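A sketch of launching the container, assuming the image tag from the previous step and a local models/ model repository; the ports and mount paths are assumptions.

docker run -it --rm --gpus all --shm-size=1g \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v $(pwd)/models:/models \
    triton_with_ft:22.12 \
    tritonserver --model-repository=/models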
Generate random test data with the Python script shown above, put the output files in the mockdata folder, and then run perf_analyzer.
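A sketch of the benchmarking command, assuming the model is named bert; perf_analyzer's directory form of --input-data expects one binary file per input, named after the input, which matches the input_hidden_state and sequence_lengths files above. The model name and concurrency range are assumptions (the report only states the issue appears at concurrency >= 3).

perf_analyzer -m bert \
    --input-data mockdata \
    --concurrency-range 1:5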