triton-inference-server / fastertransformer_backend


Not getting response with warning "response is nullptr" #80

Open t13m opened 1 year ago

t13m commented 1 year ago

Description

The problem: with dynamic_batching enabled, the Triton Inference Server sometimes fails to respond, logs "response is nullptr" several times, and sometimes crashes.

The model is a standard BERT model, downloaded from here and converted with the huggingface_bert_convert.py script. It is deployed on the Triton Inference Server with the FasterTransformer backend enabled.

In config.pbtxt, the input sequence length is set to a fixed value of 384 and the hidden-state dimension to 768 (matching the BERT model's config).

perf_analyzer is used for benchmarking with custom generated data. The issue appears when "dynamic_batching {}" is added to config.pbtxt and concurrency >= 3 (with some randomness).

Another observation: the server may crash if is_remove_padding is set to 1, with the following output:

terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/triton_backend/bert/BertTritonModelInstance.cc:106

Signal (6) received.
 0# 0x000055D1DC9BF459 in /opt/tritonserver/bin/tritonserver
 1# 0x00007F816A83A090 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /usr/lib/x86_64-linux-gnu/libc.so.6
 3# abort in /usr/lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F816ABF3911 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007F816ABFF38C in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007F816ABFF3F7 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007F816ABFF6A9 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 8# void fastertransformer::check<cudaError>(cudaError, char const*, char const*, int) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
 9# BertTritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
10# 0x00007F8160043BA2 in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
11# 0x00007F816AC2BDE4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
12# 0x00007F816BFA3609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
13# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

If is_remove_padding is set to 0, the server doesn't crash but only prints multiple lines of "response is nullptr", and perf_analyzer logs the following output before it quits.

Request concurrency: 5
Failed to maintain requested inference load. Worker thread(s) failed to generate concurrent requests.
Thread [1] had error: pinned buffer: failed to perform CUDA copy: invalid argument
Thread [2] had error: pinned buffer: failed to perform CUDA copy: invalid argument
Thread [4] had error: pinned buffer: failed to perform CUDA copy: invalid argument

System info

Reproduced Steps

To prepare the model, download the BERT model and convert it with the following commands.

# assume the model files are in /workspace/models/
# rename bert_config.json to config.json, then append a "bert_type" field to it
mv /workspace/models/{bert_,}config.json
sed -i 's/}/  "bert_type":"bert"\n}/' /workspace/models/config.json
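
For reference, the same config.json edit can be done with Python's json module instead of sed (a sketch; it assumes /workspace/models/config.json is a standard Hugging Face JSON config):

import json

path = "/workspace/models/config.json"
with open(path) as f:
    cfg = json.load(f)
# same key the sed command above appends
cfg["bert_type"] = "bert"
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)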

python3 /workspace/FasterTransformer/examples/pytorch/bert/utils/huggingface_bert_convert.py \
 -infer_tensor_para_size 1 \
 -in_file /workspace/models/ \
 -saved_dir /workspace/models_out

mkdir -p triton-model-store/bert-fp32/
mv /workspace/models_out triton-model-store/bert-fp32/1

cat <<EOF > triton-model-store/bert-fp32/config.pbtxt
name: "bert-fp32"
backend: "fastertransformer"
default_model_filename: "bert"
max_batch_size: 1024
input [
  {
    name: "input_hidden_state"
    dims: [ 384, 768 ]
  },
  {
    name: "sequence_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "output_hidden_state"
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp32"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "bert"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "triton-model-store/bert-fp32/1/1-gpu/"
  }
}
parameters {
  key: "int8_mode"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "is_sparse"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "is_remove_padding"
  value: {
    string_value: "0"
  }
}
dynamic_batching {
}
EOF

Build the Docker image:

git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend
export WORKSPACE=$(pwd)
export CONTAINER_VERSION=22.12
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}

docker build --rm   \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION}   \
    -t ${TRITON_DOCKER_IMAGE} \
    -f docker/Dockerfile \
    .

Run the server:

docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --gpus=all --shm-size=4G  -v $(pwd):/ft_workspace --network host --name nv_ft triton_with_ft:22.12 bash

cd /ft_workspace

CUDA_VISIBLE_DEVICES=0 /opt/tritonserver/bin/tritonserver --model-repository=/ft_workspace/triton-model-store
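
Once the server is up, readiness can be checked from Python with the Triton client (a sketch; it assumes the default HTTP port 8000 and the tritonclient package installed):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("bert-fp32"))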

This Python script was used to generate random test data:

import numpy as np

# one sample: a [384, 768] float32 hidden state and its sequence length
data = np.random.random([384, 768]).astype(np.float32)  # use np.float16 for fp16
data.tofile('input_hidden_state')
L = np.asarray([384]).astype(np.int32)
L.tofile('sequence_lengths')
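
As a sanity check that the binary files hold the intended tensors, they can be read back with numpy (a sketch; dtypes and shapes follow the generation script above):

import numpy as np

hidden = np.fromfile('input_hidden_state', dtype=np.float32).reshape(384, 768)
seq_len = np.fromfile('sequence_lengths', dtype=np.int32)
print(hidden.shape, hidden.dtype)  # (384, 768) float32
print(seq_len)                     # [384]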

The generated data was put into a mockdata folder, and perf_analyzer was run:

/path/to/perf_analyzer -m bert-fp32 --input-data ./mockdata/ --concurrency-range 1:10
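
For comparison, a single request can also be sent directly with the Triton Python client, bypassing perf_analyzer (a sketch; shapes include the batch dimension because max_batch_size is set in config.pbtxt, and the server is assumed to listen on localhost:8000):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# one request with batch size 1
hidden = np.random.random([1, 384, 768]).astype(np.float32)
seq_len = np.asarray([[384]], dtype=np.int32)

inputs = [
    httpclient.InferInput("input_hidden_state", list(hidden.shape), "FP32"),
    httpclient.InferInput("sequence_lengths", list(seq_len.shape), "INT32"),
]
inputs[0].set_data_from_numpy(hidden)
inputs[1].set_data_from_numpy(seq_len)

outputs = [httpclient.InferRequestedOutput("output_hidden_state")]
result = client.infer("bert-fp32", inputs, outputs=outputs)
print(result.as_numpy("output_hidden_state").shape)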
byshiue commented 1 year ago

Can you reproduce this issue with a Python example? You may not actually be using the data you generate.