triton-inference-server / fastertransformer_backend


Dynamic batching does not work in decoupled model #149

Closed: safehumeng closed this issue 1 year ago

safehumeng commented 1 year ago

Description

Using r22.12 with the following settings in the model config:

model_transaction_policy {
  decoupled: True
}
dynamic_batching {
  max_queue_delay_microseconds: 5000000
}

Sending two requests, the effective batch size is 1: the next request has to wait for the previous request to finish before it starts.

But with decoupled: False, sending two requests gives an effective batch size of 2, and both requests return simultaneously.
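For reference, a minimal sketch of sending two concurrent requests to a decoupled model over Triton's gRPC streaming API with the tritonclient Python package; the model name ("fastertransformer"), input tensor name, shape and dtype ("input_ids", [1, 8], UINT32) are placeholders and have to match the actual model config:

import time
import numpy as np
import tritonclient.grpc as grpcclient

# Called once per streamed response (a decoupled model may return several
# responses per request).
def callback(result, error):
    if error is not None:
        print("error:", error)
    else:
        print("response for request id:", result.get_response().id)

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.start_stream(callback=callback)

# Send both requests back to back so the scheduler gets a chance to batch them
# (max_queue_delay_microseconds: 5000000 gives it up to 5 s to form a batch).
for i in range(2):
    inp = grpcclient.InferInput("input_ids", [1, 8], "UINT32")  # placeholder tensor
    inp.set_data_from_numpy(np.zeros((1, 8), dtype=np.uint32))
    client.async_stream_infer(model_name="fastertransformer",
                              inputs=[inp],
                              request_id=str(i))

time.sleep(10)  # crude wait for the streamed responses before closing the stream
client.stop_stream()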

Reproduction Steps

docker pull nvcr.io/nvidia/tritonserver:22.12-py3
# The remaining steps are run inside the 22.12 container, with the
# fastertransformer_backend sources checked out there.
rm -rf /opt/tritonserver/lib/cmake/FasterTransformer/  # remove the original library
cd fastertransformer_backend
mkdir -p build && cd build && \
    cmake \
      -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
      -D CMAKE_BUILD_TYPE=Release \
      -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
      -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      .. && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install
tritonserver --model-repository=${model_path}
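One way to check whether the two requests were actually batched is Triton's Prometheus metrics endpoint (port 8002 by default): nv_inference_request_success counts completed requests, while nv_inference_exec_count counts model executions, and a dynamically formed batch counts as a single execution. A small sketch, assuming the server runs locally with default ports:

import re
import urllib.request

# With dynamic batching in effect, two concurrent requests should show up as
# +2 on nv_inference_request_success / nv_inference_count but only +1 on
# nv_inference_exec_count for the model.
metrics = urllib.request.urlopen("http://localhost:8002/metrics").read().decode()
for line in metrics.splitlines():
    if re.match(r"nv_inference_(request_success|count|exec_count)\{", line):
        print(line)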
safehumeng commented 1 year ago

Using r23.05 works.