triton-inference-server / fastertransformer_backend


Throughput (requests per second / RPS) not increasing when scaling up from 1 GPU to 4 GPUs #163

Open chunyat opened 1 year ago

chunyat commented 1 year ago

Really appreciate the awesome work by the team - I have managed to get almost a 100x speedup so far with the fastertransformer_backend on triton compared to plain PyTorch with a fine-tuned T5-base model. This jupyter notebook from the team was a great reference in helping me achieve that.

I used a locust load testing script to hit the triton inference server directly with a couple of binary files that I created from an original set of raw sample texts (in actual usage these requests would be generated by the python script I use to query the triton inference server), just to measure the throughput I am getting from the triton server.
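Roughly, the locust script looks like the sketch below (the file names, the header length and the endpoint path are placeholders, and I am assuming triton's HTTP binary tensor format for the pre-generated request bodies):

import random
from locust import HttpUser, task, between

# Pre-generated binary request bodies for triton's HTTP inference API
# (file names here are placeholders).
PAYLOADS = []
for name in ["sample_request_0.bin", "sample_request_1.bin"]:
    with open(name, "rb") as f:
        PAYLOADS.append(f.read())

# Size of the JSON header portion inside each binary request body;
# hard-coded here purely for illustration.
HEADER_LENGTH = 1234


class TritonUser(HttpUser):
    wait_time = between(0.0, 0.1)

    @task
    def infer(self):
        # Fire a pre-serialized inference request at the model endpoint.
        body = random.choice(PAYLOADS)
        self.client.post(
            "/v2/models/fastertransformer/infer",
            data=body,
            headers={
                "Inference-Header-Content-Length": str(HEADER_LENGTH),
                "Content-Type": "application/octet-stream",
            },
        )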

On a server with a single Tesla T4 GPU, after a little tuning (loading 2 model instances on that single GPU and turning on dynamic batching with a small max queue delay), I managed to get a throughput of about 8 RPS (requests per second).

The config.pbtxt used in the single GPU server is as follows:

# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "t5"
max_batch_size: 96
input [
  {
    name: "input_ids"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "is_return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "max_output_len"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "beam_width"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "start_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_UINT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 2
    kind : KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "is_half"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "T5"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "./triton-model-store/t5/fastertransformer/ft_models_fp16/1-gpu/"
  }
}
dynamic_batching {
  max_queue_delay_microseconds: 500
}
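
For reference, my python client essentially builds requests against the inputs above along these lines (a rough sketch with dummy token IDs, only the required inputs shown, default HTTP port assumed):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# Dummy tokenized input; shapes include the batch dimension since
# max_batch_size > 0 in the config above.
input_ids = np.array([[37, 423, 55, 1]], dtype=np.uint32)   # [batch, seq_len]
sequence_length = np.array([[4]], dtype=np.uint32)          # [batch, 1]
max_output_len = np.array([[64]], dtype=np.uint32)          # [batch, 1]

inputs = []
for name, arr in [("input_ids", input_ids),
                  ("sequence_length", sequence_length),
                  ("max_output_len", max_output_len)]:
    tensor = httpclient.InferInput(name, list(arr.shape), "UINT32")
    tensor.set_data_from_numpy(arr)
    inputs.append(tensor)

result = client.infer("fastertransformer", inputs)
print(result.as_numpy("output_ids"))
print(result.as_numpy("sequence_length"))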

As I am expecting to deal with an even larger workload from more concurrent users, I tried scaling up to a server with 4 Tesla T4 GPUs and re-converted the PyTorch model to a FasterTransformer model for the 4 GPU setup using:

python3 ./FasterTransformer/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py -i ./models/pt_models/ -o ./triton-model-store/t5/fastertransformer/ft_models_fp16/ -i_g 4 -weight_data_type fp16

I also modified the following config.pbtxt parameters:

instance_group [
  {
    count: 1
    kind : KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "4"
  }
}
...
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "./triton-model-store/t5/fastertransformer/ft_models_fp16/4-gpu/"
  }
}

and started up the triton inference server with:

CUDA_VISIBLE_DEVICES=0,1,2,3 /opt/server-2.34.0/build/triton-server/tritonserver --model-repository=./triton-model-store/t5

(Yes, I am running a local build of triton-server instead of the pre-built docker image for both the single GPU and 4 GPU instances due to certain restrictions.)
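For completeness, after startup I do a quick sanity check that the server and the model are live, along these lines (assuming the default HTTP port 8000):

import tritonclient.http as httpclient

# Readiness check against the default HTTP endpoint.
client = httpclient.InferenceServerClient("localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready :", client.is_model_ready("fastertransformer"))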

One would expect at least some speedup just by virtue of having more GPUs (even before any proper tuning), so you can imagine my surprise when I found that the RPS was still stuck at about 8 - essentially no different from running on the 1 GPU server.

I then tried increasing the model instance count per GPU in config.pbtxt from 1 to 2 to match the single GPU server config, but ran into another issue: GPU utilization on all 4 GPUs would very quickly jump to 100%, and the tritonserver process on the CPU side would also jump to 100%. It would then get stuck spinning at 100% utilization without actually returning any responses.

Coming across this comment in issue #34, I wonder whether there is a known issue with fastertransformer_backend or triton itself not being able to run multiple model instances on a multi-GPU server when the GPUs (e.g. Tesla T4) do not natively support P2P GPU connections - and whether this is also part of the reason why the throughput on the 4 GPU server is little to no different from the throughput on the 1 GPU server.
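For what it is worth, a quick way to check the P2P situation on the 4 GPU server is something like the sketch below (torch-based; nvidia-smi topo -m shows similar information). I have not confirmed that missing P2P is actually the cause here:

import torch

# Print pairwise peer-to-peer access between all visible GPUs.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'NOT available'}")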

Would the team or anyone else have any clues as to why the multi-GPU resources are not being fully utilized in this case, and why I am having issues running multiple instances of the same T5 model on each GPU in a multi-GPU server?

Thanks in advance!