triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

How to use different max_batch_size configurations per version? #168

Closed JungukCho closed 5 years ago

JungukCho commented 5 years ago

Hi,

I have three plan files for --model-store. The directory looked like this:

$ tree
|-- 1
|   `-- model.plan
|-- 2
|   `-- model.plan
|-- 3
|   `-- model.plan
|-- config.pbtxt
`-- imagenet_labels_1001.txt

The model.plan files in directories 1 and 2 were generated to support 1024-batch requests, and the model.plan file in directory 3 was generated to support 256-batch requests.

This is my config.pbtxt

$ cat config.pbtxt
name: "plan_model"
platform: "tensorrt_plan"
max_batch_size: 1024
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 299, 299 ]
  }
]
output [
  {
    name: "InceptionV3/Predictions/Reshape_1"
    name: "InceptionV3/Logits/SpatialSqueeze"
    data_type: TYPE_FP32
    dims: [ 1001, 1, 1 ]
    label_filename: "imagenet_labels_1001.txt"
  }
]
version_policy: { all { } }

When I ran "trtserver", it showed this error:

Loading servable: {name: plan_model version: 3} failed: Invalid argument: unexpected configuration maximum batch size 1024 for 'plan_model', model maximum is 256
I0320 01:11:30.511115 6306 loader_harness.cc:154] Encountered an error for servable version {name: plan_model version: 3}: Invalid argument: unexpected configuration maximum batch size 1024 for 'plan_model', model maximum is 256
E0320 01:11:30.511127 6306 aspired_versions_manager.cc:358] Servable {name: plan_model version: 3} cannot be loaded: Invalid argument: unexpected configuration maximum batch size 1024 for 'plan_model', model maximum is 256

Can I specify batch size per version? If so, would you please let me know how to specify it in config.pbtxt?

GuanLuo commented 5 years ago

TRTIS does not support specifying batch size per version. You will need to separate version 3 as a different model so that you can specify a different batch size for it.
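For reference, a minimal sketch of what that split repository could look like (the second model name, plan_model_256, is hypothetical; any name works as long as it matches the name field in that model's config.pbtxt):

$ tree model_store
|-- plan_model
|   |-- 1
|   |   `-- model.plan            # 1024-batch engine
|   |-- 2
|   |   `-- model.plan            # 1024-batch engine
|   |-- config.pbtxt              # name: "plan_model", max_batch_size: 1024
|   `-- imagenet_labels_1001.txt
`-- plan_model_256                # hypothetical name for the former version 3
    |-- 1
    |   `-- model.plan            # 256-batch engine
    |-- config.pbtxt              # name: "plan_model_256", max_batch_size: 256
    `-- imagenet_labels_1001.txt

Each model then carries its own config.pbtxt, so the 256-batch engine no longer has to satisfy the 1024 limit declared for the other versions.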

JungukCho commented 5 years ago

Hi, GuanLuo.

Thank you for the reply. If I put them in different models, it works well. I just wonder whether there is a fundamental reason for preventing different batch sizes per version?

I have some other questions. I used TensorRT 5.1.2 and CUDA 10.1 with a V100.

1. I converted a TensorFlow model to a plan model using convert_plan.py (https://github.com/NVIDIA-AI-IOT/tf_to_trt_image_classification#convert-frozen-graph-to-tensorrt-engine).

However, when I used a high number for max_batch_size (e.g., 2000) as an argument, it showed this error:

$ python3 scripts/convert_plan.py data/frozen_graphs/inception_v3_2016_08_28_frozen.pb data/plans/inception_v3.plan input 299 299 InceptionV3/Predictions/Reshape_1 2000 0 float

UFFParser: Applying order forwarding to: InceptionV3/Predictions/Reshape_1
UFFParser: parsing MarkOutput_0
UFFParser: Applying order forwarding to: MarkOutput_0
Tensor: InceptionV3/InceptionV3/Conv2d_2b_3x3/convolution at max batch size of 2000 exceeds the maximum element count of 2147483647

Do you know why it showed this error and how to calculate the maximum batch size? Also, do you know what the "max workspace size" argument in convert_plan.py is?

  1. The "max_batch_size" in inference server configuration file depends on the argument (i.e., max batch size) when I run convert_plan.py file. Even though I set "dynamic batcher" as scheduling_choice, the max_batch_size used in "convert_plan" is maximum boundary. For example, when I configure 1 as max_batch_size running convert_plan.py, the max_batch_size in inference server is 1. Is it right?

3. If I configure 1024 as max_batch_size in the inference server, are the 1024 inputs processed in parallel on the GPU, or just copied from host to device as one batch? If 1024 inputs are processed at the same time when max_batch_size is 1024 and the count value in instance-group is 2, does that mean 2048 inputs are computed simultaneously on the GPU?

Also, among the metrics from the inference server there is queue_duration_us. Is it the queuing time in the scheduler before executing on the GPU?

4. The V100 has Tensor Cores. Does TRTIS use Tensor Cores, or do I have to do something else? Is there a way to get specific statistics about Tensor Cores (e.g., execution time)?

Thanks,

GuanLuo commented 5 years ago

I can answer 2. and 3.

For 2.: "max batch size" in the model configuration file determines the maximum batch size that one inference request can have; in the scenario you described, you can set it to anything from 1 up to the number you used in convert_plan.py. The "dynamic batcher" is used to batch multiple inference requests into one inference to achieve maximum throughput for the model. For instance, if your model supports a maximum batch size of 16, you may set dynamic batching with a preferred batch size of 16 (or any other number that works best for your model). Then, when inference requests are sent to the server (e.g., 3 requests with batch sizes 4, 4, and 8), the inference server will combine those requests into one batch of size 16 and let the model process it (3 inferences vs. 1 inference).
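As a rough illustration (the preferred batch size of 16 and the queue delay below are placeholder values, not recommendations), the relevant part of a config.pbtxt could look like this:

max_batch_size: 16
dynamic_batching {
  preferred_batch_size: [ 16 ]      # placeholder: batch size the scheduler tries to form
  max_queue_delay_microseconds: 100 # placeholder: how long a request may wait while batching
}

The dynamic batcher then holds incoming requests for up to the queue delay, trying to assemble a batch of the preferred size before dispatching it to the model.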

For 3.: yes. And yes, it is the queuing time in the scheduler before executing on the GPU.
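For reference, the instance-group setting that question 3 refers to looks roughly like this in config.pbtxt (the count and kind values are just the ones from the question):

instance_group [
  {
    count: 2        # two execution instances of the model
    kind: KIND_GPU  # placed on the GPU
  }
]

With count: 2, two execution instances of the model are created and each instance can run a batch up to max_batch_size, which is why up to 2048 inputs can be processed on the GPU at once in the scenario described above.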

deadeyegoodwin commented 5 years ago

1: The error seems to be that, by setting your max batch size to 2000, some convolution operation in your model becomes too large and exceeds the maximum size limit allowed for the convolution. Workspace size is described in the TensorRT documentation. For example, there is some discussion here: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#build_engine_c
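A rough back-of-the-envelope check, assuming the usual InceptionV3 feature-map shape of 64 x 147 x 147 at Conv2d_2b_3x3 (an assumption about the model, not something stated in this thread): 2000 x 64 x 147 x 147 ≈ 2.77 x 10^9 elements, which is above the 2147483647 (INT32_MAX) limit in the error message, while 1024 x 64 x 147 x 147 ≈ 1.42 x 10^9 stays under it. In other words, the largest usable batch size is roughly INT32_MAX divided by the per-sample element count of the biggest tensor in the network. The max workspace size argument is a separate setting: it bounds the temporary GPU memory TensorRT layers are allowed to use, which the builder takes into account when selecting kernels, as described in the documentation linked above.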

4: Models that are being served by TRTIS can take advantage of Tensor Cores. It is up to the creator of each particular model to make sure that they perform the mixed-precision or int8 optimization required to use Tensor Cores. TRTIS itself doesn't do anything to enable or disable the use of Tensor Cores. The TensorRT documentation has a section on mixed precision: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#mixed_precision. Also, for PyTorch and TensorFlow there is recent work on AMP (automatic mixed precision), tools that help you use mixed precision in your models. Here's a blog post on TF AMP: https://devblogs.nvidia.com/nvidia-automatic-mixed-precision-tensorflow/
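Tying this back to the conversion step above (an assumption about the tf_to_trt_image_classification script rather than something stated in this thread): its last positional argument appears to select the engine precision, so if the script accepts "half" there, building the plan with "half" instead of "float" would be the kind of mixed-precision build that can use Tensor Cores, e.g.:

$ python3 scripts/convert_plan.py data/frozen_graphs/inception_v3_2016_08_28_frozen.pb data/plans/inception_v3.plan input 299 299 InceptionV3/Predictions/Reshape_1 1024 0 half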

deadeyegoodwin commented 5 years ago

Closing. Reopen if you have additional questions.