triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

BF16 support for integrated TensorRT precision mode #5959

Open BorisPolonsky opened 1 year ago

BorisPolonsky commented 1 year ago

Is your feature request related to a problem? Please describe.
Feature request: BF16 support for the integrated TensorRT precision mode.

Use case: a trained BERT-based model in ONNX format works in FP32 precision mode in Triton Inference Server. With FP16 precision, triton-inference-server raises an overflow exception:

2023-04-30 05:00:53.077756080 [E:onnxruntime:log, tensorrt_execution_provider.h:51 log] [2023-04-30 05:00:53   ERROR] 3: [weightConvertors.cpp::operator()::562] Error Code 3: Miscellaneous (Weights [name=/Constant_2_output_0 + (Unnamed Layer* 81) [Shuffle]{ForeignNode[onnx::MatMul_1532 + (Unnamed Layer* 100) [Shuffle].../encoder/layer.11/output/LayerNorm/Add_1]}] has value -3.40282e+38 outside of FP16 range. A possible fix is to retrain the model with regularization to reduce the magnitude of the weights, or if the intent is to express -infinity, use -infinity instead.)
Signal (11) received.
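The value -3.40282e+38 flagged above is the lowest representable FP32 value, typically used as an attention-mask fill, and it cannot be represented in FP16. Independent of the Triton configuration, one possible workaround is to rewrite such constants in the exported ONNX graph before serving. The following Python sketch (file names are placeholders; it assumes the offending value lives in an initializer or a Constant node, as the error suggests) clamps out-of-range FP32 constants to the FP16-representable range:

import numpy as np
import onnx
from onnx import numpy_helper

# FP16-representable range: roughly [-65504, 65504]
FP16_MIN, FP16_MAX = float(np.finfo(np.float16).min), float(np.finfo(np.float16).max)

def clamp(tensor_proto):
    # Return a clamped copy if the tensor holds FP32 values outside the FP16 range.
    arr = numpy_helper.to_array(tensor_proto)
    if arr.dtype == np.float32 and ((arr < FP16_MIN) | (arr > FP16_MAX)).any():
        clipped = np.clip(arr, FP16_MIN, FP16_MAX).astype(np.float32)
        return numpy_helper.from_array(clipped, tensor_proto.name)
    return None

model = onnx.load("model.onnx")  # placeholder path

# Clamp graph initializers.
for init in model.graph.initializer:
    new = clamp(init)
    if new is not None:
        init.CopyFrom(new)

# Clamp Constant nodes (the error above names a Constant output).
for node in model.graph.node:
    if node.op_type == "Constant":
        for attr in node.attribute:
            if attr.name == "value":
                new = clamp(attr.t)
                if new is not None:
                    attr.t.CopyFrom(new)

onnx.save(model, "model_fp16_safe.onnx")

Alternatively, following the error message's own suggestion, the mask constant could be replaced with -infinity rather than clamped; either way this only patches the exported model and does not change the served precision mode.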

The config.pbtxt looks like:

platform: "onnxruntime_onnx"

input [
    {
        name: "input_ids"
        data_type: TYPE_INT64
        dims: [-1, -1]
    },
    {
        name: "attention_mask"
        data_type: TYPE_INT64
        dims: [-1, -1]
    },
    {
        name: "token_type_ids"
        data_type: TYPE_INT64
        dims: [-1, -1]
    }
]
output [
    {
        name: "last_hidden_state"
        data_type: TYPE_FP32
        dims: [-1, -1, 768]
    },
    {
        name: "1525"
        data_type: TYPE_FP32
        dims: [-1, 768]
    }
]

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "1073741824" }
    }]
  }
}
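For reference, a minimal Python client sketch that exercises this configuration (assuming tritonclient[http], the bert-base-chinese model name from the server log below, and placeholder token IDs) might look like:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder token IDs; in practice these come from the BERT tokenizer.
input_ids = np.array([[101, 2769, 102]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)
token_type_ids = np.zeros_like(input_ids)

inputs = []
for name, data in (("input_ids", input_ids),
                   ("attention_mask", attention_mask),
                   ("token_type_ids", token_type_ids)):
    inp = httpclient.InferInput(name, list(data.shape), "INT64")
    inp.set_data_from_numpy(data)
    inputs.append(inp)

outputs = [httpclient.InferRequestedOutput("last_hidden_state"),
           httpclient.InferRequestedOutput("1525")]

result = client.infer(model_name="bert-base-chinese", inputs=inputs, outputs=outputs)
print(result.as_numpy("last_hidden_state").shape)  # (1, sequence_length, 768)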

To address the overflow issue, I tried parameters { key: "precision_mode" value: "BF16" }, but this is not currently supported:

I0620 11:38:08.153076 1 server.cc:619]
+-------------------+---------+-------------------------------------------------------------------------------+
| Model             | Version | Status                                                                        |
+-------------------+---------+-------------------------------------------------------------------------------+
| bert-base-chinese | 1       | UNAVAILABLE: Invalid argument: unsupported precision mode 'BF16' is requested |
+-------------------+---------+-------------------------------------------------------------------------------+
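Until a BF16 precision mode is exposed through the Triton configuration, one way to reproduce the FP16 overflow outside Triton (and to experiment with precision settings) is to run the same model through ONNX Runtime's TensorRT execution provider directly. A rough sketch, assuming onnxruntime-gpu built with TensorRT support and the trt_fp16_enable provider option; paths and inputs are placeholders:

import numpy as np
import onnxruntime as ort

providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,               # mirrors precision_mode: "FP16" above
        "trt_max_workspace_size": 1073741824,  # mirrors max_workspace_size_bytes
    }),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path

input_ids = np.array([[101, 2769, 102]], dtype=np.int64)  # placeholder token IDs
feeds = {
    "input_ids": input_ids,
    "attention_mask": np.ones_like(input_ids),
    "token_type_ids": np.zeros_like(input_ids),
}
outputs = sess.run(["last_hidden_state", "1525"], feeds)
print(outputs[0].shape)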

Describe the solution you'd like
Integrated BF16 precision support via config.pbtxt.


kthui commented 1 year ago

Thanks for filing the feature request. I have created a ticket for us to investigate further (DLIS-5045).