Is your feature request related to a problem? Please describe.
Feature request: BF16 support for integrated TensorRT precision mode
Use case:
A trained BERT-based model in ONNX format works in FP32 precision mode in Triton Inference Server. With FP16 precision, Triton Inference Server raises an overflow exception:
2023-04-30 05:00:53.077756080 [E:onnxruntime:log, tensorrt_execution_provider.h:51 log] [2023-04-30 05:00:53 ERROR] 3: [weightConvertors.cpp::operator()::562] Error Code 3: Miscellaneous (Weights [name=/Constant_2_output_0 + (Unnamed Layer* 81) [Shuffle]{ForeignNode[onnx::MatMul_1532 + (Unnamed Layer* 100) [Shuffle].../encoder/layer.11/output/LayerNorm/Add_1]}] has value -3.40282e+38 outside of FP16 range. A possible fix is to retrain the model with regularization to reduce the magnitude of the weights, or if the intent is to express -infinity, use -infinity instead.)
Signal (11) received.
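The config.pbtxt looks roughly like this (a minimal sketch: the model name comes from the status table below, `max_batch_size` and the workspace size are illustrative, input/output specs are omitted, and the optimization block follows the ONNX Runtime backend's documented TensorRT execution-accelerator settings):

```
name: "bert-base-chinese"
platform: "onnxruntime_onnx"
max_batch_size: 8

# Input/output specs omitted from this sketch.
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        # FP16 here triggers the overflow above; FP32 works.
        parameters { key: "precision_mode" value: "FP16" }
        parameters { key: "max_workspace_size_bytes" value: "1073741824" }
      }
    ]
  }
}
```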
To address the overflow issue, I tried `parameters { key: "precision_mode" value: "BF16" }`, but it is not supported yet:
I0620 11:38:08.153076 1 server.cc:619]
+-------------------+---------+-------------------------------------------------------------------------------+
| Model | Version | Status |
+-------------------+---------+-------------------------------------------------------------------------------+
| bert-base-chinese | 1 | UNAVAILABLE: Invalid argument: unsupported precision mode 'BF16' is requested |
+-------------------+---------+-------------------------------------------------------------------------------+
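For context on why BF16 would help: the offending constant is on the order of FP32's maximum (-3.40282e+38), far outside FP16's range (max ~65504). BF16 keeps FP32's 8 exponent bits, so values of that scale remain representable. A quick check, assuming PyTorch is available:

```python
import torch

# Largest finite value per dtype:
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.3895e+38 (8 exponent bits, like FP32)
print(torch.finfo(torch.float32).max)   # ~3.4028e+38

x = torch.tensor(-1.0e38)               # a weight of the scale seen in the error log
print(x.to(torch.float16))              # -inf: overflows FP16
print(x.to(torch.bfloat16))             # ~-9.97e+37: stays finite in BF16
```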
Describe the solution you'd like
Integrated BF16 precision support via config.pbtxt, e.g. as sketched below.
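A minimal sketch of what that could look like, mirroring the existing FP16 `precision_mode` value (BF16 is the proposed addition, not a currently accepted value, as the log above shows):

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator: [
      {
        name: "tensorrt"
        # Proposed value; currently rejected with "unsupported precision mode 'BF16'".
        parameters { key: "precision_mode" value: "BF16" }
      }
    ]
  }
}
```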
Describe alternatives you've considered
Additional context