Based on my experience, I suggest exporting an FP32 ONNX model and then using a tool to convert it to an FP16 (or mixed-precision) model. You will need to configure which parts of the model are computed in FP16 and which parts stay in FP32 to preserve enough accuracy.
An example conversion tool: https://github.com/microsoft/onnxconverter-common/blob/master/onnxconverter_common/float16.py
Hello, I tried using onnxconverter_common to convert the model. The code is shown below:
import onnxmltools
from onnxconverter_common.float16 import convert_float_to_float16

input_onnx_model = 'automatic_test.onnx'
out_onnx_model = 'automatic_16.onnx'
onnx_model = onnxmltools.utils.load_model(input_onnx_model)
onnx_model = convert_float_to_float16(onnx_model)
onnxmltools.utils.save_model(onnx_model, out_onnx_model)
but I got an error.
(Screenshots: the float32 ONNX model vs. the float16 ONNX model)
I don't know why some new operations are introduced or how to solve this. Looking forward to your help.
Besides, when I use onnxruntime.quantization to quantize the float32 model, I can produce the model correctly, but the INT8 model runs much slower than the float32 model on both CPU and GPU. I wonder why?
I also get some errors when I use benchmark_t5 to compare CPU/GPU IO-binding performance with fp16, while fp32 works fine. The errors look like this:
Exception Traceback (most recent call last):
  File "/home/studio-lab-user/onnxruntime/onnxruntime/python/tools/transformers/t5/benchmark_t5.py", line 254, in main
    ort_outputs, ort_latency = T5DecoderHelper.onnxruntime_inference(decoder_session, decoder_inputs, args.test_times)
  File "/home/studio-lab-user/onnxruntime/onnxruntime/python/tools/transformers/t5/t5_helper.py", line 552, in onnxruntime_inference
    ort_outputs = ort_session.run(None, ort_inputs)
  File "/home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(float)) , expected: (tensor(float16))
When I change the code of t5_helper.py ("pytorch_inference") to convert the inputs to float16, I get some other errors...
@zhangAlwin if you want to quantize t5 models to int8 please refer to fastT5
@SiChuanJay, the extra "Cast" nodes are expected. Converting a model from FP32 to FP16 inserts Cast nodes so that some nodes compute in FP16 instead of FP32.
convert_float_to_float16 has parameters that let you specify whether the inputs/outputs or particular nodes shall be kept in FP32.
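For example, here is a minimal sketch of that idea; the file names and the blocked op list are placeholders, and the keep_io_types / op_block_list parameters are available in recent onnxconverter-common releases:

import onnx
from onnxconverter_common.float16 import convert_float_to_float16

# Load the FP32 model exported from PyTorch (placeholder path).
model_fp32 = onnx.load('automatic_test.onnx')

# Convert weights/activations to FP16, but keep the graph inputs/outputs in FP32
# and leave the listed op types (assumed accuracy-sensitive here) in FP32.
model_fp16 = convert_float_to_float16(
    model_fp32,
    keep_io_types=True,
    op_block_list=['Resize'],
)

onnx.save(model_fp16, 'automatic_16.onnx')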
INT8 quantization does not have a GPU implementation, so the current quantization is for CPU only. I believe there are options to configure which parts of the model get quantized. You can tune those options (for example, quantize only some operators) to see whether performance improves.
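As an illustration only, here is a sketch using onnxruntime's dynamic quantization API; the paths are placeholders and the exact keyword arguments (such as op_types_to_quantize) depend on your onnxruntime version:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic (weight-only) INT8 quantization. Restricting which op types get
# quantized is one of the knobs to try when the INT8 model ends up slower than FP32.
quantize_dynamic(
    model_input='automatic_test.onnx',   # placeholder FP32 model
    model_output='automatic_int8.onnx',  # placeholder output path
    op_types_to_quantize=['MatMul'],     # quantize only MatMul, leave the rest in FP32
    weight_type=QuantType.QInt8,
)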
@zhangAlwin, the cause might be that convert_float_to_float16 was updated after the T5 benchmark was written. Since the script complains that the input type does not match, you can either change the model so its inputs are FP16 (using convert_float_to_float16) or change the input data type in your script.
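For the second option, a minimal sketch (assuming ort_inputs is a dict of numpy arrays fed to ort_session.run, as in t5_helper.py) of casting the float feeds before the run call:

import numpy as np

# Cast only the float32 feeds to float16 so they match the FP16 model inputs;
# integer feeds such as input_ids are left untouched.
ort_inputs = {
    name: (value.astype(np.float16) if value.dtype == np.float32 else value)
    for name, value in ort_inputs.items()
}
ort_outputs = ort_session.run(None, ort_inputs)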
@Ki6an I come from fastt5's GitHub issues (this one -> https://github.com/Ki6an/fastT5/issues/34). First, INT8 quantization does not have a GPU implementation, so the current quantization is CPU only. Second, I'm not sure whether fastt5's fp16 precision can use GPU acceleration. In conclusion, fastt5 is a good project for accelerating t5.
@zhangAlwin fastt5 currently supports only CPU inference with quantization; I am working on implementing GPU support with CUDA and TensorRT.
@Ki6an Come on, man. I'm confused about how much the performance of a seq2seq (or encoder-decoder) model can be improved using TensorFlow Serving/TorchServe/ONNX serving with GPU. For example, so far in my tests with batch size 32, max_seq_length 200, and no beam search, the fastest speed for generating a summary with t5-small is about 150ms. I'm not sure whether fp16 precision can accelerate it.
@tianleiwu When I change the inputs to float16, I get errors like "RuntimeError: expected scalar type Half but found Float", damn it. I hope you can fix it when you have time.
@SiChuanJay can you confirm that the PyTorch model outputs float16 tensors (before conversion to ONNX)?
If so, then this may be a bug in torch.onnx.export.
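To check that, here is a minimal sketch; model, src, rec, and downsample_ratio stand for whatever you pass to torch.onnx.export, so adjust the call to your own signature:

import torch

# Run the same inputs you pass to torch.onnx.export and inspect the output dtypes.
with torch.no_grad():
    outputs = model(src, *rec, downsample_ratio)

for i, out in enumerate(outputs):
    # If any of these print torch.float32 while the inputs are torch.float16,
    # the FP32 output in the exported graph comes from the model itself,
    # not from torch.onnx.export.
    print(i, out.dtype)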
@Alwin4Zhang ideally you should use float16 inputs to get float16 outputs from PyTorch, which would then be exported to ONNX as float16 too. However, many PyTorch operators do not allow mixing fp16 and fp32 inputs and will raise something like "RuntimeError: expected scalar type Half but found Float", as you described. It might be a bug or by design.
From one of your screenshots, it seems the operator in question is Resize-11. Is that the case? If it is, the issue could be that for opset 11 (and 13, actually), the scales input is always a tensor(float), whereas the X input can be any of tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(float16), tensor(float), tensor(double), tensor(string), tensor(bool), tensor(complex64), tensor(complex128). Multiplying X.half() by scales.float() would fail with a type mismatch.
To test this theory, first try changing this:
def export(self):
    rec = (torch.zeros([1, 1, 1, 1]).to(self.args.device, self.precision),) * 4
    src = torch.randn(1, 3, 1080, 1920).to(self.args.device, self.precision)
    downsample_ratio = torch.tensor([0.25]).to(self.args.device, dtype=self.precision)
    dynamic_spatial = {0: 'batch_size', 2: 'height', 3: 'width'}
    dynamic_everything = {0: 'batch_size', 1: 'channels', 2: 'height', 3: 'width'}
    torch.onnx.export(
        self.model,
        (src, *rec, downsample_ratio),
        self.args.output,
        export_params=True,
        opset_version=self.args.opset,
        do_constant_folding=True,
        input_names=['src', 'r1i', 'r2i', 'r3i', 'r4i', 'downsample_ratio'],
        output_names=['fgr', 'pha', 'r1o', 'r2o', 'r3o', 'r4o'],
        dynamic_axes={
            'src': dynamic_spatial,
            'fgr': dynamic_spatial,
            'pha': dynamic_spatial,
            'r1i': dynamic_everything,
            'r2i': dynamic_everything,
            'r3i': dynamic_everything,
            'r4i': dynamic_everything,
            'r1o': dynamic_spatial,
            'r2o': dynamic_spatial,
            'r3o': dynamic_spatial,
            'r4o': dynamic_spatial,
        })
to
def export(self):
    rec = (torch.zeros([1, 1, 1, 1]).to(self.args.device, self.precision),) * 4
    src = torch.randn(1, 3, 1080, 1920).to(self.args.device, self.precision)
    downsample_ratio = torch.tensor([0.25]).to(self.args.device, dtype=self.precision)
    dynamic_spatial = {0: 'batch_size', 2: 'height', 3: 'width'}
    dynamic_everything = {0: 'batch_size', 1: 'channels', 2: 'height', 3: 'width'}
    with torch.autocast(device_type=str(self.args.device), enabled=True):
        torch.onnx.export(
            self.model,
            (src, *rec, downsample_ratio),
            self.args.output,
            export_params=True,
            opset_version=self.args.opset,
            do_constant_folding=True,
            input_names=['src', 'r1i', 'r2i', 'r3i', 'r4i', 'downsample_ratio'],
            output_names=['fgr', 'pha', 'r1o', 'r2o', 'r3o', 'r4o'],
            dynamic_axes={
                'src': dynamic_spatial,
                'fgr': dynamic_spatial,
                'pha': dynamic_spatial,
                'r1i': dynamic_everything,
                'r2i': dynamic_everything,
                'r3i': dynamic_everything,
                'r4i': dynamic_everything,
                'r1o': dynamic_spatial,
                'r2o': dynamic_spatial,
                'r3o': dynamic_spatial,
                'r4o': dynamic_spatial,
            })
With this change (wrapping the export call in a torch.autocast() context manager), the mismatch between fp16 and fp32 should go away. However, the output can be either fp16 or fp32 depending on the actual implementation of the operator in question. Some ops create the output node with a dtype matching the first input's dtype, so passing input.half() as input would create a float16 output. Other ops just hard-code the type.
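One way to verify what the exported graph ended up with, sketched here with the plain onnx API (the model path is a placeholder):

import onnx
from onnx import TensorProto

model = onnx.load('model_fp16.onnx')  # placeholder path

# Print the element type of every graph input and output;
# FLOAT16 inputs with a FLOAT output would confirm the mismatch described above.
for value in list(model.graph.input) + list(model.graph.output):
    elem_type = value.type.tensor_type.elem_type
    print(value.name, TensorProto.DataType.Name(elem_type))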
Try the code above and give us an actual model that reproduces the issue. from model.model import MattingNetwork does not help us reproduce the issue or provide an accurate answer.
I am assuming either 1) my suggestion above worked or 2) the issue is no longer relevant, because there was no response from the author after 9 months. Thus, I will close it. Feel free to reopen it if it is still relevant, with instructions on where to get the model that is imported by the script.
Hello, when I try to export the PyTorch model as an ONNX model with FLOAT16 precision, the input in the ONNX structure diagram is float16 but the output is still float32, as shown below, and an error is reported at runtime.
Thanks for your help!