Based on my experience, I suggest exporting an FP32 ONNX model and then using a tool to convert it to an FP16 (or mixed-precision) model. You will need to configure which parts of the model are computed in FP16 and which parts stay in FP32 to preserve enough accuracy.
An example conversion tool: https://github.com/microsoft/onnxconverter-common/blob/master/onnxconverter_common/float16.py
Hello, I tried using onnxconverter_common to convert the model. The code is shown below:
import onnxmltools
from onnxconverter_common.float16 import convert_float_to_float16

input_onnx_model = 'automatic_test.onnx'
out_onnx_model = 'automatic_16.onnx'
onnx_model = onnxmltools.utils.load_model(input_onnx_model)
onnx_model = convert_float_to_float16(onnx_model)
onnxmltools.utils.save_model(onnx_model, out_onnx_model)
but I got an error.
(Screenshots: the float32 ONNX model vs. the float16 ONNX model)
I don't know why some new operations are introduced or how to solve this. Looking forward to your help.
Besides, when I use onnxruntime.quantization to quantize the float32 model, I can produce the model correctly, but the INT8 model runs much slower than the float32 model on both CPU and GPU. I wonder why?
I also get some errors when I use benchmark_t5 to compare CPU/GPU IO-binding performance with fp16, while fp32 works fine. The errors look like this:
Exception Traceback (most recent call last):
  File "/home/studio-lab-user/onnxruntime/onnxruntime/python/tools/transformers/t5/benchmark_t5.py", line 254, in main
    ort_outputs, ort_latency = T5DecoderHelper.onnxruntime_inference(decoder_session, decoder_inputs, args.test_times)
  File "/home/studio-lab-user/onnxruntime/onnxruntime/python/tools/transformers/t5/t5_helper.py", line 552, in onnxruntime_inference
    ort_outputs = ort_session.run(None, ort_inputs)
  File "/home/studio-lab-user/.conda/envs/default/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(float)) , expected: (tensor(float16))
When I change the code of t5_helper.py ("pytorch_inference") to convert the inputs to float16, I get some other errors...
@zhangAlwin if you want to quantize t5 models to int8 please refer to fastT5
@SiChuanJay, the extra "Cast" nodes are expected. Converting a model from FP32 to FP16 inserts Cast nodes so that some nodes compute in FP16 instead of FP32.
convert_float_to_float16 has parameters that let you specify whether the inputs/outputs or particular nodes shall be kept in FP32.
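For example, here is a minimal sketch of that idea; the file names and the blocked op list are placeholders, and the keep_io_types / op_block_list parameters are available in recent onnxconverter-common releases:

import onnx
from onnxconverter_common.float16 import convert_float_to_float16

# Load the FP32 model exported from PyTorch (placeholder path).
model_fp32 = onnx.load('automatic_test.onnx')

# Convert weights/activations to FP16, but keep the graph inputs/outputs in FP32
# and leave the listed op types (assumed accuracy-sensitive here) in FP32.
model_fp16 = convert_float_to_float16(
    model_fp32,
    keep_io_types=True,
    op_block_list=['Resize'],
)

onnx.save(model_fp16, 'automatic_16.onnx')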
INT8 quantization does not have a GPU implementation, so the current quantization is for CPU only. I believe there are options to configure which parts of the model get quantized. You can tune those options (for example, quantize only some operators) to see whether performance improves.
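As an illustration only, here is a sketch using onnxruntime's dynamic quantization API; the paths are placeholders and the exact keyword arguments (such as op_types_to_quantize) depend on your onnxruntime version:

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic (weight-only) INT8 quantization. Restricting which op types get
# quantized is one of the knobs to try when the INT8 model ends up slower than FP32.
quantize_dynamic(
    model_input='automatic_test.onnx',   # placeholder FP32 model
    model_output='automatic_int8.onnx',  # placeholder output path
    op_types_to_quantize=['MatMul'],     # quantize only MatMul, leave the rest in FP32
    weight_type=QuantType.QInt8,
)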
@zhangAlwin, the cause might be that convert_float_to_float16 was updated after the T5 benchmark was written. Since the script complains that the input type does not match, you can either change the model so its inputs are FP16 (using convert_float_to_float16) or change the input data type in your script.
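For the second option, a minimal sketch (assuming ort_inputs is a dict of numpy arrays fed to ort_session.run, as in t5_helper.py) of casting the float feeds before the run call:

import numpy as np

# Cast only the float32 feeds to float16 so they match the FP16 model inputs;
# integer feeds such as input_ids are left untouched.
ort_inputs = {
    name: (value.astype(np.float16) if value.dtype == np.float32 else value)
    for name, value in ort_inputs.items()
}
ort_outputs = ort_session.run(None, ort_inputs)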
@Ki6an I come from fastt5's GitHub issues (this one -> https://github.com/Ki6an/fastT5/issues/34). First, INT8 quantization does not have a GPU implementation, so the current quantization is CPU only. Second, I'm not sure whether fastt5's fp16 precision can use GPU acceleration. In conclusion, fastt5 is a good project for accelerating t5.
@zhangAlwin fastt5 currently supports only CPU inference with quantization; I am working on implementing GPU support with CUDA and TensorRT.
@Ki6an Come on, man. I'm confused about how much the performance of a seq2seq (or encoder-decoder) model can be improved using TensorFlow Serving/TorchServe/ONNX serving with GPU. For example, so far in my tests with batch size 32, max_seq_length 200, and no beam search, the fastest speed for generating a summary with t5-small is about 150ms. I'm not sure whether fp16 precision can accelerate it.
@tianleiwu When I change the inputs to float16, I get errors like "RuntimeError: expected scalar type Half but found Float", damn it. I hope you can fix it when you have time.
@SiChuanJay can you confirm that the PyTorch model outputs float16 tensors (before conversion to ONNX)?
If so, then this may be a bug in torch.onnx.export.
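To check that, here is a minimal sketch; model, src, rec, and downsample_ratio stand for whatever you pass to torch.onnx.export, so adjust the call to your own signature:

import torch

# Run the same inputs you pass to torch.onnx.export and inspect the output dtypes.
with torch.no_grad():
    outputs = model(src, *rec, downsample_ratio)

for i, out in enumerate(outputs):
    # If any of these print torch.float32 while the inputs are torch.float16,
    # the FP32 output in the exported graph comes from the model itself,
    # not from torch.onnx.export.
    print(i, out.dtype)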
@Alwin4Zhang ideally you should use float16 inputs to get float16 outputs from PyTorch, which would then be exported to ONNX as float16 too. However, many PyTorch operators do not allow mixing fp16 and fp32 inputs and will raise something like "RuntimeError: expected scalar type Half but found Float", as you described. It might be a bug or by design.
From one of your screenshots, it seems the operator in question is Resize-11. Is that the case? If it is, the issue could be that for opset 11 (and 13, actually), the scales input is always a tensor(float), whereas the X input can be any of tensor(uint8), tensor(uint16), tensor(uint32), tensor(uint64), tensor(int8), tensor(int16), tensor(int32), tensor(int64), tensor(float16), tensor(float), tensor(double), tensor(string), tensor(bool), tensor(complex64), tensor(complex128). Multiplying X.half() by scales.float() would fail with a type mismatch.
To test this theory, first try changing this:
def export(self):
    rec = (torch.zeros([1, 1, 1, 1]).to(self.args.device, self.precision),) * 4
    src = torch.randn(1, 3, 1080, 1920).to(self.args.device, self.precision)
    downsample_ratio = torch.tensor([0.25]).to(self.args.device, dtype=self.precision)
    dynamic_spatial = {0: 'batch_size', 2: 'height', 3: 'width'}
    dynamic_everything = {0: 'batch_size', 1: 'channels', 2: 'height', 3: 'width'}
    torch.onnx.export(
        self.model,
        (src, *rec, downsample_ratio),
        self.args.output,
        export_params=True,
        opset_version=self.args.opset,
        do_constant_folding=True,
        input_names=['src', 'r1i', 'r2i', 'r3i', 'r4i', 'downsample_ratio'],
        output_names=['fgr', 'pha', 'r1o', 'r2o', 'r3o', 'r4o'],
        dynamic_axes={
            'src': dynamic_spatial,
            'fgr': dynamic_spatial,
            'pha': dynamic_spatial,
            'r1i': dynamic_everything,
            'r2i': dynamic_everything,
            'r3i': dynamic_everything,
            'r4i': dynamic_everything,
            'r1o': dynamic_spatial,
            'r2o': dynamic_spatial,
            'r3o': dynamic_spatial,
            'r4o': dynamic_spatial,
        })
to
def export(self):
    rec = (torch.zeros([1, 1, 1, 1]).to(self.args.device, self.precision),) * 4
    src = torch.randn(1, 3, 1080, 1920).to(self.args.device, self.precision)
    downsample_ratio = torch.tensor([0.25]).to(self.args.device, dtype=self.precision)
    dynamic_spatial = {0: 'batch_size', 2: 'height', 3: 'width'}
    dynamic_everything = {0: 'batch_size', 1: 'channels', 2: 'height', 3: 'width'}
    with torch.autocast(device_type=str(self.args.device), enabled=True):
        torch.onnx.export(
            self.model,
            (src, *rec, downsample_ratio),
            self.args.output,
            export_params=True,
            opset_version=self.args.opset,
            do_constant_folding=True,
            input_names=['src', 'r1i', 'r2i', 'r3i', 'r4i', 'downsample_ratio'],
            output_names=['fgr', 'pha', 'r1o', 'r2o', 'r3o', 'r4o'],
            dynamic_axes={
                'src': dynamic_spatial,
                'fgr': dynamic_spatial,
                'pha': dynamic_spatial,
                'r1i': dynamic_everything,
                'r2i': dynamic_everything,
                'r3i': dynamic_everything,
                'r4i': dynamic_everything,
                'r1o': dynamic_spatial,
                'r2o': dynamic_spatial,
                'r3o': dynamic_spatial,
                'r4o': dynamic_spatial,
            })
With this change (wrapping the export call in a torch.autocast() context manager), the mismatch between fp16 and fp32 should go away. However, the output can be either fp16 or fp32 depending on the actual implementation of the operator in question. Some ops create the output node with a dtype matching the first input's dtype, so passing input.half() as input would create a float16 output. Other ops just hard-code the type.
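One way to verify what the exported graph ended up with, sketched here with the plain onnx API (the model path is a placeholder):

import onnx
from onnx import TensorProto

model = onnx.load('model_fp16.onnx')  # placeholder path

# Print the element type of every graph input and output;
# FLOAT16 inputs with a FLOAT output would confirm the mismatch described above.
for value in list(model.graph.input) + list(model.graph.output):
    elem_type = value.type.tensor_type.elem_type
    print(value.name, TensorProto.DataType.Name(elem_type))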
Try the code above and give us an actual model that reproduces the issue. from model.model import MattingNetwork does not help us reproduce the issue or provide an accurate answer.
I am assuming either 1) my suggestion above worked or 2) the issue is no longer relevant, because there was no response from the author after 9 months. Thus, I will close it. Feel free to reopen it if it is still relevant, with instructions on where to get the model that is imported by the script.
Hello, when I try to export the PyTorch model as an ONNX model with FLOAT16 precision, the input in the ONNX structure diagram is float16 but the output is still float32, as shown below, and an error is reported at runtime.
Thanks for your help!