onnx / onnx-tensorrt

ONNX-TensorRT: TensorRT backend for ONNX
Apache License 2.0

WARNING:Your ONNX model has been generated with INT64 weights #885

Open FrancescoSaverioZuppichini opened 1 year ago

FrancescoSaverioZuppichini commented 1 year ago

Description

Hello There!

I hope you are all doing well :)

There are other similar issues, but not a single one of them has a fix for this problem.

TensorRT takes a lot of time casting INT64 to INT32, which makes it impossible to use in practice.
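
To see where the INT64 tensors come from, this is roughly how I inspect the exported graph (a quick sketch using the onnx Python API on the model.onnx file produced by the export below):

import onnx

# List every INT64 initializer in the exported graph.
# These are typically shape/index tensors rather than the conv weights.
model = onnx.load("model.onnx")
int64_inits = [
    (init.name, list(init.dims))
    for init in model.graph.initializer
    if init.data_type == onnx.TensorProto.INT64
]
print(f"{len(int64_inits)} INT64 initializers")
for name, dims in int64_inits:
    print(name, dims)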

My conversion code:

import onnxruntime as ort
import torch
from torchvision.models import ConvNeXt_Small_Weights, convnext_small

torch.set_default_tensor_type("torch.FloatTensor")
torch.set_default_tensor_type("torch.cuda.FloatTensor")

model_name = "model.onnx"
# get the model and put it in half precision
model = convnext_small(ConvNeXt_Small_Weights.IMAGENET1K_V1).eval().half().cuda()

with torch.autocast("cuda", dtype=torch.float16):
    x = torch.randn(1, 3, 224, 224, device="cuda")
    # Export the model
    torch.onnx.export(
        model,  # model being run
        x,  # model input (or a tuple for multiple inputs)
        model_name,  # where to save the model (can be a file or file-like object)
        opset_version=16,
        export_params=True,  # store the trained parameter weights inside the model file
        do_constant_folding=True,  # whether to execute constant folding for optimization
        input_names=["image"],  # the model's input names
        output_names=["output"],  # the model's output names
        dynamic_axes={
            "image": {0: "batch_size"},  # variable length axes
            "output": {0: "batch_size"},
        },
    )

# let's check
print("Checking")
x = torch.randn(1, 3, 224, 224, device="cuda")
ort_session = ort.InferenceSession(model_name, providers=["CUDAExecutionProvider"])
outputs = ort_session.run(None, {"image": x.cpu().numpy()})
print(outputs[0].shape, outputs[0].dtype)

The code takes around 3-4 minutes to convert the weights, then it outputs the following:

2022-12-05 13:55:19.402807667 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-05 13:55:19 WARNING] TensorRT encountered issues when converting weights between types and that could affect accuracy.
2022-12-05 13:55:19.402832897 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-05 13:55:19 WARNING] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
2022-12-05 13:55:19.402838637 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-05 13:55:19 WARNING] Check verbose logs for the list of affected weights.
2022-12-05 13:55:19.402844777 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-12-05 13:55:19 WARNING] - 8 weights are affected by this issue: Detected subnormal FP16 values.
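
For reference, this is how I check which weights trigger the subnormal warning (a sketch; 2**-14 is the smallest normal float16 magnitude, and onnx.numpy_helper is used to read the initializers back as arrays):

import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("model.onnx")
fp16_min_normal = 2.0 ** -14  # smallest positive normal float16
for init in model.graph.initializer:
    if init.data_type != onnx.TensorProto.FLOAT16:
        continue
    w = numpy_helper.to_array(init).astype(np.float32)
    subnormal = (w != 0.0) & (np.abs(w) < fp16_min_normal)
    if subnormal.any():
        print(init.name, int(subnormal.sum()), "subnormal values")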

Environment

I am using the latest NVIDIA container.

TensorRT Version: 8.5.1
ONNX-TensorRT Version / Branch:
GPU Type: RTX 3090
Nvidia Driver Version:
CUDA Version: 11.7
CUDNN Version:
Operating System + Version:
Python Version (if applicable): 3.8
TensorFlow + TF2ONNX Version (if applicable): 1.13
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):

Relevant Files

My ONNX model (a ConvNeXt):

link to drive

Steps To Reproduce

  1. Download the ONNX model
  2. Copy this code:

    x = torch.randn(1, 3, 224, 224, device="cuda")
    ort_session = ort.InferenceSession(
        "model.onnx",
        providers=[
            (
                "TensorrtExecutionProvider",
                {
                    "device_id": 0,
                    "trt_fp16_enable": True,
                    "trt_max_workspace_size": 2147483648,
                },
            ),
            "CUDAExecutionProvider",
        ],
    )
    outputs = ort_session.run(None, {"image": x.cpu().numpy()})
  3. Run it with Python
FrancescoSaverioZuppichini commented 1 year ago

Has anyone ever created a mixed precision model?

See https://github.com/microsoft/onnxconverter-common/issues/251 and https://github.com/microsoft/onnxconverter-common/issues/252, but they are not TensorRT related.
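
In case it helps, this is the kind of post-export FP16 conversion I have been trying instead of exporting from a .half() model (a sketch based on onnxconverter-common; "model_fp32.onnx" is a placeholder for a plain FP32 export, and I am assuming keep_io_types behaves as documented):

import onnx
from onnxconverter_common import float16

# Export the model in FP32 first, then convert the weights offline.
fp32_model = onnx.load("model_fp32.onnx")  # placeholder name for an FP32 export
fp16_model = float16.convert_float_to_float16(fp32_model, keep_io_types=True)
onnx.save(fp16_model, "model_fp16.onnx")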

zhenhuaw-me commented 1 year ago

Hi @FrancescoSaverioZuppichini, if I may ask, which issue do you expect to be fixed? The warning can be ignored if your model runs correctly.

FrancescoSaverioZuppichini commented 1 year ago

No, it doesn't. It takes a lot of time every time I want to run the ONNX session due to casting. The issue has shifted from getting TensorRT to work to correctly exporting a mixed-precision model, and it looks like nobody to this date has successfully converted a model from PyTorch to ONNX in mixed precision.

Bruce-320 commented 1 year ago

@FrancescoSaverioZuppichini Hi, I have the same problem as you: inference runs correctly but takes a long time, and I also get the warning that INT64 needs to be converted to INT32. How did you solve the long inference time in the end?

Bruce-320 commented 1 year ago

@FrancescoSaverioZuppichini Probably because of the forced precision conversion, TensorRT is not even as fast as regular CUDA inference.

FrancescoSaverioZuppichini commented 1 year ago

@Bruce-320 TensorRT is always faster than normal CUDA inference, usually at least 2x.

@Bruce-320 I don't know how to solve the problem; apparently no one in the industry has ever run a mixed-precision model on ONNX, lol - so I am still trying to figure it out, and my approach is to ask the software devs directly.

Bruce-320 commented 1 year ago

Thank you for your answer, and I would appreciate it if you could share your progress.

zhenhuaw-me commented 1 year ago

> No, it doesn't. It takes a lot of time every time I want to run the ONNX session due to casting.

@FrancescoSaverioZuppichini Do you mean the weight casting? Could you please share a log of that?

FrancescoSaverioZuppichini commented 1 year ago

@zhenhuaw-me Yes, it takes a lot of time to cast the weights EVERY time I need to run inference. Can you try to reproduce with my code so we can double-check whether there is something weird in my stack?
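
In the meantime I am looking at ONNX Runtime's TensorRT engine cache so the engine is at least not rebuilt on every session (a sketch; I am assuming the trt_engine_cache_enable / trt_engine_cache_path options are available in the ORT build inside the container):

import onnxruntime as ort

providers = [
    (
        "TensorrtExecutionProvider",
        {
            "device_id": 0,
            "trt_fp16_enable": True,
            "trt_max_workspace_size": 2147483648,
            # Cache the built engine on disk so later sessions reload it
            # instead of converting the weights and rebuilding every time.
            "trt_engine_cache_enable": True,
            "trt_engine_cache_path": "./trt_cache",
        },
    ),
    "CUDAExecutionProvider",
]
ort_session = ort.InferenceSession("model.onnx", providers=providers)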