microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Inference using the CUDA EP returns nan #15752

Open · omera-nv opened this issue 1 year ago

omera-nv commented 1 year ago

Describe the issue

I have an ONNX model (a T5 encoder that I exported from PyTorch and then converted to FP16 using onnxruntime.transformers.float16.convert_float_to_float16). When I use this model in an inference session with the CPU EP, it works flawlessly, but running the same model in a session that uses the CUDA EP returns all NaNs as output. Edit: I tried the TRT EP and it fails as well (returns all zeros).

I'm aware of https://github.com/microsoft/onnxruntime/issues/9629, https://github.com/microsoft/onnxruntime/issues/831 and https://github.com/microsoft/onnxruntime/issues/11384, but they all seem either very model-specific or return NaNs on the CPU EP as well, which is not my case.

To reproduce

I wrote this small snippet to reproduce (I hope the issue is not my reliance on the NVIDIA pip packages). The ONNX model can be downloaded from here: https://drive.google.com/drive/folders/1AMNI_cRYn31owMstIvdsW4IcOcRAYvC_?usp=share_link

pip install onnxruntime-gpu nvidia-cuda-runtime-cu11 nvidia-cufft-cu11 nvidia-curand-cu11 nvidia-cublas-cu11 nvidia-cudnn-cu11

#!/usr/bin/env python3
import ctypes
from pathlib import Path
import numpy as np

def print_cuda_ep_libs_versions():
    import importlib.metadata

    for lib in [
        "numpy",
        "onnxruntime-gpu",
        "tensorrt",
        "nvidia-cuda-runtime-cu11",
        "nvidia-cufft-cu11",
        "nvidia-curand-cu11",
        "nvidia-cublas-cu11",
        "nvidia-cudnn-cu11",
    ]:
        print(lib, importlib.metadata.version(lib))

# Preload the CUDA/cuDNN shared libraries shipped in the NVIDIA pip wheels so the
# CUDA and TRT EPs can find them without a system-wide CUDA install.
def load_cuda_ep_native_deps():
    import nvidia.cuda_runtime.lib
    import nvidia.cufft.lib
    import nvidia.curand.lib
    import nvidia.cublas.lib
    import nvidia.cudnn.lib

    load_native_lib(Path(nvidia.cuda_runtime.lib.__path__[0]) / "libcudart.so.11.0")
    load_native_lib(Path(nvidia.cufft.lib.__path__[0]) / "libcufft.so.10")
    load_native_lib(Path(nvidia.curand.lib.__path__[0]) / "libcurand.so.10")
    load_native_lib(Path(nvidia.cublas.lib.__path__[0]) / "libcublas.so.11")
    load_native_lib(Path(nvidia.cublas.lib.__path__[0]) / "libcublasLt.so.11")
    load_native_lib(Path(nvidia.cudnn.lib.__path__[0]) / "libcudnn.so.8")

def load_native_lib(library_path):
    ctypes.CDLL(library_path, mode=ctypes.RTLD_GLOBAL)

if __name__ == "__main__":
    print_cuda_ep_libs_versions()
    load_cuda_ep_native_deps()
    import tensorrt
    import onnxruntime as ort

    ort_inputs = {"input_ids": np.ones((1, 256), dtype=np.int64), "attention_mask": np.ones((1, 256), dtype=np.int64)}

    trt_sess = ort.InferenceSession(
        "t5_fp16_encoder.onnx",
        providers=[
            ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ],
    )
    cuda_sess = ort.InferenceSession(
        "t5_fp16_encoder.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
    )
    cpu_sess = ort.InferenceSession("t5_fp16_encoder.onnx", providers=["CPUExecutionProvider"])

    print("TRT:", trt_sess.run(None, ort_inputs))
    print("CUDA:", cuda_sess.run(None, ort_inputs))
    print("CPU:", cpu_sess.run(None, ort_inputs))

I'm using CUDA 11.7 on Ubuntu 22.04.
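
To make the failure explicit, here is a quick follow-up check on the sessions above (a small sketch reusing cpu_sess, cuda_sess and ort_inputs from the script; it only compares the first output):

cpu_out = cpu_sess.run(None, ort_inputs)[0]
cuda_out = cuda_sess.run(None, ort_inputs)[0]
# CPU EP output is finite; the CUDA EP output comes back as all NaN.
print("CUDA NaNs:", np.isnan(cuda_out).any(), "CPU NaNs:", np.isnan(cpu_out).any())
print("max |CPU - CUDA|:", np.abs(cpu_out.astype(np.float32) - cuda_out.astype(np.float32)).max())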

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.14.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU, CUDA

Execution Provider Library Version

CUDA 11.7 TRT 8.5.3.1

wangyems commented 1 year ago

How about running the FP32 model with the CUDA EP? If FP32 is good, then you can try mixed-precision conversion by specifying an op_block_list. code example
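
For reference, a rough sketch of what such a mixed-precision conversion could look like with onnxruntime.transformers.float16.convert_float_to_float16 (not the linked example; the file names and the blocked ops below are placeholders):

import onnx
from onnxruntime.transformers.float16 import convert_float_to_float16

# Convert weights and ops to FP16, but keep the listed ops (and the graph I/O) in FP32.
fp32_model = onnx.load("t5_fp32_encoder.onnx")  # placeholder path
fp16_model = convert_float_to_float16(
    fp32_model,
    keep_io_types=True,  # keep graph inputs/outputs in FP32
    op_block_list=["Pow", "ReduceMean", "Sqrt"],  # placeholder list of overflow-prone ops
)
onnx.save(fp16_model, "t5_fp16_encoder_blocked.onnx")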

tianleiwu commented 1 year ago

The CPU EP runs the model in FP32, so it is fine. Based on dumping node outputs, it seems SimplifiedLayerNormalization has an issue in FP16. You can put it in the op_block_list.

SimplifiedLayerNormalization node: SimplifiedLayerNormalization_token_210
Input 0 Name: /model/block.6/layer.1/Add_output_0
 Shape: {1,256,512}
OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
-23.15625, 106.1875, -46.09375, ... , -136, -44.75, 5416
-23.1875, 106.1875, -46.125, ... , -136, -44.8125, 5416
-23.15625, 106.1875, -46.125, ... , -136, -44.75, 5416
...
-23.125, 106.1875, -46.125, ... , -136, -44.75, 5416
-23.15625, 106.125, -46.125, ... , -136, -44.75, 5416
-23.21875, 106.1875, -46.09375, ... , -136, -44.75, 5416

Input 1 Name: model.block.7.layer.0.layer_norm.weight
 Shape: {512}
OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
0.22058105, 0.18444824, 0.1887207, ... , 0.17089844, 0.18896484, 0.098571777

Placement: CUDAExecutionProvider
-----------
Output 0 Name: /model/block.7/layer.0/layer_norm/Mul_1_output_0
 Shape: {1,256,512}
OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
-0, 0, -0, ... , -0, -0, 0
-0, 0, -0, ... , -0, -0, 0
-0, 0, -0, ... , -0, -0, 0
...
-0, 0, -0, ... , -0, -0, 0
-0, 0, -0, ... , -0, -0, 0
-0, 0, -0, ... , -0, -0, 0

Min=-0,Max=-0,Zero=131072

omera-nv commented 1 year ago

SimplifiedLayerNormalization

Is this an actual ONNX op, or some CUDA kernel that results from fusion? I can't find this op in my graph or in https://github.com/onnx/onnx/blob/main/docs/Operators.md.

Following @wangyems's advice, I was able to convert to FP16 and run inference with the CUDA EP using the following op_block_list:

FP16_BAD_OPS = [
    "Add",
    "MatMul",
    "Mul",
    "Pow",
    "ReduceMean",
    "Sqrt",
]

Removing any of these ops from the list results in NaN or all-zero output (I uploaded a new model with these ops blocked to the Google Drive). However, I'm still getting all zeros from the TRT EP even with these ops blocked.

tianleiwu commented 1 year ago

The op comes from fusion. You need to run fusion before converting to FP16.

BTW, we have scripts that can help export T5 to FP16 or use it in beam search: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/t5/convert_to_onnx.py https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/convert_generation.py

For example,

python -m onnxruntime.transformers.models.t5.convert_to_onnx -m t5-small -o -p fp16 --use_gpu --separate_encoder_and_decoder_init

This is the op_block_list we used: https://github.com/microsoft/onnxruntime/blob/abdd4f518a144035fee3b369996d8416a024bdaa/onnxruntime/python/tools/transformers/models/t5/t5_helper.py#L153-L157
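
Roughly, that order of operations in Python might look like the sketch below (assuming your ORT version's optimizer supports model_type="t5"; num_heads=8 and hidden_size=512 are the t5-small values, and the op list is only illustrative; use the one from the linked t5_helper.py):

from onnxruntime.transformers.optimizer import optimize_model

# Run the transformer fusions first (this is what creates SimplifiedLayerNormalization),
# then convert to FP16 while keeping the overflow-prone ops in FP32.
opt = optimize_model("t5_fp32_encoder.onnx", model_type="t5", num_heads=8, hidden_size=512)
opt.convert_float_to_float16(
    keep_io_types=True,
    op_block_list=["SimplifiedLayerNormalization", "SkipSimplifiedLayerNormalization", "Add"],
)
opt.save_model_to_file("t5_fp16_encoder_fused.onnx")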

omera-nv commented 1 year ago

Thanks @tianleiwu ! Will definitely take a look. Do you have any clue about what might be wrong with the TRT EP?

tianleiwu commented 1 year ago

@omera-deci, for TRT, you need to use the raw FP32 ONNX model. TRT will convert it to FP16 internally.

omera-nv commented 1 year ago

@tianleiwu I just tried giving the TRT EP the FP32 model. If I don't enable FP16, everything works smoothly, but once I enable FP16 the output is all zeros again. I've uploaded the FP32 model to the drive, along with a new script to reproduce. I guess some layers are overflowing in TRT as well. Is there any way I can block their conversion the same way I did with ONNX?

tianleiwu commented 1 year ago

@omera-deci, you can follow https://github.com/NVIDIA/TensorRT/blob/release/8.6/demo/HuggingFace/T5 to export ONNX for T5 and run it in the TRT EP. I did not see any special settings, so the ONNX export might be the key. You can run those scripts and use the resulting ONNX models in the TRT EP.

You will need to build from source to support TRT 8.6 and use some new features (like trt_layer_norm_fp32_fallback and explicit input profiles). See the following doc for details: https://github.com/microsoft/onnxruntime/blob/fd080caf62db1b41463955286c49d6a582c6a45a/docs/execution-providers/TensorRT-ExecutionProvider.md @chilo-ms for comments on FP16 in the TRT EP.
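
For reference, wiring those options up from Python could look roughly like this (a sketch only; trt_layer_norm_fp32_fallback and the explicit profile options need an ORT build against TRT 8.6, and the paths and profile shapes are placeholders):

import onnxruntime as ort

trt_options = {
    "trt_fp16_enable": True,
    # Keep layer-norm computation in FP32 to avoid FP16 overflow.
    "trt_layer_norm_fp32_fallback": True,
    # Explicit min/opt/max shape profiles for the dynamic inputs (placeholder shapes).
    "trt_profile_min_shapes": "input_ids:1x1,attention_mask:1x1",
    "trt_profile_opt_shapes": "input_ids:1x256,attention_mask:1x256",
    "trt_profile_max_shapes": "input_ids:1x512,attention_mask:1x512",
}
sess = ort.InferenceSession(
    "t5_fp32_encoder.onnx",  # give TRT the FP32 model; it casts to FP16 internally
    providers=[("TensorrtExecutionProvider", trt_options), "CUDAExecutionProvider", "CPUExecutionProvider"],
)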

changdong1687 commented 2 months ago

(quoting @omera-nv's earlier comment about FP16_BAD_OPS in full)

Hello, I converted an FP32 model to an FP16 model and hit a similar problem with ONNX Runtime inference, but I don't know what FP16_BAD_OPS is or where it should go. Best wishes for your reply.

tianleiwu commented 2 months ago

@changdong1687, see the example script: https://github.com/microsoft/onnxruntime/blob/2580d935cbecd756cef435fb173a2f10237e9d2a/onnxruntime/python/tools/transformers/models/t5/t5_helper.py#L152-L217 You can define your own op_block_list for your model.

changdong1687 commented 2 months ago

(quoting @tianleiwu's reply above)

Ok, got it, thank you!