Open omera-nv opened 1 year ago
How about running the FP32 model with the CUDA EP? If FP32 works, you can try mixed-precision conversion by specifying op_block_list (code example).
The CPU EP runs the model in FP32, so it is fine there. Based on dumping node outputs, SimplifiedLayerNormalization seems to have an issue in FP16. You can add it to op_block_list.
SimplifiedLayerNormalization node: SimplifiedLayerNormalization_token_210
Input 0 Name: /model/block.6/layer.1/Add_output_0
Shape: {1,256,512}
OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
-23.15625, 106.1875, -46.09375, ... , -136, -44.75, 5416
-23.1875, 106.1875, -46.125, ... , -136, -44.8125, 5416
-23.15625, 106.1875, -46.125, ... , -136, -44.75, 5416
...
-23.125, 106.1875, -46.125, ... , -136, -44.75, 5416
-23.15625, 106.125, -46.125, ... , -136, -44.75, 5416
-23.21875, 106.1875, -46.09375, ... , -136, -44.75, 5416
Input 1 Name: model.block.7.layer.0.layer_norm.weight
Shape: {512}
OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
0.22058105, 0.18444824, 0.1887207, ... , 0.17089844, 0.18896484, 0.098571777
Placement: CUDAExecutionProvider
-----------
Output 0 Name: /model/block.7/layer.0/layer_norm/Mul_1_output_0
Shape: {1,256,512}
OrtMemoryInfo:[name:Cuda id:0 OrtMemType:0 OrtAllocatorType:1 Device:[DeviceType:1 MemoryType:0 DeviceId:0]]
-0, 0, -0, ... , -0, -0, 0
-0, 0, -0, ... , -0, -0, 0
-0, 0, -0, ... , -0, -0, 0
...
-0, 0, -0, ... , -0, -0, 0
-0, 0, -0, ... , -0, -0, 0
-0, 0, -0, ... , -0, -0, 0
Min=-0,Max=-0,Zero=131072
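The all-zero output above is consistent with FP16 overflow inside the normalization: the last input column is 5416, and its square (29,333,056) exceeds the largest finite FP16 value, 65504. Below is a minimal standard-library sketch of that mechanism. It assumes the node computes x / sqrt(mean(x^2) + eps) * weight with every intermediate rounded to FP16; the helper names (to_fp16, rms_norm_fp16, rms_norm_ref) and the eps value are illustrative, and this is not the actual CUDA kernel, which may accumulate differently.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value; overflow becomes +/-inf."""
    if math.isinf(x) or math.isnan(x):
        return x
    try:
        # The 'e' struct format is IEEE 754 half precision.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def rms_norm_fp16(row, weight, eps=1e-6):
    """RMS normalization with every intermediate rounded to FP16 (assumed behavior)."""
    squares = [to_fp16(v * v) for v in row]            # 5416^2 overflows to inf
    mean_sq = to_fp16(sum(squares) / len(squares))     # mean of inf is inf
    denom = to_fp16(math.sqrt(mean_sq + eps))          # sqrt(inf) is inf
    inv = to_fp16(1.0 / denom)                         # 1 / inf is 0.0
    return [to_fp16(to_fp16(v * inv) * w) for v, w in zip(row, weight)]


def rms_norm_ref(row, weight, eps=1e-6):
    """Same formula in Python's native double precision, for comparison."""
    mean_sq = sum(v * v for v in row) / len(row)
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv * w for v, w in zip(row, weight)]


# Values copied from the visible columns of the first dumped row and the weight tensor.
row = [-23.15625, 106.1875, -46.09375, -136.0, -44.75, 5416.0]
weight = [0.22058105, 0.18444824, 0.1887207, 0.17089844, 0.18896484, 0.098571777]

print("FP16 path:     ", rms_norm_fp16(row, weight))
print("reference path:", rms_norm_ref(row, weight))
```

In this sketch the FP16 path returns signed zeros whose sign follows the input, matching the -0 / 0 pattern in the dump, while the higher-precision path returns non-zero values.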
SimplifiedLayerNormalization
Is this an actual onnx op? Or some cuda kernel that results from fusion? I can't find this op in my graph or in https://github.com/onnx/onnx/blob/main/docs/Operators.md.
Following @wangyems 's advice, I was able to convert to fp16 and run inference with CUDA EP using the following op_block_list:
FP16_BAD_OPS = [
"Add",
"MatMul",
"Mul",
"Pow",
"ReduceMean",
"Sqrt",
]
Removing any of these ops from the list results in a nan or all-zero output (uploaded a new model with these ops blocked to the google drive). However, I'm still getting all zeros from the TRT EP even with these ops blocked.
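For readers who want to reproduce this conversion, here is a hedged sketch of how such a block list can be passed to onnxruntime.transformers.float16.convert_float_to_float16, the function named in the issue description. The wrapper name convert_to_mixed_precision, the file paths in the usage note, and the choice of keep_io_types=True are assumptions made for illustration, not details taken from the thread.

```python
# Op types kept in FP32; this is the list reported in the comment above.
FP16_BAD_OPS = ["Add", "MatMul", "Mul", "Pow", "ReduceMean", "Sqrt"]


def convert_to_mixed_precision(fp32_path, fp16_path, op_block_list=FP16_BAD_OPS):
    """Convert an FP32 ONNX file to FP16 while leaving blocked op types in FP32."""
    # Imported lazily so this module loads even where onnx/onnxruntime are absent.
    import onnx
    from onnxruntime.transformers.float16 import convert_float_to_float16

    model = onnx.load(fp32_path)
    fp16_model = convert_float_to_float16(
        model,
        keep_io_types=True,                 # assumption: keep graph inputs/outputs in FP32
        op_block_list=list(op_block_list),  # nodes of these op types are not converted
    )
    onnx.save(fp16_model, fp16_path)
    return fp16_path
```

A call such as convert_to_mixed_precision("t5_encoder_fp32.onnx", "t5_encoder_mixed.onnx") (hypothetical file names) would write the mixed-precision model. Blocking this many op types keeps much of the graph in FP32, so the speedup over a pure FP32 model may be limited.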
The op comes from fusion. You need to run fusion before converting to fp16.
BTW, we have scripts that can help export T5 to fp16 or use it in beam search: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/t5/convert_to_onnx.py https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/convert_generation.py
For example,
python -m onnxruntime.transformers.models.t5.convert_to_onnx -m t5-small -o -p fp16 --use_gpu --separate_encoder_and_decoder_init
This is the op_block_list we used: https://github.com/microsoft/onnxruntime/blob/abdd4f518a144035fee3b369996d8416a024bdaa/onnxruntime/python/tools/transformers/models/t5/t5_helper.py#L153-L157
Thanks @tianleiwu ! Will definitely take a look. Do you have any clue about what might be wrong with the TRT EP?
@omera-deci, for TRT you need to use raw FP32 ONNX models. TRT will convert them to fp16 internally.
@tianleiwu I just tried giving the TRT EP the fp32 model. If I don't enable fp16 everything works smoothly, but once I enable fp16 the output is all zeros again. I've uploaded the fp32 model to the drive as well as a new script to reproduce. I guess some layers are overflowing in TRT as well. Is there any way I can block their conversion the same way I did with ONNX?
@omera-deci, you can follow https://github.com/NVIDIA/TensorRT/blob/release/8.6/demo/HuggingFace/T5 to export ONNX for T5 and run it in the TRT EP. I did not see any special settings there, so the ONNX export might be the key. You can run those scripts and use the resulting ONNX models with the TRT EP.
You will need to build from source to support TRT 8.6 and to use some new features (like trt_layer_norm_fp32_fallback and explicit input profiles). See the following doc for details: https://github.com/microsoft/onnxruntime/blob/fd080caf62db1b41463955286c49d6a582c6a45a/docs/execution-providers/TensorRT-ExecutionProvider.md @chilo-ms for comments on fp16 in the TRT EP.
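As a rough illustration of the options mentioned above, the sketch below builds a provider list that enables FP16 in the TRT EP together with trt_layer_norm_fp32_fallback. The function names build_trt_providers and create_trt_session are hypothetical, and whether the fallback option is available depends on the ONNX Runtime build; per the comment above, a build from source was needed at the time.

```python
def build_trt_providers(fp16=True, layer_norm_fp32_fallback=True):
    """Return an ordered provider list: TensorRT first, then CUDA, then CPU."""
    trt_options = {
        "trt_fp16_enable": fp16,
        # Asks the TRT EP to keep layer-norm computation in FP32; only
        # recognized by ONNX Runtime builds that include this option.
        "trt_layer_norm_fp32_fallback": layer_norm_fp32_fallback,
    }
    return [
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]


def create_trt_session(fp32_model_path):
    """Create a session on the FP32 model; TRT handles the FP16 conversion itself."""
    # Imported lazily so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    return ort.InferenceSession(fp32_model_path, providers=build_trt_providers())
```

Passing the FP32 model follows the advice earlier in the thread: the TRT EP converts to FP16 internally, so the input model should not already be converted.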
Hello, I converted an FP32 model to FP16, and when running ONNX inference we hit a problem similar to yours, but I don't know what FP16_BAD_OPS is or where it is defined. Looking forward to your reply.
@changdong1687, see example script: https://github.com/microsoft/onnxruntime/blob/2580d935cbecd756cef435fb173a2f10237e9d2a/onnxruntime/python/tools/transformers/models/t5/t5_helper.py#L152-L217 You can define your own list of op_block_list for a model.
Ok, got it, thank you!
Describe the issue
I have an onnx model (a t5 encoder that I exported from pytorch and then converted to FP16 using onnxruntime.transformers.float16.convert_float_to_float16). When I use this model in an inference session that uses the CPU EP it works flawlessly, but running the same model in a session that uses the CUDA EP returns all NaNs as output. Edit: I tried the TRT EP and it fails as well (returns all zeros). I'm aware of https://github.com/microsoft/onnxruntime/issues/9629, https://github.com/microsoft/onnxruntime/issues/831 and https://github.com/microsoft/onnxruntime/issues/11384, but they all seem either very model-specific or return NaNs on the CPU EP as well, which is not my case.
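One way to compare execution providers on the same model is to run it once per provider and summarize the first output. The sketch below is only an illustration of that check: summarize and run_with_provider are hypothetical helper names, and the feeds dictionary (input name to NumPy array) has to match the actual model.

```python
import math


def summarize(values):
    """Count NaN, inf and zero entries and report the finite min/max."""
    values = list(values)
    finite = [v for v in values if math.isfinite(v)]
    return {
        "count": len(values),
        "nan": sum(1 for v in values if math.isnan(v)),
        "inf": sum(1 for v in values if math.isinf(v)),
        "zero": sum(1 for v in values if v == 0.0),
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
    }


def run_with_provider(model_path, feeds, provider):
    """Run the model on a single execution provider and summarize output 0."""
    # Imported lazily so summarize() can be used without onnxruntime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=[provider])
    first_output = session.run(None, feeds)[0]
    return summarize(float(v) for v in first_output.ravel().tolist())
```

Calling run_with_provider once with "CPUExecutionProvider" and once with "CUDAExecutionProvider" on the same feeds makes the difference described above visible as a non-zero "nan" or "zero" count for one provider only.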
To reproduce
I wrote this small snippet to reproduce (I hope the issue is not my reliance on the nvidia pip libraries). The onnx model can be downloaded from here: https://drive.google.com/drive/folders/1AMNI_cRYn31owMstIvdsW4IcOcRAYvC_?usp=share_link
I'm using cuda 11.7 on Ubuntu 22.04.
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.14.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU, CUDA
Execution Provider Library Version
CUDA 11.7 TRT 8.5.3.1