microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[E:onnxruntime:, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running LayerNormalization node. #21012

Closed Jose17-ml closed 3 days ago

Jose17-ml commented 2 months ago

Describe the feature request

Hi Experts,

I recently started working with AI/ML. I am currently trying to run a Hugging Face Optimum model on the GPU using the DirectML execution provider (DML-EP).

Platform: Windows 11

Model: https://huggingface.co/optimum/m2m100_418M

Changes:

import onnxruntime
from optimum.onnxruntime import ORTModelForSeq2SeqLM

session_opt = onnxruntime.SessionOptions()
session_opt.log_severity_level = 0  # 0 = verbose logging

provider = "CPUExecutionProvider"
provider = "DmlExecutionProvider"  # overrides the line above
NUM_ITERATIONS = 1

model_name = "optimum/m2m100_418M"

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"

model = ORTModelForSeq2SeqLM.from_pretrained(model_name, provider=provider, session_options=session_opt)

When I use "DmlExecutionProvider", I see the error below:

2024-06-12 14:35:21.2694023 [E:onnxruntime:, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running LayerNormalization node. Name:'/model/decoder/layer_norm/Mul/LayerNormFusion/' Status Message: C:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2468)\onnxruntime_pybind11_state.pyd!00007FFA9B5A09BF: (caller: 00007FFA9B5A2174) Exception(3) tid(1ff4) 887A0005 The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action.

With "CPUExecutionProvider", however, I don't see any issue and can run the model successfully.
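For triage, the key fields of the DML error line above (op type, node name, HRESULT) can be pulled out with a short script. This is only a rough sketch: the function name `parse_ort_error` and the regular expression are illustrative assumptions based on this one log line, not an onnxruntime API, and only a few DXGI device-loss codes are mapped to names.

```python
import re

# Assumption: only these DXGI device-loss codes are mapped; anything else is "unknown".
DXGI_ERRORS = {
    0x887A0005: "DXGI_ERROR_DEVICE_REMOVED",
    0x887A0006: "DXGI_ERROR_DEVICE_HUNG",
    0x887A0007: "DXGI_ERROR_DEVICE_RESET",
}

# Pattern inferred from the single log line in this issue (hypothetical, not a stable format).
LOG_PATTERN = re.compile(
    r"while running (?P<op>\w+) node\. Name:'(?P<name>[^']*)'"
    r".*?tid\(\w+\) (?P<hresult>[0-9A-Fa-f]{8}) (?P<message>.*)"
)


def parse_ort_error(line):
    """Return op type, node name and HRESULT from an ORT error line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group("hresult"), 16)
    return {
        "op_type": match.group("op"),
        "node_name": match.group("name"),
        "hresult": code,
        "hresult_name": DXGI_ERRORS.get(code, "unknown"),
        "message": match.group("message").strip(),
    }
```

For the line above, this reports a LayerNormalization node with HRESULT 887A0005, which is DXGI_ERROR_DEVICE_REMOVED. That suggests the GPU device was lost while the kernel ran, rather than the model being rejected when it was loaded.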

So, I need your help to resolve this issue and run with DML-EP.

Thanks

Describe scenario use case

Trying to run a Hugging Face Optimum model with DML-EP.

Jose17-ml commented 2 months ago

Hi Experts,

Need your inputs.

Jose17-ml commented 2 months ago

Hi,

Any inputs?

zhangxiang1993 commented 3 weeks ago

Hi Josh, can you provide information about your GPU device by sending a screenshot of the Task Manager Performance tab? This model is not in our original list of supported models. It's very likely that the LayerNorm op used in this model is slightly different from what DML-EP supports. It would be helpful if you could provide more details about how the LayerNorm op is used in this model.
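One possible way to collect those LayerNorm details is to list every LayerNormalization node in the graph with its inputs and attributes. The sketch below is a minimal example, not an official diagnostic: the helper names `find_layernorm_nodes` and `report` are hypothetical, and it does not descend into subgraphs. Because the failing node name ends in LayerNormFusion, the node was probably created by onnxruntime's graph optimizer, so it may only show up in the optimized graph (which can be saved by setting `optimized_model_filepath` on the `SessionOptions`) and not in the exported file.

```python
from collections import namedtuple

LayerNormInfo = namedtuple("LayerNormInfo", "name inputs attributes")


def find_layernorm_nodes(graph, attr_value=lambda attr: attr):
    """Collect top-level LayerNormalization nodes from a graph-like object.

    `graph` only needs a `.node` sequence whose items expose `op_type`, `name`,
    `input` and `attribute`, which matches onnx.GraphProto.
    """
    found = []
    for node in graph.node:
        if node.op_type != "LayerNormalization":
            continue
        attrs = {attr.name: attr_value(attr) for attr in node.attribute}
        found.append(LayerNormInfo(node.name, list(node.input), attrs))
    return found


def report(model_path):
    """Print LayerNormalization details for an ONNX file (hypothetical helper)."""
    # Imported lazily so find_layernorm_nodes works without onnx installed.
    import onnx
    from onnx import helper

    model = onnx.load(model_path)
    for info in find_layernorm_nodes(model.graph, helper.get_attribute_value):
        print(info.name, info.inputs, info.attributes)
```

Calling `report` on the optimized decoder model and pasting the output here (typically the `axis` and `epsilon` attributes and the input names) should cover the op details being asked for.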

zhangxiang1993 commented 3 days ago

Closing issue due to low activity. Feel free to reopen it with more information about the op.