microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

DirectML error: The parameter is incorrect with KBNet S #21583

Open marovira opened 3 months ago

marovira commented 3 months ago

Describe the issue

When trying to run KBNet-S (see here) using ONNXRuntime with DirectML, an error occurs during the creation of the session that reads:

2024-07-31 17:26:51.9481346 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_deb1d9a98dc3fb814563870e4f4b9f20>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntime_pybind11_state.pyd!00007FF9E88E1B92: (caller: 00007FF9E88485ED) Exception(3) tid(25d60) 80070057 The parameter is incorrect.

This issue does not appear when trying to run using the CPU execution provider.
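For readers unfamiliar with the number in the log: 80070057 is a Windows HRESULT. Below is a minimal sketch of how such a value breaks down into its fields. The helper name `decode_hresult` is made up for illustration and is not part of ONNX Runtime or DirectML.

```python
def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into its severity, facility and code fields."""
    value &= 0xFFFFFFFF  # treat the input as an unsigned 32-bit number
    return {
        "failure": bool(value >> 31),       # top bit set means failure
        "facility": (value >> 16) & 0x7FF,  # 11-bit facility field
        "code": value & 0xFFFF,             # low 16 bits: the error code
    }


print(decode_hresult(0x80070057))
# {'failure': True, 'facility': 7, 'code': 87}
```

Facility 7 is FACILITY_WIN32 and code 87 is ERROR_INVALID_PARAMETER, so the value is E_INVALIDARG. It only says that some argument was rejected, not which one, which is why the message alone does not identify the cause.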

To reproduce

Create a new virtual environment and run:

pip install torch onnx onnxruntime-directml

Then copy the following files from the KBNet repo:

Open kbnet_s_arch.py and modify the imports as follows:

# Replace this:
from basicsr.models.archs.kb_utils import KBAFunction
from basicsr.models.archs.kb_utils import LayerNorm2d, SimpleGate

# With this:
from kb_utils import KBAFunction, LayerNorm2d, SimpleGate

Next, download sidd.pth into the same directory from here

Finally, create a new file called test.py with the following code:

from kbnet_s_arch import KBNet_s
import torch
import onnx
import onnxruntime

net = KBNet_s(lightweight=True, ffn_scale=1.5).cpu()
state = torch.load("sidd.pth", weights_only=True)
net.load_state_dict(state["model"])
net.eval()
x = torch.randn((1, 3, 128, 128))
with torch.no_grad():
    out = net(x)

torch.onnx.export(net, x, "sidd.onnx", export_params=True, do_constant_folding=True,
                  input_names=["input"],
                  output_names=["output"])

onnx_model = onnx.load("sidd.onnx")
onnx.checker.check_model(onnx_model) # Note that this passes, so the exported ONNX file is correct.

ort_session = onnxruntime.InferenceSession("sidd.onnx", providers=["DmlExecutionProvider"]) # <- This line fails!

Run with python3 test.py and see that it fails when trying to create the session. If the providers are instead set to providers=["CPUExecutionProvider"] or providers=["CPUExecutionProvider", "DmlExecutionProvider"], the session is created correctly.
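The CPU workaround described here could be wrapped in a small fallback helper. This is only a sketch: `create_with_fallback` and its `make_session` argument are hypothetical names, and the session factory is passed in so the logic can be tried without a GPU.

```python
def create_with_fallback(make_session, provider_lists):
    """Try each provider list in order and return (session, providers)
    for the first one whose session creation does not raise."""
    errors = []
    for providers in provider_lists:
        try:
            return make_session(providers), providers
        except Exception as exc:  # ONNX Runtime errors derive from Exception
            errors.append((providers, exc))
    raise RuntimeError(f"no provider list worked: {errors}")


# Usage with ONNX Runtime (assumes onnxruntime-directml is installed):
# import onnxruntime
# session, used = create_with_fallback(
#     lambda p: onnxruntime.InferenceSession("sidd.onnx", providers=p),
#     [["DmlExecutionProvider"], ["CPUExecutionProvider"]],
# )
# print(used)
```

Each provider name is a separate string in the list, and the list is ordered by priority.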

Urgency

Medium urgency. This is blocking a research task I'm currently working on and the deadline is coming up fast. I can work around the issue for now by using the CPU, but I need the GPU for performance reasons.

Platform

Windows

OS Version

Windows 11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

DirectML 1.14.1.0

marovira commented 3 months ago

Additional Info

The issue can also be reproduced using C++ with ONNXRuntime and the DirectML provider. Through debugging, I've discovered that the issue is coming from the graph fusion system. Specifically, when it attempts to process /encoders.0/encoders.0.0/MatMul_2, an exception is thrown when trying to create a DML_OPERATOR_GEMM. However, I am unable to determine why the parameters are incorrect.

fdwr commented 3 months ago

the issue is coming from the graph fusion system

I wonder if specifying a lower GraphOptimizationLevel, such as ORT_ENABLE_BASIC, would mitigate the issue until it can be investigated?
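For reference, a sketch of how the optimization level can be set from the Python API. The `LEVELS` mapping and the `create_session` helper are illustrative names. `SessionOptions.graph_optimization_level` and the `GraphOptimizationLevel` members are the ONNX Runtime API; the import is done lazily so the mapping can be inspected without the package installed.

```python
# Short aliases for the members of onnxruntime.GraphOptimizationLevel.
LEVELS = {
    "none": "ORT_DISABLE_ALL",
    "basic": "ORT_ENABLE_BASIC",
    "extended": "ORT_ENABLE_EXTENDED",
    "all": "ORT_ENABLE_ALL",
}


def create_session(model_path, level="basic", providers=("DmlExecutionProvider",)):
    """Create an InferenceSession with the requested graph optimization level."""
    import onnxruntime as ort  # lazy import; needs onnxruntime-directml for DML

    options = ort.SessionOptions()
    options.graph_optimization_level = getattr(ort.GraphOptimizationLevel, LEVELS[level])
    return ort.InferenceSession(model_path, sess_options=options, providers=list(providers))


# Usage:
# session = create_session("sidd.onnx", level="basic")
```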

marovira commented 3 months ago

I wonder if specifying a lower GraphOptimizationLevel, such as ORT_ENABLE_BASIC, would mitigate the issue until it can be investigated?

I've just confirmed that setting the optimisation level to ORT_ENABLE_BASIC doesn't remove the error message:

2024-07-31 20:34:15.4335912 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_deb1d9a98dc3fb814563870e4f4b9f20>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntime_pybind11_state.pyd!00007FF9979F1B92: (caller: 00007FF9979585ED) Exception(3) tid(1ca60) 80070057 The parameter is incorrect.

Interestingly, if I change the optimisation level to ORT_DISABLE_ALL, I get this output instead:

2024-07-31 20:36:05.2756227 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-07-31 20:36:05.2793056 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-07-31 20:36:07.4558489 [E:onnxruntime:, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running MatMul node. Name:'/encoders.0/encoders.0.0/MatMul_2' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2468)\onnxruntime_pybind11_state.pyd!00007FF99799F21F: (caller: 00007FF9979A06DA) Exception(3) tid(259d4) 80070057 The parameter is incorrect.

That error message shows the node in which the exception is being thrown, which I mentioned in my previous message.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

marovira commented 2 months ago

@fdwr is there any more information I can provide that would help with diagnosing/fixing this?

fdwr commented 2 months ago

@fdwr is there any more information I can provide that would help with diagnosing/fixing this?

Are there any DirectML debug layer messages, if you enable it?

https://github.com/microsoft/onnxruntime/issues/13330#issuecomment-1284817983
https://github.com/microsoft/onnxruntime/issues/15255#issuecomment-1487703350

marovira commented 2 months ago

Here's the output with the debug layer enabled:

D3D12 ERROR: An invalid dimension count of 5 was specified in tensor 'A' which is not between 2 and 4. [ UNKNOWN ERROR #1: STRING_FROM_APPLICATION]
C:\__w\1\s\SharedValidation/TensorValidator.h(753)\DirectML.Debug.dll!00007FFB58A81BAE: (caller: 00007FFB58A83B85) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE1936B0.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
C:\__w\1\s\Debug\Product\DmlDeviceDebug.cpp(80)\DirectML.Debug.dll!00007FFB58B9EB2E: (caller: 00007FFB5CD2719F) ReturnHr(1) tid(5f48c) 80070057 The parameter is incorrect.
    Msg:[C:\__w\1\s\SharedValidation/TensorValidator.h(753)\DirectML.Debug.dll!00007FFB58A81BAE: (caller: 00007FFB58A83B85) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
] 
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperator.cpp(46)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CCB99C9) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::DmlOperator::SetDmlOperatorDesc(m_dmlDevice->CreateOperator(&operatorDesc, __uuidof(**(&dmlOperator)), IID_PPV_ARGS_Helper(&dmlOperator)))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE193B70.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
E:\github\sdk\onnxruntime\onnxruntime\core/providers/dml/OperatorAuthorHelper/MLOperatorAuthorHelper.h(965)\onnxruntimed.dll!00007FFB5E9E1ABE: (caller: 00007FFB5CCB89BE) ReturnHr(1) tid(5f48c) 80070057 The parameter is incorrect.
    Msg:[E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperator.cpp(46)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CCB99C9) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::DmlOperator::SetDmlOperatorDesc(m_dmlDevice->CreateOperator(&operatorDesc, __uuidof(**(&dmlOperator)), IID_PPV_ARGS_Helper(&dmlOperator)))]
] [MLOperatorKernel<class Dml::DmlOperatorMatMul>::CreateInstance]
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::CreateMatMul(MLOperatorKernel<T>::CreateInstance(*kernelInfo, opKernel))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE194F90.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
E:\github\sdk\onnxruntime\onnxruntime\core/providers/dml/OperatorAuthorHelper/MLOperatorAuthorHelper.h(1081)\onnxruntimed.dll!00007FFB5E98793B: (caller: 00007FFB5CB6070C) ReturnHr(2) tid(5f48c) 80070057 The parameter is incorrect.
    Msg:[E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.
    [Dml::CreateMatMul(MLOperatorKernel<T>::CreateInstance(*kernelInfo, opKernel))]
] [MLOperatorKernelFactory::CreateKernel]
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CB6D93F) Exception(3) tid(5f48c) 80070057 The parameter is incorrect.
    [Windows::AI::MachineLearning::Adapter::AbiCustomRegistry::RegisterOperatorKernel::<lambda_4e824286f9d658116fa9a3df675eaad5>::operator ()(kernelFactoryCapture->CreateKernel(kernelInfoWrapper.Get(), kernel.GetAddressOf()))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE195020.

fdwr commented 2 months ago

Oooh, so there's a 5D matmul in this model then?

D3D12 ERROR: An invalid dimension count of 5 was specified in tensor 'A' which is not between 2 and 4.
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.
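One way to check for such a node without the debug layer would be to run ONNX shape inference and list the input ranks of every MatMul. The sketch below assumes the `onnx` package is installed and that shape inference can resolve the shapes. Both function names are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def inputs_over_rank(shapes: Dict[str, Sequence], max_rank: int = 4) -> List[Tuple[str, int]]:
    """Return (tensor_name, rank) for every tensor whose rank exceeds max_rank."""
    return [(name, len(shape)) for name, shape in shapes.items() if len(shape) > max_rank]


def matmul_input_shapes(model_path: str) -> Dict[str, Dict[str, list]]:
    """Map each MatMul node name to the inferred shapes of its inputs."""
    import onnx  # lazy import so inputs_over_rank works without onnx
    from onnx import shape_inference

    graph = shape_inference.infer_shapes(onnx.load(model_path)).graph
    known = {}
    for info in list(graph.input) + list(graph.value_info) + list(graph.output):
        dims = info.type.tensor_type.shape.dim
        # Symbolic dimensions are kept as their parameter name.
        known[info.name] = [d.dim_value if d.HasField("dim_value") else d.dim_param for d in dims]
    for init in graph.initializer:
        known[init.name] = list(init.dims)
    return {
        node.name: {name: known[name] for name in node.input if name in known}
        for node in graph.node
        if node.op_type == "MatMul"
    }


# Usage:
# for node_name, shapes in matmul_input_shapes("sidd.onnx").items():
#     too_big = inputs_over_rank(shapes)
#     if too_big:
#         print(node_name, too_big)
```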

Would you know if there's an upload on HuggingFace or elsewhere of the direct .onnx model?

I've not actually encountered a 5D matmul before in ONNX models, and DML_GEMM_OPERATOR_DESC currently accepts at most 4D tensors, so this requires either a DirectML.dll update or flattening the leading dimensions to 4D (if they are not broadcast) before calling DirectML.
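To illustrate the flattening idea in NumPy terms (this is not the DML EP code, and `matmul_via_flatten` is a made-up name): when both operands have identical leading dimensions, those dimensions can be collapsed into a single batch axis, multiplied, and restored afterwards.

```python
import numpy as np


def matmul_via_flatten(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute a @ b for inputs of rank > 4 using a 3D batched matmul."""
    if a.ndim <= 4 and b.ndim <= 4:
        return a @ b  # already within the 2D to 4D range
    # Flattening is only valid when the leading dimensions match exactly,
    # i.e. nothing is broadcast between the two operands.
    if a.shape[:-2] != b.shape[:-2]:
        raise ValueError("leading dimensions are broadcast; flattening does not apply")
    batch = a.shape[:-2]
    a3 = a.reshape(-1, *a.shape[-2:])  # (batch_product, M, K)
    b3 = b.reshape(-1, *b.shape[-2:])  # (batch_product, K, N)
    out = a3 @ b3                      # (batch_product, M, N)
    return out.reshape(*batch, a.shape[-2], b.shape[-1])


rng = np.random.default_rng(0)
a = rng.standard_normal((2, 3, 4, 5, 6))
b = rng.standard_normal((2, 3, 4, 6, 7))
print(np.allclose(matmul_via_flatten(a, b), a @ b))
```

The broadcast case is deliberately rejected here, since the reshape is only equivalent when the leading dimensions line up one to one.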

marovira commented 2 months ago

Would you know if there's an upload on HuggingFace or elsewhere of the direct .onnx model?

No, the authors only provide the PTH files. I'll see if I can post it somewhere so you can download the ONNX file.

Edit: I've uploaded the ONNX file to Google Drive

fdwr commented 2 months ago

Edit: I've uploaded the ONNX file to Google Drive

image

marovira commented 2 months ago

Let me know if there's anything else I can do to help.

fdwr commented 2 months ago

Reduced to minimal repro, a single operator .onnx file: minimal-repro.zip

image

Opened bug for DirectML.dll. I'll see what the response is, but in the meantime, we should probably attempt to flatten the leading dimensions when >4D here https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/dml/DmlExecutionProvider/src/Operators/DmlOperatorGemm.cpp#L36. Cheers TotK fan.

john-dance commented 1 month ago

See also #21875. This same issue happens with the following models on a Windows on ARM machine:
https://aihub.qualcomm.com/mobile/models/ffnet_54s
https://aihub.qualcomm.com/models/esrgan
https://aihub.qualcomm.com/models/whisper_tiny_en
https://aihub.qualcomm.com/models/mediapipe_hand (MediaPipeHandLandmarkDetector)

john-dance commented 1 month ago

Note: This seems to have been fixed after upgrading to DirectML.dll 1.15.1. I have verified that the failures I reported above now all work.

marovira commented 2 weeks ago

@fdwr out of interest: is there a way to track the bug that you opened for DirectML? Related to this: you mentioned that ONNXRuntime should attempt to flatten the leading dimensions. Is this being looked at somewhere?

fdwr commented 2 weeks ago

@fdwr out of interest: is there a way to track the bug that you opened for DirectML? Related to this: you mentioned that ONNXRuntime should attempt to flatten the leading dimensions. Is this being looked at somewhere?

@marovira: It's internal, but I can confirm that a teammate is working on it and looking at the change now. So there shouldn't be a need to update the ORT DML EP once DML directly supports it.

marovira commented 2 weeks ago

@fdwr out of interest: is there a way to track the bug that you opened for DirectML? Related to this: you mentioned that ONNXRuntime should attempt to flatten the leading dimensions. Is this being looked at somewhere?

@marovira: It's internal, but I can confirm that a teammate is working on it and looking at the change now. So there shouldn't be a need to update the ORT DML EP once DML directly supports it.

@fdwr That's great! Thanks for letting me know. Looking forward to when the fix is available.