Open marovira opened 3 months ago
The issue can also be reproduced using C++ with ONNXRuntime and the DirectML provider. Through debugging, I've discovered that the issue is coming from the graph fusion system. Specifically, when it attempts to process /encoders.0/encoders.0.0/MatMul_2, an exception is thrown when trying to create a DML_OPERATOR_GEMM. However, I am unable to determine why the parameters are incorrect.
the issue is coming from the graph fusion system
I wonder if specifying a lower GraphOptimizationLevel, like ORT_ENABLE_BASIC, would mitigate the issue until it can be investigated?
I've just confirmed that setting the optimisation level to ORT_ENABLE_BASIC doesn't remove the error message:
2024-07-31 20:34:15.4335912 [E:onnxruntime:, inference_session.cc:2045 onnxruntime::InferenceSession::Initialize::<lambda_deb1d9a98dc3fb814563870e4f4b9f20>::operator ()] Exception during initialization: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntime_pybind11_state.pyd!00007FF9979F1B92: (caller: 00007FF9979585ED) Exception(3) tid(1ca60) 80070057 The parameter is incorrect.
Interestingly, if I change the optimisation level to ORT_DISABLE_ALL, I get this output instead:
2024-07-31 20:36:05.2756227 [W:onnxruntime:, session_state.cc:1166 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-07-31 20:36:05.2793056 [W:onnxruntime:, session_state.cc:1168 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-07-31 20:36:07.4558489 [E:onnxruntime:, sequential_executor.cc:516 onnxruntime::ExecuteKernel] Non-zero status code returned while running MatMul node. Name:'/encoders.0/encoders.0.0/MatMul_2' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2468)\onnxruntime_pybind11_state.pyd!00007FF99799F21F: (caller: 00007FF9979A06DA) Exception(3) tid(259d4) 80070057 The parameter is incorrect.
That error message shows the node in which the exception is being thrown, which I mentioned in my previous message.
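For readers unfamiliar with the code in these logs: 80070057 is a Windows HRESULT, and it can be split into its fields to see what it encodes. The sketch below assumes the standard HRESULT layout (severity in bit 31, facility in bits 16 to 26, code in the low 16 bits); the helper name `decode_hresult` and the small lookup table are illustrative, not part of ONNX Runtime or DirectML.

```python
# Illustrative helper (not an ONNX Runtime API): split an HRESULT into its fields.
FACILITY_WIN32 = 7

# Only the Win32 code seen in this thread is listed; the table is deliberately tiny.
KNOWN_WIN32_CODES = {0x57: "ERROR_INVALID_PARAMETER"}


def decode_hresult(text):
    """Decode a hex HRESULT string such as '80070057' as printed in the logs."""
    value = int(text, 16) & 0xFFFFFFFF
    facility = (value >> 16) & 0x7FF   # bits 16..26
    code = value & 0xFFFF              # bits 0..15
    name = KNOWN_WIN32_CODES.get(code) if facility == FACILITY_WIN32 else None
    return {
        "failed": bool(value >> 31),   # severity bit
        "facility": facility,
        "code": code,
        "name": name,
    }


print(decode_hresult("80070057"))
# {'failed': True, 'facility': 7, 'code': 87, 'name': 'ERROR_INVALID_PARAMETER'}
```

In other words, the value is E_INVALIDARG: a generic "invalid argument" failure that does not by itself say which parameter was rejected.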
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
@fdwr is there any more information I can provide that would help with diagnosing/fixing this?
Are there any DirectML debug layer messages, if you enable it?
https://github.com/microsoft/onnxruntime/issues/13330#issuecomment-1284817983 https://github.com/microsoft/onnxruntime/issues/15255#issuecomment-1487703350
Here's the output with the debug layer enabled:
D3D12 ERROR: An invalid dimension count of 5 was specified in tensor 'A' which is not between 2 and 4. [ UNKNOWN ERROR #1: STRING_FROM_APPLICATION]
C:\__w\1\s\SharedValidation/TensorValidator.h(753)\DirectML.Debug.dll!00007FFB58A81BAE: (caller: 00007FFB58A83B85) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE1936B0.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
C:\__w\1\s\Debug\Product\DmlDeviceDebug.cpp(80)\DirectML.Debug.dll!00007FFB58B9EB2E: (caller: 00007FFB5CD2719F) ReturnHr(1) tid(5f48c) 80070057 The parameter is incorrect.
Msg:[C:\__w\1\s\SharedValidation/TensorValidator.h(753)\DirectML.Debug.dll!00007FFB58A81BAE: (caller: 00007FFB58A83B85) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
]
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperator.cpp(46)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CCB99C9) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
[Dml::DmlOperator::SetDmlOperatorDesc(m_dmlDevice->CreateOperator(&operatorDesc, __uuidof(**(&dmlOperator)), IID_PPV_ARGS_Helper(&dmlOperator)))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE193B70.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
E:\github\sdk\onnxruntime\onnxruntime\core/providers/dml/OperatorAuthorHelper/MLOperatorAuthorHelper.h(965)\onnxruntimed.dll!00007FFB5E9E1ABE: (caller: 00007FFB5CCB89BE) ReturnHr(1) tid(5f48c) 80070057 The parameter is incorrect.
Msg:[E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperator.cpp(46)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CCB99C9) Exception(1) tid(5f48c) 80070057 The parameter is incorrect.
[Dml::DmlOperator::SetDmlOperatorDesc(m_dmlDevice->CreateOperator(&operatorDesc, __uuidof(**(&dmlOperator)), IID_PPV_ARGS_Helper(&dmlOperator)))]
] [MLOperatorKernel<class Dml::DmlOperatorMatMul>::CreateInstance]
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.
[Dml::CreateMatMul(MLOperatorKernel<T>::CreateInstance(*kernelInfo, opKernel))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE194F90.
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: [rethrow] at memory location 0x0000000000000000.
E:\github\sdk\onnxruntime\onnxruntime\core/providers/dml/OperatorAuthorHelper/MLOperatorAuthorHelper.h(1081)\onnxruntimed.dll!00007FFB5E98793B: (caller: 00007FFB5CB6070C) ReturnHr(2) tid(5f48c) 80070057 The parameter is incorrect.
Msg:[E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.
[Dml::CreateMatMul(MLOperatorKernel<T>::CreateInstance(*kernelInfo, opKernel))]
] [MLOperatorKernelFactory::CreateKernel]
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\AbiCustomRegistry.cpp(519)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CB6D93F) Exception(3) tid(5f48c) 80070057 The parameter is incorrect.
[Windows::AI::MachineLearning::Adapter::AbiCustomRegistry::RegisterOperatorKernel::<lambda_4e824286f9d658116fa9a3df675eaad5>::operator ()(kernelFactoryCapture->CreateKernel(kernelInfoWrapper.Get(), kernel.GetAddressOf()))]
Exception thrown at 0x00007FFC80D5FABC in denoise.exe: Microsoft C++ exception: wil::ResultException at memory location 0x0000004AEE195020.
Oooh, so there's a 5D matmul in this model then?
D3D12 ERROR: An invalid dimension count of 5 was specified in tensor 'A' which is not between 2 and 4.
E:\github\sdk\onnxruntime\onnxruntime\core\providers\dml\DmlExecutionProvider\src\Operators\DmlOperatorMatMul.cpp(58)\onnxruntimed.dll!00007FFB5C928719: (caller: 00007FFB5CA5B950) Exception(2) tid(5f48c) 80070057 The parameter is incorrect.
Would you know if there's an upload on HuggingFace or elsewhere of the direct .onnx model?
I've not actually encountered a 5D matmul before in ONNX models, and DML_GEMM_OPERATOR_DESC currently only accepts up to 4D, so this requires either DirectML.dll updates or flattening the leading dimensions to 4D (if they are not broadcasted) before calling DirectML.
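A minimal sketch of the flattening idea, working on shapes only. It assumes the batch (leading) dimensions of both inputs are identical, so no broadcasting is involved, and that tensors are stored contiguously so merging leading dimensions is a pure reshape. The function name `flatten_leading_dims` is hypothetical and this is not the code used in the DML execution provider.

```python
from math import prod

# Upper bound taken from the debug-layer message ("not between 2 and 4").
MAX_GEMM_RANK = 4


def flatten_leading_dims(a_shape, b_shape, max_rank=MAX_GEMM_RANK):
    """Merge leading batch dims so both MatMul inputs have at most max_rank dims.

    Returns (a_flat, b_flat, out_shape). out_shape is the shape the result
    should be reshaped back to, or None when no flattening was needed.
    """
    a_shape, b_shape = list(a_shape), list(b_shape)
    if len(a_shape) < 2 or len(b_shape) < 2:
        raise ValueError("both inputs need at least 2 dimensions")
    if a_shape[-1] != b_shape[-2]:
        raise ValueError("inner dimensions do not match")
    if max(len(a_shape), len(b_shape)) <= max_rank:
        return a_shape, b_shape, None
    if a_shape[:-2] != b_shape[:-2]:
        # Broadcast batch dims cannot be merged this simply.
        raise NotImplementedError("broadcasted batch dimensions are not handled")

    extra = len(a_shape) - max_rank  # how many dims to fold into the first one

    def collapse(shape):
        return [prod(shape[:extra + 1])] + shape[extra + 1:]

    out_shape = a_shape[:-1] + [b_shape[-1]]
    return collapse(a_shape), collapse(b_shape), out_shape


print(flatten_leading_dims([2, 3, 4, 5, 6], [2, 3, 4, 6, 7]))
# ([6, 4, 5, 6], [6, 4, 6, 7], [2, 3, 4, 5, 7])
```

The 4D product would then be reshaped back to the returned `out_shape`. Inputs whose batch dimensions differ are rejected rather than guessed at, since broadcasting changes which elements pair up.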
No, the authors only provide the PTH files. I'll see if I can post it somewhere so you can download the ONNX file.
Edit: I've uploaded the ONNX file to Google Drive
Let me know if there's anything else I can do to help.
Reduced to minimal repro, a single operator .onnx file: minimal-repro.zip
Opened bug for DirectML.dll. I'll see what the response is, but in the meantime, we should probably attempt to flatten the leading dimensions when >4D here https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/dml/DmlExecutionProvider/src/Operators/DmlOperatorGemm.cpp#L36. Cheers TotK fan.
See also #21875. This same issue happens with the following models on a Windows on ARM machine:
https://aihub.qualcomm.com/mobile/models/ffnet_54s
https://aihub.qualcomm.com/models/esrgan
https://aihub.qualcomm.com/models/whisper_tiny_en
https://aihub.qualcomm.com/models/mediapipe_hand (MediaPipeHandLandmarkDetector)
Note: This seems to have been fixed after upgrading to DirectML.dll 1.15.1. I have verified that the failures I reported above now all work.
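If it helps when triaging, a dotted version string can be compared against the one mentioned above. This is only a sketch: the helper names are made up, how you obtain the installed DirectML version is left out, and the 1.15.1 threshold is simply the version reported to work for the models listed in this comment.

```python
def parse_version(text):
    """Turn '1.14.1.0' into (1, 14, 1, 0)."""
    return tuple(int(part) for part in text.split("."))


def is_at_least(installed, required):
    """Compare dotted versions, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print(is_at_least("1.14.1.0", "1.15.1"))  # False
print(is_at_least("1.15.1", "1.15.1"))    # True
```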
@fdwr out of interest: is there a way to track the bug that you opened for DirectML? Related to this: you mentioned that ONNXRuntime should attempt to flatten the leading dimensions. Is this being looked at somewhere?
@marovira: It's internal, but I can confirm that a teammate is working on it and looking at the change now. So there shouldn't be a need to update the ORT DML EP once DML directly supports it.
@fdwr That's great! Thanks for letting me know. Looking forward to when the fix is available.
Describe the issue
When trying to run KBNet-S (see here) using ONNXRuntime with DirectML, an error occurs during the creation of the session that reads:
This issue does not appear when trying to run using the CPU execution provider.
To reproduce
Create a new virtual environment and run:
Then copy the following files from the KBNet repo:
Open kbnet_s_arch.py and modify the imports as follows:
Next, download sidd.pth into the same directory from here.
Finally, create a new file called test.py with the following code:
Run with python3 test.py and see that it fails when trying to create the session. If instead the providers are set to providers=["CPUExecutionProvider"] or providers=["CPUExecutionProvider", "DmlExecutionProvider"], the session is created correctly.
Urgency
Medium urgency. This is blocking a research task I'm currently working on and the deadline is coming up fast. I can work around the issue for now by using the CPU, but I need the GPU for performance reasons.
Platform
Windows
OS Version
Windows 11
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.18.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
DirectML 1.14.1.0