Open carsonswope opened 2 years ago
Model files attached here for a minimal reproduction. They are very simple models exported from pytorch w/ only a Conv2d operation. The only difference is that the 'no error' model does not specify a 'groups' argument, while the 'yes error' model does.
@fdwr for insights.
Hi @carsonswope, the sample model passes successfully on the following configuration:
- ORT: 1.12.0
- EP: DML
- D3D12 Debug Layer: ON
- GPU: AMD Radeon VII
I have also tried running the sample model while overriding the free dimensions height and width, with the same configuration as mentioned above, and it passes successfully as well.
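For anyone repeating this check, here is a minimal sketch of pinning the free dimensions through the ONNX Runtime Python API before creating a DirectML session. The dimension names height and width come from this thread; the model path and the sizes in the usage note are placeholders, and the two helper functions are hypothetical names, not part of ONNX Runtime.

```python
def apply_free_dimension_overrides(session_options, overrides):
    """Pin each named free dimension to a fixed size on the given options object."""
    for name, size in overrides.items():
        session_options.add_free_dimension_override_by_name(name, size)
    return session_options


def create_dml_session(model_path, overrides):
    """Create a session on the DirectML EP with the given dimension overrides."""
    # Imported here so the helper above stays usable without onnxruntime installed.
    import onnxruntime as ort

    options = ort.SessionOptions()
    # The DirectML EP expects memory pattern off and sequential execution.
    options.enable_mem_pattern = False
    options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
    apply_free_dimension_overrides(options, overrides)
    return ort.InferenceSession(
        model_path,
        sess_options=options,
        providers=["DmlExecutionProvider"],
    )
```

Usage would look like `create_dml_session("model.onnx", {"height": 480, "width": 640})`, where the path and sizes are assumptions for illustration. With every free dimension pinned, the graph is compiled at load time, which is when the exception reported in this issue appears.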
The error doesn't actually cause the program to fail when I run it, but it does cause a crash for another member of my team on slightly different hardware
Can you please provide us more details on what specific hardware the model is crashing on?
@carsonswope These first-chance exceptions inside Direct3D are ignorable. It's unhandled exceptions that are worrisome, and these are caught and handled by D3D itself (returning a bad HRESULT to DirectML). It would be nice to avoid these first-chance _com_error exceptions in D3D in the first place (they are distracting, noisy red herrings) if DML could know when, but DML can't know ahead of time every combination of operator parameters that a driver may or may not support :/.
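To illustrate the pattern described here, below is a rough Python sketch of "try the fast path, catch the failure internally, fall back to the default". Every name in it is hypothetical, and the rule that the fake driver rejects grouped convolutions is only an assumption for the example, not a statement about any real driver or about DirectML's actual code.

```python
S_OK = 0
E_INVALIDARG = 0x80070057  # a standard failing HRESULT value


class UnsupportedByDriver(Exception):
    """Hypothetical stand-in for the _com_error thrown inside D3D."""


def driver_create_metacommand(params):
    # Hypothetical driver rule: only ungrouped convolutions are supported.
    if params.get("groups", 1) != 1:
        raise UnsupportedByDriver("unsupported parameter combination")
    return "metacommand"


def try_create_metacommand(params):
    """Catch the exception at the boundary and return a status code instead.

    A debugger still reports the raise as a first-chance exception,
    even though it never escapes this function.
    """
    try:
        return S_OK, driver_create_metacommand(params)
    except UnsupportedByDriver:
        return E_INVALIDARG, None


def compile_convolution(params):
    """Prefer the metacommand; fall back to the default implementation."""
    status, kernel = try_create_metacommand(params)
    if status == S_OK:
        return kernel
    return "default_shader"
```

In this sketch the caller of `compile_convolution` never sees the exception. Only a debugger that breaks on first-chance exceptions would notice it.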
it does cause a crash for another member of my team on slightly different hardware.
Is the hardware above (nvidia geforce 1080 ti, 11GB) yours or your teammate's repro machine?
Appreciate you looking into this @sumitsays and @fdwr
The nvidia geforce 1080ti, 11GB is mine. My teammate's machine, where the exception is unhandled, has an nvidia quadro P6000, 24GB. Driver version 516.59.
If you are having trouble reproducing from the minimal example I provided above, maybe try this one? This is the full version of the model I'm working with, and it triggers a bunch of the _com_pointer exceptions.
Also, this is what the stack trace looks like for the crash, or at least as much of it as we're able to see:
KernelBase.dll!00007ff94fef4fd9 Unknown
ucrtbase.dll!00007ff94fbada1d Unknown
D3D12Core.dll!00007ff926dfe6b1 Unknown
D3D12Core.dll!00007ff926de1e4c Unknown
D3D12Core.dll!00007ff926dc4f1c Unknown
d3d12SDKLayers.dll!00007ff89e1c4439 Unknown
d3d12SDKLayers.dll!00007ff89e130d9b Unknown
d3d12SDKLayers.dll!00007ff89e115252 Unknown
D3D12.dll!00007ff928d75737 Unknown
D3D12Core.dll!00007ff926e60ce3 Unknown
d3d12SDKLayers.dll!00007ff89e129732 Unknown
DirectML.dll!MetaCommand::TryCreate C++
DirectML.dll!ConvolutionMetaCommand::TryCreateLatest C++
DirectML.dll!QueryMetaCommand<RoiPoolingMetaCommand,DmlRoiPoolingOperatorDesc> C++
DirectML.dll!DmlMetaCommand::TryCreateConvolution C++
DirectML.dll!DmlConvolutionOperator::TryCompile C++
DirectML.dll!DmlConvolutionOperator::Compile C++
DirectML.dll!DmlDevice::CompileOperator C++
DirectML.dll!MLGraph::DML::DMLOpaqueOperationDesc::Compile C++
DirectML.dll!MLGraph::OperationNodeImpl::Compile C++
DirectML.dll!MLGraph::Compilation::CompileOperators::Execute C++
DirectML.dll!MLGraph::PassManager::ExecutePasses C++
DirectML.dll!MLGraph::DML::GraphCompiler::CompileGraph C++
DirectML.dll!DmlDevice::CompileGraphPrivate C++
DirectML.dll!DmlDevice::CompileGraph C++
DirectML.Debug.dll!DmlDeviceDebug::CompileGraph C++
> onnxruntime.dll!Dml::FusedGraphKernel::TranslateAndCompileGraph Line 145 C++
Describe the bug
Hi,
I'm seeing the following error when I try to execute a model with the DirectML EP w/ a debugger attached:
(The 2nd line of the message only appears when the D3D12 debug layer is enabled.)
It occurs when the model is first 'compiled': either the first time the model is executed, or right when the model is loaded if I specify all the dimension overrides via the AddFreeDimensionOverrideByName function. The error doesn't actually cause the program to fail when I run it, but it does cause a crash for another member of my team on slightly different hardware.
From my perspective, it seems like this is a bug in the DirectML EP or in DirectML itself. But I could see how it would be expected behavior if that's the only way to determine whether an operation can be executed via MetaCommand before falling back to the default implementation. Can someone please advise?
Thanks,
Carson
Urgency None...
System information