microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

MetaCommand exception from DirectML EP #12328

Open carsonswope opened 2 years ago

carsonswope commented 2 years ago

Describe the bug

Hi,

I'm seeing the following error when I try to execute a model with the DirectML EP w/ a debugger attached:

    Exception thrown at 0x00007FFCB5ED4FD9 in run_demo.exe: Microsoft C++ exception: _com_error at memory location 0x000000E04D6F7400.
    D3D12 MESSAGE: ID3D12GraphicsCommandList::CreateMetaCommand: MetaCommand parameters are not supported by the current system configuration. [ STATE_CREATION MESSAGE #1172: META_COMMAND_UNSUPPORTED_PARAMS]

(The 2nd line of the message only appears when the D3D12 debug layer is enabled.)

It occurs when the model is first 'compiled' - either the first time the model is executed, or right when the model is loaded if I specify all the dimension overrides via the AddFreeDimensionOverrideByName function. The error doesn't actually cause the program to fail when I run it, but it does cause a crash for another member of my team on slightly different hardware.

From my perspective, it seems like this is a bug in the DirectML EP or in DirectML itself. But, I could see how it would be expected behavior if that's the only way to determine if an operation can be executed via MetaCommand before falling back to the default implementation. Can someone please advise?

Thanks,

Carson

Urgency None...

System information

carsonswope commented 2 years ago

models.zip

Model files attached here for a minimal reproduction. They are very simple models exported from pytorch w/ only a Conv2d operation. The only difference is that the 'no error' model does not specify a 'groups' argument, while the 'yes error' model does.
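For readers unfamiliar with the 'groups' argument, here is a minimal sketch of how it changes the convolution's weight tensor, using the layout PyTorch's Conv2d uses: {out_channels, in_channels / groups, kernel_h, kernel_w}. The helper name `ConvWeightShape` is hypothetical, and the channel counts used with it are made up for illustration; the shapes in the attached models are not stated in this thread.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not part of ONNX Runtime, DirectML, or PyTorch).
// Returns the weight (filter) shape of a 2D convolution in the layout
// {out_channels, in_channels / groups, kernel_h, kernel_w}.
std::array<std::int64_t, 4> ConvWeightShape(std::int64_t in_channels,
                                            std::int64_t out_channels,
                                            std::int64_t kernel_h,
                                            std::int64_t kernel_w,
                                            std::int64_t groups = 1) {
    assert(kernel_h > 0 && kernel_w > 0);
    // Both channel counts must divide evenly into the groups.
    if (groups < 1 || in_channels % groups != 0 || out_channels % groups != 0) {
        throw std::invalid_argument("channel counts must be divisible by groups");
    }
    return {out_channels, in_channels / groups, kernel_h, kernel_w};
}
```

With the default `groups = 1`, every output channel sees every input channel. With `groups = 4`, each output channel only sees a quarter of the input channels, so the operator carries a different parameter set, which a driver-specific fast path may or may not accept.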

RandySheriffH commented 2 years ago

@fdwr for insights.

sumitsays commented 2 years ago

Hi @carsonswope, the sample model passes on the following configuration: ORT: 1.12.0, EP: DML, D3D12 Debug Layer: ON, GPU: AMD Radeon VII.

I have also tried running the sample model with the free dimensions height and width overridden, using the same configuration as above, and it passes as well.

The error doesn't actually cause the program to fail when I run it, but it does cause a crash for another member of my team on slightly different hardware

Can you please provide us more details on what specific hardware the model is crashing on?

fdwr commented 2 years ago

@carsonswope These first-chance exceptions inside Direct3D are ignorable. It's unhandled exceptions that are worrisome, and these are caught and handled by D3D itself (returning a bad HRESULT to DirectML). It would be nice to avoid these first-chance _com_error exceptions in D3D in the first place (they are distracting red herrings and noisy), but DML can't know ahead of time every combination of operator parameters that a driver may or may not support :/.
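The probe-then-fall-back behavior described in this comment can be sketched roughly as below. This is only an illustration under stated assumptions, not how D3D12 or DirectML are implemented: every name here (`Status`, `ConvParams`, `CreateFastPathOrThrow`, `TryCreateFastPath`, `CompileConvolution`) is hypothetical, and the rule that `groups != 1` is rejected is invented purely to have something to reject.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins; none of these names come from D3D12 or DirectML.
using Status = std::int32_t;          // plays the role of an HRESULT
constexpr Status kOk = 0;
constexpr Status kUnsupported = -1;   // plays the role of a failing HRESULT

struct ConvParams {
    int groups;
};

// Inner layer: signals an unsupported parameter combination by throwing.
// ASSUMPTION: rejecting groups != 1 is made up for this example only.
void CreateFastPathOrThrow(const ConvParams& p) {
    if (p.groups != 1) {
        throw std::runtime_error("parameters not supported by this configuration");
    }
}

// Boundary layer: catches the exception and reports a status code instead.
// A debugger still reports the throw as a first-chance exception, even
// though it never escapes this function.
Status TryCreateFastPath(const ConvParams& p) noexcept {
    try {
        CreateFastPathOrThrow(p);
        return kOk;
    } catch (const std::exception&) {
        return kUnsupported;
    }
}

// Caller: probes the fast path and falls back when the probe fails.
std::string CompileConvolution(const ConvParams& p) {
    assert(p.groups >= 1);
    if (TryCreateFastPath(p) == kOk) {
        return "fast path";
    }
    return "default implementation";
}
```

In this sketch the caller cannot tell in advance which parameter sets the inner layer accepts, so attempting the creation and checking the returned status is the only way to find out. The exception only becomes a problem if some layer fails to catch it, which is the unhandled case discussed in the rest of this thread.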

it does cause a crash for another member of my team on slightly different hardware.

Is the hardware above nvidia geforce 1080 ti, 11GB yours or your teammate's repro machine?

carsonswope commented 2 years ago

Appreciate you looking into this @sumitsays and @fdwr

The nvidia geforce 1080ti, 11GB is mine. My teammate's machine, where the exception is unhandled, has an nvidia quadro P6000, 24GB. Driver version 516.59.

If you are having trouble reproducing from the minimal example I provided above, maybe try this one? This is the full version of the model I'm working with, and it triggers a bunch of the _com_error exceptions.

crash_repro_noweights.zip

Also, this is what the stack trace looks like for the crash, or at least as much of it as we're able to see:

    KernelBase.dll!00007ff94fef4fd9 Unknown
    ucrtbase.dll!00007ff94fbada1d   Unknown
    D3D12Core.dll!00007ff926dfe6b1  Unknown
    D3D12Core.dll!00007ff926de1e4c  Unknown
    D3D12Core.dll!00007ff926dc4f1c  Unknown
    d3d12SDKLayers.dll!00007ff89e1c4439 Unknown
    d3d12SDKLayers.dll!00007ff89e130d9b Unknown
    d3d12SDKLayers.dll!00007ff89e115252 Unknown
    D3D12.dll!00007ff928d75737  Unknown
    D3D12Core.dll!00007ff926e60ce3  Unknown
    d3d12SDKLayers.dll!00007ff89e129732 Unknown
    DirectML.dll!MetaCommand::TryCreate C++
    DirectML.dll!ConvolutionMetaCommand::TryCreateLatest    C++
    DirectML.dll!QueryMetaCommand<RoiPoolingMetaCommand,DmlRoiPoolingOperatorDesc>    C++
    DirectML.dll!DmlMetaCommand::TryCreateConvolution   C++
    DirectML.dll!DmlConvolutionOperator::TryCompile C++
    DirectML.dll!DmlConvolutionOperator::Compile    C++
    DirectML.dll!DmlDevice::CompileOperator C++
    DirectML.dll!MLGraph::DML::DMLOpaqueOperationDesc::Compile  C++
    DirectML.dll!MLGraph::OperationNodeImpl::Compile    C++
    DirectML.dll!MLGraph::Compilation::CompileOperators::Execute    C++
    DirectML.dll!MLGraph::PassManager::ExecutePasses    C++
    DirectML.dll!MLGraph::DML::GraphCompiler::CompileGraph  C++
    DirectML.dll!DmlDevice::CompileGraphPrivate C++
    DirectML.dll!DmlDevice::CompileGraph    C++
    DirectML.Debug.dll!DmlDeviceDebug::CompileGraph C++
>    onnxruntime.dll!Dml::FusedGraphKernel::TranslateAndCompileGraph Line 145    C++