microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Consecutive calls to Session::Run() with DML EP crashing #6003

Open j-paulus opened 3 years ago

j-paulus commented 3 years ago

Describe the bug I am trying to use the DirectML EP to accelerate the network computations. The program loads the model and runs it multiple times, each time with a different input. Running on the CPU EP, this works fine. Running on the CUDA EP under Linux, this works fine. Running on the CUDA EP under Windows, there is a significant delay before processing starts, but eventually it runs. With the DML EP, the execution crashes on the second call to Session::Run().

The crash seems to depend on the model size. With the example model below, the crash happens with a kernel size of 512x512, but not with 384x384 (these numbers are probably valid only on my system), i.e., the bigger model crashes. This suggests some memory-handling issue.

I am using the 1.5.2 DLL available under releases.

Urgency

System information

To Reproduce

Create the model with PyTorch (1.7):

import torch


class ConvAnaSyn(torch.nn.Module):
    # Analysis/synthesis pair: strided conv1d followed by conv_transpose1d,
    # both using a fixed identity matrix as the kernel.
    def __init__(self, n=384):
        super(ConvAnaSyn, self).__init__()
        self.len = n
        self.hop = self.len // 2
        # Non-trainable identity kernel of shape (n, 1, n).
        self.weight = torch.nn.Parameter(torch.eye(self.len)[:, None, :], requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.permute((0, 2, 1))
        y = torch.nn.functional.conv1d(x, weight=self.weight, bias=None, stride=self.hop, padding=0)
        z = torch.nn.functional.conv_transpose1d(y, weight=self.weight, bias=None, stride=self.hop, padding=0)
        z = z.permute((0, 2, 1))
        return z


n_dim = 384  # use 512 to trigger the crash
model = ConvAnaSyn(n_dim).float()
x_in = torch.randn((3, 48000, 1), requires_grad=False)
x_out = model(x_in)
new_model_name = 'kernel_{}.onnx'.format(n_dim)

torch.onnx.export(model, (x_in, ), new_model_name, example_outputs=(x_out, ), input_names=['x_in'], output_names=['x_out'], opset_version=11, enable_onnx_checker=True)

C++ program to run the model (linked against the DLL):

#include <onnxruntime_cxx_api.h>
#include <dml_provider_factory.h>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, const char* argv[]) {
    //std::wstring model_file(L"kernel_384.onnx");
    std::wstring model_file(L"kernel_512.onnx");
    Ort::Env env(ORT_LOGGING_LEVEL_INFO, "test");
    Ort::SessionOptions session_options;
    session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    // The DML EP requires memory patterns to be disabled and sequential execution.
    session_options.DisableMemPattern();
    session_options.SetExecutionMode(ExecutionMode::ORT_SEQUENTIAL);
    Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_DML(session_options, 0));
    Ort::Session session(env, model_file.c_str(), session_options);
    Ort::AllocatorWithDefaultOptions allocator;
    std::vector<int64_t> input_node_dims = session.GetInputTypeInfo(0).GetTensorTypeAndShapeInfo().GetShape();
    std::vector<const char*> input_names = { "x_in" };
    std::vector<const char*> output_names = { "x_out" };
    std::vector<Ort::Value> input_tensors;
    size_t input_tensor_size = input_node_dims[0] * input_node_dims[1] * input_node_dims[2];
    // Zero-initialized input buffer; the actual values do not matter for the repro.
    std::vector<float> input_tensor_values(input_tensor_size);
    auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    input_tensors.emplace_back(Ort::Value::CreateTensor<float>(memory_info, input_tensor_values.data(), input_tensor_size, input_node_dims.data(), input_node_dims.size()));

    // With the DML EP and the 512x512 kernel, the second iteration crashes.
    for (int loop_idx = 0; loop_idx < 10; loop_idx++) {
        std::cout << "Loop number " << loop_idx + 1 << std::endl;
        std::vector<Ort::Value> output_tensors = session.Run(Ort::RunOptions(nullptr), input_names.data(), input_tensors.data(), input_names.size(), output_names.data(), output_names.size());
    }
    return EXIT_SUCCESS;
}
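
To surface the failure as an error message rather than a crash, the loop above can be wrapped in a try/catch. A minimal sketch of a drop-in replacement for the loop, assuming only the standard ONNX Runtime C++ API (Ort::Exception derives from std::exception and carries an OrtErrorCode); it reuses the variables from the repro above:

    // Diagnostic variant of the loop: catch the Ort::Exception that
    // Session::Run() throws on failure and print its code and message.
    for (int loop_idx = 0; loop_idx < 10; loop_idx++) {
        try {
            std::vector<Ort::Value> output_tensors = session.Run(Ort::RunOptions(nullptr), input_names.data(), input_tensors.data(), input_names.size(), output_names.data(), output_names.size());
            // Also check that a non-empty tensor actually came back.
            if (output_tensors.empty() || !output_tensors[0].IsTensor()) {
                std::cerr << "Run " << loop_idx + 1 << " returned no tensor output" << std::endl;
                break;
            }
        } catch (const Ort::Exception& e) {
            std::cerr << "Run " << loop_idx + 1 << " failed with OrtErrorCode "
                      << e.GetOrtErrorCode() << ": " << e.what() << std::endl;
            break;
        }
    }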

Expected behavior Multiple calls to Session::Run() are expected to work without crashing with the DML EP as well.


j-paulus commented 3 years ago

Update: The exact same problem is present in ONNX Runtime 1.6.0.

j-paulus commented 3 years ago

Update: The problem persists in ONNX Runtime 1.7.0 with DML 1.4.2.

fdwr commented 2 years ago

@j-paulus Sorry this got buried in the notification noise - I'm cleaning up old issues. :/ I was able to repro a C++ Ort::Exception with ORT 1.10 using the value 768 on an Nvidia Quadro P400 (thanks for the repro snippets). output_tensors has size 1, but the contained Ort::Value is empty. After enabling the debug layer via dxcpl.exe...

[screenshot: dxcpl.exe with the D3D12 debug layer forced on]

...I saw the following output:

D3D12 ERROR: ID3D12Device::RemoveDevice: Device removal has been triggered for the following reason
DXGI_ERROR_DEVICE_HUNG: The Device took an unreasonable amount of time to execute its commands, or the hardware crashed/hung.
As a result, the TDR (Timeout Detection and Recovery) mechanism has been triggered.
The current Device Context was executing commands when the hang occurred.
The application may want to respawn and fallback to less aggressive use of the display hardware).
[ EXECUTION ERROR #232: DEVICE_REMOVAL_PROCESS_AT_FAULT]

I suspect the ConvTranspose is taking too long.
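
For reference, the debug layer can also be enabled programmatically rather than through dxcpl.exe. A minimal sketch, assuming the standard D3D12 SDK headers (d3d12.h, linked against d3d12.lib); the hypothetical helper must run before any D3D12 device is created, i.e. before constructing the Ort::Session with the DML EP:

    #include <d3d12.h>
    #include <wrl/client.h>

    // Enable the D3D12 debug layer (equivalent to forcing it via dxcpl.exe).
    // Call before the first device is created, i.e. before Ort::Session.
    void EnableD3D12DebugLayer() {
        Microsoft::WRL::ComPtr<ID3D12Debug> debug_controller;
        if (SUCCEEDED(D3D12GetDebugInterface(IID_PPV_ARGS(&debug_controller)))) {
            debug_controller->EnableDebugLayer();
        }
    }

With the debug layer on, device-removal errors such as the DXGI_ERROR_DEVICE_HUNG above appear in the debugger output, which points at a TDR timeout rather than a memory issue.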