microsoft / DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.
MIT License

torch-directml: BSOD in Ryzen iGPU Environment #379

Open reid3333 opened 1 year ago

reid3333 commented 1 year ago

A BSOD occurs when creating many small tensors in a Ryzen iGPU environment where the dedicated GPU memory is small.

Code

import math

import torch
import torch.nn as nn
import torch_directml

class Net(nn.Module):
    def __init__(self, n_params, device='cpu'):
        super().__init__()

        div = 10000
        mod = n_params % div
        w = h = int(math.sqrt(div))

        self.dummy_params = nn.ParameterList()
        for _ in range(n_params // div):
            self.dummy_params.append(nn.Parameter(torch.empty((w, h), device=device)))

        self.dummy_params.append(nn.Parameter(torch.empty(mod, device=device)))

def main():
    # device = 'cpu'
    device = torch_directml.device(0)
    n_params = 2500000000 // 4  # fp32

    # case 1. Copy large tensors from CPU to GPU
    x = torch.ones(n_params).to(device)
    print(x[0])

    # case 2. Create large tensors directly on the GPU
    x = torch.ones(n_params, device=device)
    print(x[0])

    # case 3. Large tensor wrapped in `nn.Parameter`, then copied from CPU to GPU
    x = nn.Parameter(torch.empty(n_params)).to(device)
    print(x[0])

    # case 4. Copy many small tensors from CPU to GPU
    net = Net(n_params).to(device)
    print(next(iter(net.parameters())))

    # case 5. Create many small tensors directly on the GPU
    net = Net(n_params, device=device)
    print(next(iter(net.parameters())))

if __name__ == '__main__':
    main()

Note: The code above lists all test cases sequentially for ease of viewing, but each test case was actually run individually.
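To give a sense of scale for the test cases above, here is a rough sketch in plain Python (no torch required) of the sizes involved. The helper name `describe_allocation` is hypothetical. The constants are taken from the code above, and the 4-byte element size assumes fp32, as the comment on `n_params` suggests.

```python
import math


def describe_allocation(n_params, div=10000, bytes_per_elem=4):
    """Summarize what Net(n_params) would allocate (hypothetical helper)."""
    side = int(math.sqrt(div))      # each small tensor is side x side
    n_small = n_params // div       # number of side x side tensors
    remainder = n_params % div      # length of the final 1-D tensor
    total_bytes = n_params * bytes_per_elem
    return {
        "side": side,
        "n_small": n_small,
        "remainder": remainder,
        "total_bytes": total_bytes,
        "total_gib": total_bytes / 2**30,
    }


info = describe_allocation(2500000000 // 4)
print(info)
```

With the values from the report, this works out to 62,500 tensors of 100 x 100 elements plus an empty remainder tensor, for 2.5 GB in total. The single-tensor cases and the `Net` cases therefore request the same total amount of memory and differ only in the number of allocations.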

Testing Environments 1

Testing Environments 2

Result

All test case results are the same in both test environments.

device=cpu

All test cases work.

device=directml and small Dedicated GPU Memory (512MB)

device=directml and large Dedicated GPU Memory (2048MB)

All test cases work.

The same result was obtained in an environment with twice the total amount of memory, so I do not think that lack of memory is the cause.

BSOD report from NirSoft BlueScreenView

==================================================
Dump File         : 012923-5421-01.dmp
Crash Time        : 2023/01/29 23:04:44
Bug Check String  : 
Bug Check Code    : 0x0000010e
Parameter 1       : 00000000`00000036
Parameter 2       : ffffcd02`e4b07830
Parameter 3       : ffffa400`35d3ec58
Parameter 4       : ffffa400`2e8b42c0
Caused By Driver  : watchdog.sys
Caused By Address : watchdog.sys+3ad0
File Description  : Watchdog Driver
Product Name      : Microsoft® Windows® Operating System
Company           : Microsoft Corporation
File Version      : 10.0.19041.868 (WinBuild.160101.0800)
Processor         : x64
Crash Address     : ntoskrnl.exe+3fa090
Stack Address 1   : 
Stack Address 2   : 
Stack Address 3   : 
Computer Name     : 
Full Path         : C:\Windows\Minidump\012923-5421-01.dmp
Processors Count  : 12
Major Version     : 15
Minor Version     : 19041
Dump File Size    : 2,023,660
Dump File Time    : 2023/01/29 23:05:32
==================================================

==================================================
Filename          : dxgmms2.sys
Address In Stack  : dxgmms2.sys+9ba59
From Address      : fffff801`a9de0000
To Address        : fffff801`a9ec1000
Size              : 0x000e1000
Time Stamp        : 0xf7422a06
Time String       : 2101/06/16 4:54:46
Product Name      : Microsoft® Windows® Operating System
File Description  : DirectX Graphics MMS
File Version      : 10.0.19041.2311 (WinBuild.160101.0800)
Company           : Microsoft Corporation
Full Path         : C:\Windows\system32\drivers\dxgmms2.sys
==================================================

==================================================
Filename          : watchdog.sys
Address In Stack  : watchdog.sys+3ad0
From Address      : fffff801`832a0000
To Address        : fffff801`832b8000
Size              : 0x00018000
Time Stamp        : 0xf13839ab
Time String       : 2098/03/30 13:57:15
Product Name      : Microsoft® Windows® Operating System
File Description  : Watchdog Driver
File Version      : 10.0.19041.868 (WinBuild.160101.0800)
Company           : Microsoft Corporation
Full Path         : C:\Windows\system32\drivers\watchdog.sys
==================================================

The error corresponding to bug check code 0x10E is VIDEO_MEMORY_MANAGEMENT_INTERNAL: https://learn.microsoft.com/ja-jp/windows-hardware/drivers/debugger/bug-check-0x10e---video-memory-management-internal
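For cross-checking the report against the linked documentation, the values printed by BlueScreenView can be converted to plain integers. This is a minimal sketch: `parse_bsv_hex` is a hypothetical helper, and it assumes the backtick in the parameter fields is only a visual separator between the upper and lower 32 bits.

```python
def parse_bsv_hex(text):
    """Convert a hex field from the report to an int (hypothetical helper).

    Assumes the backtick is only a visual separator between the
    upper and lower halves of a 64-bit value.
    """
    return int(text.replace("`", ""), 16)


# Values copied from the BlueScreenView report above.
bug_check_code = parse_bsv_hex("0x0000010e")
parameter_1 = parse_bsv_hex("00000000`00000036")

print(hex(bug_check_code), bug_check_code)
print(hex(parameter_1))
```

The bug check code is 0x10E in hexadecimal, which is 270 in decimal, and parameter 1 is 0x36, the subcode to look up in the bug check documentation.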

Supplementary Information

aka7774 commented 1 year ago

Same issue here: Dell, 4700U, 512MB. No BIOS update support.

TomArrow commented 1 year ago

I think I might be having this same issue when using lshqqytiger/stable-diffusion-webui-directml. Same issue as referenced above: https://github.com/lshqqytiger/stable-diffusion-webui-directml/issues/6

Quick copy paste: The BSOD is bug check code 0x10E (VIDEO_MEMORY_MANAGEMENT_INTERNAL) with parameter 0x36, which according to Microsoft means "The paging request failed on a paging packet or device resume that was previously marked as unrecoverable, and was expected to succeed subsequent calls." (see: https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0x10e---video-memory-management-internal ). No idea what it means.

Windows 10, AMD 5700G with pretty recent drivers.

Looking in Windows Task Manager, shared GPU memory usage goes up to around 7-8 GB (I might be remembering wrong, since it happens fast) out of the 16 GB shown as the maximum, and then I get the blue screen.

f-cero commented 10 months ago

I believe my BSoD is related. I'm getting the same stop code as @TomArrow:

Quick copy paste: The BSOD is bug check code 0x10E (VIDEO_MEMORY_MANAGEMENT_INTERNAL) with parameter 0x36, which according to Microsoft means "The paging request failed on a paging packet or device resume that was previously marked as unrecoverable, and was expected to succeed subsequent calls." (see: https://learn.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0x10e---video-memory-management-internal ).

My GPU is a Vega FE with 16 GB of VRAM, so low memory shouldn't be an issue, but I can't even complete a 512² diffusion with --medvram, and no XL will run at all, even with --lowvram.

What information would be useful? I have some sense that this is a driver-level issue, perhaps something in ADL or something to do with Resizable BAR. Then again, when isn't the issue a driver…

When I don't BSoD, I get other memory-related errors and have to close and reopen the console; restarting the UI doesn't help. I've been banging my head against the wall on this for a few days now, and I'm not sure what to try next.