[Bug]: "The GPU device instance has been suspended" exception during Textual Inversion training on AMD GPU (RX 7800)

Checklist

[X] The issue exists after disabling all extensions
[X] The issue exists on a clean installation of webui
[ ] The issue is caused by an extension, but I believe it is caused by a bug in the webui
[X] The issue exists in the current version of the webui
[ ] The issue has not been reported before recently
[X] The issue has been reported before but has not been fixed yet

What happened?

Any Textual Inversion (TI) training run fails immediately after dataset preparation, at step 0 during the first backward pass, with the RuntimeError shown in the console logs below.

Steps to reproduce the problem

1. Launch web UI
2. Open "Train" tab
3. Select inner "Train" tab
4. Refresh embeddings list
5. Select embedding
6. Fill dataset path
7. Fill log path
8. Start training

What should have happened?

Training should begin

What browsers do you use to access the UI?

Mozilla Firefox

Sysinfo

sysinfo-2024-05-19-22-23.json

Console logs

venv "C:\stable-diffusion-webui-directml\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: v1.9.3-amd-13-g517aaaff
Commit hash: 517aaaff2bb1a512057d88b0284193b9f23c0b47
Installing torch and torchvision
Requirement already satisfied: torch==2.0.0 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (2.0.0)
Requirement already satisfied: torchvision==0.15.1 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (0.15.1)
Requirement already satisfied: torch-directml in C:\stable-diffusion-webui-directml\venv\lib\site-packages (0.2.0.dev230426)
Requirement already satisfied: jinja2 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.0.0) (3.1.4)
Requirement already satisfied: filelock in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.0.0) (3.14.0)
Requirement already satisfied: sympy in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.0.0) (1.12)
Requirement already satisfied: networkx in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.0.0) (3.3)
Requirement already satisfied: typing-extensions in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.0.0) (4.11.0)
Requirement already satisfied: numpy in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torchvision==0.15.1) (1.26.2)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torchvision==0.15.1) (9.5.0)
Requirement already satisfied: requests in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torchvision==0.15.1) (2.31.0)
Requirement already satisfied: MarkupSafe>=2.0 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from jinja2->torch==2.0.0) (2.1.5)
Requirement already satisfied: certifi>=2017.4.17 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from requests->torchvision==0.15.1) (2024.2.2)
Requirement already satisfied: charset-normalizer<4,>=2 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from requests->torchvision==0.15.1) (3.3.2)
Requirement already satisfied: urllib3<3,>=1.21.1 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from requests->torchvision==0.15.1) (2.2.1)
Requirement already satisfied: idna<4,>=2.5 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from requests->torchvision==0.15.1) (3.7)
Requirement already satisfied: mpmath>=0.19 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from sympy->torch==2.0.0) (1.3.0)

[notice] A new release of pip available: 22.2.1 -> 24.0
[notice] To update, run: C:\stable-diffusion-webui-directml\venv\Scripts\python.exe -m pip install --upgrade pip
You are up to date with the most recent release.
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
C:\stable-diffusion-webui-directml\venv\lib\site-packages\pytorch_lightning\utilities\distributed.py:258: LightningDeprecationWarning: `pytorch_lightning.utilities.distributed.rank_zero_only` has been deprecated in v1.8.1 and will be removed in v2.0.0. You can import it from `pytorch_lightning.utilities` instead.
  rank_zero_deprecation(
Launching Web UI with arguments: --use-directml --update-all-extensions --opt-sub-quad-attention --opt-split-attention --no-half --upcast-sampling --update-check --reinstall-torch
ONNX: version=1.18.0 provider=DmlExecutionProvider, available=['DmlExecutionProvider', 'CPUExecutionProvider']
==============================================================================
You are running torch 2.0.0+cpu.
The program is tested to work with torch 2.1.2.
To reinstall the desired version, run with commandline flag --reinstall-torch.
Beware that this will cause a lot of large files to be downloaded, as well as
there are reports of issues with training tab on the latest version.

Use --skip-version-check commandline argument to disable this check.
==============================================================================
Loading weights [6ce0161689] from C:\stable-diffusion-webui-directml\models\Stable-diffusion\v1-5-pruned-emaonly.safetensors
Creating model from config: C:\stable-diffusion-webui-directml\configs\v1-inference.yaml
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Startup time: 10.3s (prepare environment: 12.1s, initialize shared: 1.2s, load scripts: 1.2s, create ui: 0.4s, gradio launch: 0.3s).
Applying attention optimization: Doggettx... done.
Model loaded in 3.2s (load weights from disk: 0.5s, create model: 0.3s, apply weights to model: 2.2s).
Training at rate of 0.005 until step 100000
Preparing dataset...
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:02<00:00,  5.67it/s]
  0%|                                                                                       | 0/100000 [00:00<?, ?it/s]*** Error training embedding
    Traceback (most recent call last):
      File "C:\stable-diffusion-webui-directml\modules\textual_inversion\textual_inversion.py", line 553, in train_embedding
        scaler.scale(loss).backward()
      File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\torch\_tensor.py", line 487, in backward
        torch.autograd.backward(
      File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\torch\autograd\__init__.py", line 200, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action.

---
Applying attention optimization: Doggettx... done.

Additional information

I see similar reports for other parts of the UI, such as this one: https://github.com/lshqqytiger/stable-diffusion-webui-amdgpu/issues/71#issue-1661243199. Some reports attribute the error to too much VRAM being allocated, but I don't think that's the case here. I am opening this separate ticket for training specifically.

EDIT: I just realized this may be important: normal image generation (i.e., txt2img and img2img) works fine. I only experience problems with training.
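
Since the traceback ends in `backward()` while inference works, a standalone check outside the webui might help narrow down whether a backward pass fails on the DirectML device at all. The following is only a minimal sketch, not part of the webui: the function name `check_directml_backward` and the tensor sizes are hypothetical, and it assumes the `torch-directml` package from the venv above is importable. A passing result would not rule out a problem specific to the training code.

```python
"""Minimal backward-pass check on a DirectML device, independent of the webui."""


def check_directml_backward(size: int = 64) -> dict:
    """Run one tiny forward/backward pass and report what happened.

    Returns a dict whose "status" is "ok", "failed", or "skipped".
    The name and the default size are illustrative assumptions.
    """
    try:
        import torch
        import torch_directml  # provided by the torch-directml package
    except (ImportError, OSError) as exc:
        return {"status": "skipped", "detail": f"import failed: {exc}"}

    try:
        device = torch_directml.device()  # default DirectML adapter
        # Small trainable tensor, created on the CPU and moved to the device,
        # then marked as a leaf that requires gradients.
        weights = torch.randn(size, size).to(device).requires_grad_()
        inputs = torch.randn(size, size).to(device)
        loss = (inputs @ weights).pow(2).mean()
        loss.backward()
        # Copying the gradient to the CPU forces queued GPU work to finish.
        grad_norm = weights.grad.to("cpu").norm().item()
    except RuntimeError as exc:
        return {"status": "failed", "detail": str(exc)}
    return {"status": "ok", "detail": f"grad norm = {grad_norm:.4f}"}


if __name__ == "__main__":
    print(check_directml_backward())
```

Running this with the venv's Python (`C:\stable-diffusion-webui-directml\venv\Scripts\python.exe`) would report "failed" with the exception text if the same "device instance has been suspended" error occurs outside the training loop.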
