lshqqytiger / stable-diffusion-webui-amdgpu

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: "The GPU device instance has been suspended" exception when Textual Inversion training on AMD GPU (RX 7800) #465

Open DThaiPome opened 5 months ago

DThaiPome commented 5 months ago

What happened?

When I start any Textual Inversion (TI) training run, it fails immediately after dataset preparation, at step 0, with the RuntimeError shown in the console logs.

Steps to reproduce the problem

  1. Launch web UI
  2. Open "Train" tab
  3. Select inner "Train" tab
  4. Refresh embeddings list
  5. Select embedding
  6. Fill dataset path
  7. Fill log path
  8. Start training

What should have happened?

Training should begin once dataset preparation finishes.

What browsers do you use to access the UI?

Mozilla Firefox

Sysinfo

sysinfo-2024-05-19-22-23.json

Console logs

venv "C:\stable-diffusion-webui-directml\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: v1.9.3-amd-13-g517aaaff
Commit hash: 517aaaff2bb1a512057d88b0284193b9f23c0b47
Installing torch and torchvision
Requirement already satisfied: torch==2.0.0 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (2.0.0)
Requirement already satisfied: torchvision==0.15.1 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (0.15.1)
Requirement already satisfied: torch-directml in C:\stable-diffusion-webui-directml\venv\lib\site-packages (0.2.0.dev230426)
Requirement already satisfied: jinja2 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.0.0) (3.1.4)
Requirement already satisfied: filelock in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.0.0) (3.14.0)
Requirement already satisfied: sympy in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.0.0) (1.12)
Requirement already satisfied: networkx in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.0.0) (3.3)
Requirement already satisfied: typing-extensions in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torch==2.0.0) (4.11.0)
Requirement already satisfied: numpy in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torchvision==0.15.1) (1.26.2)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torchvision==0.15.1) (9.5.0)
Requirement already satisfied: requests in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from torchvision==0.15.1) (2.31.0)
Requirement already satisfied: MarkupSafe>=2.0 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from jinja2->torch==2.0.0) (2.1.5)
Requirement already satisfied: certifi>=2017.4.17 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from requests->torchvision==0.15.1) (2024.2.2)
Requirement already satisfied: charset-normalizer<4,>=2 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from requests->torchvision==0.15.1) (3.3.2)
Requirement already satisfied: urllib3<3,>=1.21.1 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from requests->torchvision==0.15.1) (2.2.1)
Requirement already satisfied: idna<4,>=2.5 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from requests->torchvision==0.15.1) (3.7)
Requirement already satisfied: mpmath>=0.19 in C:\stable-diffusion-webui-directml\venv\lib\site-packages (from sympy->torch==2.0.0) (1.3.0)

[notice] A new release of pip available: 22.2.1 -> 24.0
[notice] To update, run: C:\stable-diffusion-webui-directml\venv\Scripts\python.exe -m pip install --upgrade pip
You are up to date with the most recent release.
no module 'xformers'. Processing without...
no module 'xformers'. Processing without...
No module 'xformers'. Proceeding without it.
C:\stable-diffusion-webui-directml\venv\lib\site-packages\pytorch_lightning\utilities\distributed.py:258: LightningDeprecationWarning: `pytorch_lightning.utilities.distributed.rank_zero_only` has been deprecated in v1.8.1 and will be removed in v2.0.0. You can import it from `pytorch_lightning.utilities` instead.
  rank_zero_deprecation(
Launching Web UI with arguments: --use-directml --update-all-extensions --opt-sub-quad-attention --opt-split-attention --no-half --upcast-sampling --update-check --reinstall-torch
ONNX: version=1.18.0 provider=DmlExecutionProvider, available=['DmlExecutionProvider', 'CPUExecutionProvider']
==============================================================================
You are running torch 2.0.0+cpu.
The program is tested to work with torch 2.1.2.
To reinstall the desired version, run with commandline flag --reinstall-torch.
Beware that this will cause a lot of large files to be downloaded, as well as
there are reports of issues with training tab on the latest version.

Use --skip-version-check commandline argument to disable this check.
==============================================================================
Loading weights [6ce0161689] from C:\stable-diffusion-webui-directml\models\Stable-diffusion\v1-5-pruned-emaonly.safetensors
Creating model from config: C:\stable-diffusion-webui-directml\configs\v1-inference.yaml
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
C:\stable-diffusion-webui-directml\venv\lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Startup time: 10.3s (prepare environment: 12.1s, initialize shared: 1.2s, load scripts: 1.2s, create ui: 0.4s, gradio launch: 0.3s).
Applying attention optimization: Doggettx... done.
Model loaded in 3.2s (load weights from disk: 0.5s, create model: 0.3s, apply weights to model: 2.2s).
Training at rate of 0.005 until step 100000
Preparing dataset...
100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:02<00:00,  5.67it/s]
  0%|                                                                                       | 0/100000 [00:00<?, ?it/s]*** Error training embedding
    Traceback (most recent call last):
      File "C:\stable-diffusion-webui-directml\modules\textual_inversion\textual_inversion.py", line 553, in train_embedding
        scaler.scale(loss).backward()
      File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\torch\_tensor.py", line 487, in backward
        torch.autograd.backward(
      File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\torch\autograd\__init__.py", line 200, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    RuntimeError: The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action.

---
Applying attention optimization: Doggettx... done.

Additional information

I see similar reports for other parts of the UI, such as https://github.com/lshqqytiger/stable-diffusion-webui-amdgpu/issues/71#issue-1661243199. Some of those reports attribute the error to allocating too much VRAM, but I don't think that's the case here. I'm opening this separate ticket for training specifically.

EDIT: I just realized this may be important: normal image generation (i.e., txt2img and img2img) works fine. I only experience problems with training.
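
To compare this report with the similar ones, the useful parts of the log are the last application frame (outside site-packages) and the final error line. Below is a minimal Python sketch that pulls both out of a pasted log. `summarize_traceback` and the two regular expressions are illustrative helpers, not part of the web UI, and the excerpt is copied from the console logs in this report.

```python
import re

# Excerpt copied from the console logs in this report (raw string keeps the backslashes).
LOG = r'''*** Error training embedding
    Traceback (most recent call last):
      File "C:\stable-diffusion-webui-directml\modules\textual_inversion\textual_inversion.py", line 553, in train_embedding
        scaler.scale(loss).backward()
      File "C:\stable-diffusion-webui-directml\venv\lib\site-packages\torch\_tensor.py", line 487, in backward
        torch.autograd.backward(
    RuntimeError: The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action.
'''

# Matches the 'File "...", line N, in func' lines of a Python traceback.
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')
# Matches a final 'SomeError: message' line (assumes the type name ends in Error or Exception).
ERROR_RE = re.compile(r'^\s*(?P<type>\w+(?:Error|Exception)): (?P<message>.+)$', re.MULTILINE)


def summarize_traceback(log_text):
    """Return the error type, message, and the innermost frame outside site-packages."""
    frames = [m.groupdict() for m in FRAME_RE.finditer(log_text)]
    app_frames = [f for f in frames if "site-packages" not in f["path"]]
    error = ERROR_RE.search(log_text)
    return {
        "error_type": error.group("type") if error else None,
        "message": error.group("message") if error else None,
        "app_frame": app_frames[-1] if app_frames else None,
    }


summary = summarize_traceback(LOG)
print(summary["error_type"], "in", summary["app_frame"]["func"])
```

For this log, the application frame is `train_embedding` at line 553 of `textual_inversion.py`, i.e. the failure happens in the backward pass rather than during dataset preparation.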

lshqqytiger commented 5 months ago

DirectML is buggy and not appropriate for training. Please use ZLUDA if possible.
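
A minimal sketch of how one could tell from a console log whether a run used the DirectML backend before attempting training. `parse_launch_flags` and `training_backend_warning` are hypothetical helpers written for illustration and are not part of the web UI. The launch line is copied from the console logs in this issue.

```python
import shlex

# Launch line copied from the console logs in this issue.
LAUNCH_LINE = (
    "Launching Web UI with arguments: --use-directml --update-all-extensions "
    "--opt-sub-quad-attention --opt-split-attention --no-half --upcast-sampling "
    "--update-check --reinstall-torch"
)

PREFIX = "Launching Web UI with arguments:"


def parse_launch_flags(log_line):
    """Split the 'Launching Web UI with arguments:' log line into individual flags."""
    if not log_line.startswith(PREFIX):
        raise ValueError("not a launch-arguments log line")
    return shlex.split(log_line[len(PREFIX):])


def training_backend_warning(flags):
    """Return a warning string if the DirectML backend is selected, else None."""
    if "--use-directml" in flags:
        return "DirectML backend selected; training may fail with a device-suspended error."
    return None


flags = parse_launch_flags(LAUNCH_LINE)
warning = training_backend_warning(flags)
if warning:
    print(warning)
```

This only inspects the flags recorded in the log. It does not detect which backend torch actually initialized.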