lshqqytiger / stable-diffusion-webui-amdgpu

Stable Diffusion web UI
GNU Affero General Public License v3.0

[Bug]: Embedding Training fails with DirectML without --no-half #3

Open majorsauce opened 1 year ago

majorsauce commented 1 year ago

Is there an existing issue for this?

What happened?

Training embeddings is not possible; an error is raised.

Steps to reproduce the problem

  1. Start this fork with the command-line argument "--medvram"
  2. Create a new embedding and train it with these options:
    • Learning Rate: 0.005
    • Gradient Clipping: disabled
    • Batch size: 1
    • Width: 64
    • Height: 64
    • Do not resize images: False
    • Max Steps: 2000
    • Save images with embedding in PNG chunks: False
    • Read parameters: False
    • Shuffle tags: False
    • Choose latent sampling method: once
  3. Train Embedding

What should have happened?

Embedding is trained

Commit where the problem happens

commit: ba374c74

What platforms do you use to access the UI?

Windows

What browsers do you use to access the UI?

Brave

Command Line Arguments

--medvram

List of extensions

None

Console logs

Training at rate of 0.005 until step 2000
Preparing dataset...
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00,  2.30s/it]
No saved optimizer exists in checkpoint
  0%|                                                                                         | 0/2000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "D:\stable-diffusion-webui-directml\modules\textual_inversion\textual_inversion.py", line 497, in train_embedding
    scaler.scale(loss).backward()
  File "D:\stable-diffusion-webui-directml\venv\lib\site-packages\torch\_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "D:\stable-diffusion-webui-directml\venv\lib\site-packages\torch\autograd\__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: m_device->CreateOperator(&opDesc, IID_PPV_ARGS(&op))


Additional information

No response

lshqqytiger commented 1 year ago

I can reproduce the same error. Unfortunately, it seems to be the same kind of error as an out-of-memory failure (failed to allocate ... bytes of tensor): https://github.com/microsoft/DirectML/issues/169. DirectML's memory management is not very good at the moment, so we need to wait for Microsoft's developers to optimize it.

lshqqytiger commented 1 year ago

It may be an NYI (not yet implemented) error rather than a memory-related one. See #32.
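
Two explanations have been raised in this thread: an allocation failure and an operator that DirectML cannot create. A rough way to sort console output into those two buckets is sketched below. The function name `classify_directml_error` and the matched substrings are assumptions made for illustration; they are not part of DirectML or the web UI, and a match is only a hint, not a diagnosis.

```python
def classify_directml_error(message: str) -> str:
    """Roughly bucket a RuntimeError message (heuristic, not a DirectML API)."""
    text = message.lower()
    # Allocation failures mentioned in this thread read like
    # "failed to allocate ... bytes of tensor".
    if "failed to allocate" in text or "out of memory" in text:
        return "memory"
    # The traceback in this issue fails inside m_device->CreateOperator(...).
    if "createoperator" in text:
        return "operator"
    return "unknown"


if __name__ == "__main__":
    # Message copied from the console log in this issue.
    msg = "m_device->CreateOperator(&opDesc, IID_PPV_ARGS(&op))"
    print(classify_directml_error(msg))
```

Applied to the message from the log above, the sketch lands in the "operator" bucket, which fits the NYI reading better than the memory one but does not confirm it.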

HaohuaLv commented 1 year ago

I got the same error. Have you solved it yet?