[Bug]: Embedding Training fails with DirectML without --no-half

majorsauce commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues and checked the recent builds/commits

What happened?

Training an embedding is not possible: an error is raised before the first training step completes.

Steps to reproduce the problem

Start this fork with the command-line argument "--medvram"
Create a new embedding and train it with these options:
- Learning rate: 0.005
- Gradient Clipping: disabled
- Batch size: 1
- Width: 64
- Height: 64
- Do not resize images: False
- Max Steps: 2000
- Save images with embedding in PNG chunks: False
- Read parameters: False
- Shuffle tags: False
- Choose latent sampling method: once
Train Embedding

What should have happened?

The embedding is trained without errors.

Commit where the problem happens

commit: ba374c74

What platforms do you use to access the UI?

Windows

What browsers do you use to access the UI?

Brave

Command Line Arguments

--medvram
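
Note on the arguments: the issue title says the failure occurs without --no-half, so adding that flag is the obvious thing to try. Below is a minimal sketch of composing the argument string in Python. The helper name `with_flag` is hypothetical, and it is an assumption that the launcher in this fork reads its arguments from the COMMANDLINE_ARGS environment variable. Nothing in this thread confirms that the flag resolves the error on every setup.

```python
import os
import shlex


def with_flag(args: str, flag: str) -> str:
    """Return the argument string with `flag` appended if it is not already present.

    Hypothetical helper for illustration; not part of the web UI.
    """
    parts = shlex.split(args)
    if flag not in parts:
        parts.append(flag)
    return shlex.join(parts)


# "--medvram" is the argument from this report; "--no-half" comes from the issue title.
args = with_flag("--medvram", "--no-half")
print(args)  # --medvram --no-half

# Assumption: the launcher picks its arguments up from COMMANDLINE_ARGS.
# This only builds the environment mapping; it does not start anything.
env = dict(os.environ, COMMANDLINE_ARGS=args)
```

Calling the helper again on a string that already contains the flag leaves it unchanged, so it is safe to apply repeatedly.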

List of extensions

None

Console logs

Training at rate of 0.005 until step 2000
Preparing dataset...
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00,  2.30s/it]
No saved optimizer exists in checkpoint
  0%|                                                                                         | 0/2000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "D:\stable-diffusion-webui-directml\modules\textual_inversion\textual_inversion.py", line 497, in train_embedding
    scaler.scale(loss).backward()
  File "D:\stable-diffusion-webui-directml\venv\lib\site-packages\torch\_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "D:\stable-diffusion-webui-directml\venv\lib\site-packages\torch\autograd\__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: m_device->CreateOperator(&opDesc, IID_PPV_ARGS(&op))



Additional information

No response

lshqqytiger commented 1 year ago

I can reproduce the same error. Unfortunately, it seems to be the same kind of error as an out-of-memory failure (failed to allocate ... bytes of tensor): https://github.com/microsoft/DirectML/issues/169. DirectML's memory management is not very good at this time, so we need to wait for Microsoft's developers to optimize it.

lshqqytiger commented 1 year ago

It may be an NYI (not yet implemented) error rather than a memory-related one. See #32.

HaohuaLv commented 1 year ago

I got the same error. Have you solved it yet?
