vladmandic / automatic

SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models
https://github.com/vladmandic/automatic
GNU Affero General Public License v3.0

[Issue]: unable to run any sdxl models with AMD rx6600 on windows #2384

Closed patientx closed 10 months ago

patientx commented 10 months ago

Issue Description

Command line parameters: `webui.bat --backend diffusers --medvram` or just `webui.bat --backend diffusers`

Started the app with the diffusers backend and medvram, and used the safe option so no extensions would interfere; "Enable model CPU offload (--medvram)" is also selected in the Diffusers section of the options.
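For context, the "model CPU offload" that --medvram toggles corresponds to the model-offload feature in diffusers. A minimal sketch of the same behaviour in plain diffusers, assuming a CUDA-capable torch build (on DirectML the device plumbing differs); the model path and prompt are taken from this report for illustration:

```python
# Minimal sketch of model CPU offload in plain diffusers (illustrative;
# SD.Next wires this up internally when --medvram is set).
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "copaxTimelessxlSDXL1_v7.safetensors",  # path as in the report
    torch_dtype=torch.float32,              # the log shows no-half forcing FP32
)
# Each sub-model (text encoders, UNet, VAE) is moved to the GPU only while
# it runs and back to system RAM afterwards, cutting peak VRAM use.
pipe.enable_model_cpu_offload()

image = pipe("a red car", num_inference_steps=20).images[0]
```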

Version Platform Description

```
Python 3.10.11 on Windows
Version: app=sd.next updated=2023-10-22 hash=be75ed7e url=https://github.com/vladmandic/automatic/tree/master
Platform: arch=AMD64 cpu=AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD system=Windows release=Windows-10-10.0.19045-SP0 python=3.10.11
Using CPU-only Torch ...
Running in safe mode without user extensions
Extension preload: {'extensions-builtin': 0.0}
Command line args: ['--backend', 'diffusers', '--medvram', '--safe'] medvram=True backend=diffusers safe=True
Engine: backend=Backend.DIFFUSERS compute=cpu mode=no_grad device=cpu cross-optimization="Sub-quadratic"
Device: ...
Select: model="copaxTimelessxlSDXL1_v7 [55818ae18a]"
Loading weights: D:\stable-diffusion-webui-directml\models\Stable-diffusion\copaxTimelessxlSDXL1_v7.safetensors ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/6.9 GB -:--:--
Torch override dtype: no-half set
Torch override VAE dtype: no-half set
Setting Torch parameters: device=cpu dtype=torch.float32 vae=torch.float32 unet=torch.float32 context=no_grad fp16=False bf16=False
Autodetect: model="Stable Diffusion XL" class=StableDiffusionXLPipeline file="D:\stable-diffusion-webui-directml\models\Stable-diffusion\copaxTimelessxlSDXL1_v7.safetensors" size=6617MB
Loaded embeddings: loaded=0 skipped=0 time=0.03s
Loaded model: time=31.72s { load=31.72s } native=1024 {'ram': {'used': 6.25, 'total': 15.93}}
Startup time: 54.43s { torch=8.39s gradio=1.01s diffusers=0.06s libraries=2.33s extensions=2.25s face-restore=0.51s upscalers=0.25s ui-extra-networks=0.48s ui-txt2img=0.09s ui-settings=0.16s ui-extensions=1.07s ui-defaults=0.20s launch=0.42s api=0.09s app-started=0.47s checkpoint=36.54s }
```

PC: Ryzen 3600X, 16 GB DDR4-3200, AMD RX 6600 8 GB GPU, Windows 10

Relevant log output

```
UnboundLocalError: local variable 'output' referenced before assignment
17:19:06-675272 ERROR    Prompt parser encode: Torch not compiled with CUDA enabled
17:19:06-680274 INFO     Torch not compiled with CUDA enabled
17:19:06-682274 ERROR    Exception: local variable 'output' referenced before assignment
17:19:06-683274 ERROR    Arguments: args=('task(x7xkj2me0ppwql9)', 'a red car\n', '\n', [], 20, 0, None, True, True, False, 1, 1, 7, 6, 0.7, 1, 3530596996.0, -1.0, 0, 0, 0, 1024, 1024, False, 0.5, 2, 'None', False, 20, 0, 0, 5, 0.8, '', '', [], 0, False, False, 'positive', 'comma', 0, False, False, '', 0, '', [], 0, '', [], 0, '', [], False, True, False, False, False, False, 0, 0, 2, 512, 512, True, 'None', 'None', 0, 0, 0.2) kwargs={}
17:19:06-689276 ERROR    gradio call: UnboundLocalError
Traceback (most recent call last):
D:\automatic\modules\call_queue.py:34 in f
    try:
        res = func(*args, **kwargs)
        progress.record_results(id_task, res)
D:\automatic\modules\txt2img.py:66 in txt2img
    if processed is None:
        processed = processing.process_images(p)
    p.close()
D:\automatic\modules\processing.py:687 in process_images
    with context_hypertile_vae(p), context_hypertile_unet(p):
        res = process_images_inner(p)
    finally:
D:\automatic\modules\processing.py:844 in process_images_inner
    from modules.processing_diffusers import process_diffusers
    x_samples_ddim = process_diffusers(p, p.seeds, p.prompts, p.negative_pro
    else:
D:\automatic\modules\processing_diffusers.py:497 in process_diffusers
    if not is_refiner_enabled:
        results = vae_decode(latents=output.images, model=shared.sd_model, full_quality=
UnboundLocalError: local variable 'output' referenced before assignment
```
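This traceback pattern is easy to reproduce in isolation: `output` is only assigned if the pipeline call succeeds, so when the earlier encode step throws ("Torch not compiled with CUDA enabled"), the later decode references a name that was never bound. A minimal sketch with hypothetical names, not the actual SD.Next code:

```python
# Hypothetical reduction of the failure in process_diffusers: if the
# generation step raises, `output` is never assigned, and the decode
# step trips UnboundLocalError instead of surfacing the real error.
def process(prompt: str, fail: bool):
    try:
        if fail:
            raise RuntimeError("Torch not compiled with CUDA enabled")
        output = {"images": [f"latents for {prompt!r}"]}
    except RuntimeError as e:
        print(f"encode failed: {e}")  # error is logged, but execution continues
    return output["images"]  # `output` is unbound if the try block raised

try:
    process("a red car", fail=True)
except UnboundLocalError as e:
    print(e)  # local variable 'output' referenced before assignment
```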

Backend

Diffusers

Model

SD-XL


vladmandic commented 10 months ago

the error indicates that nothing was generated, which is not an expected state. but i don't know why.

i've just pushed an update - please update and re-run with `webui --debug` to capture more info.

patientx commented 10 months ago

Thanks, will do in an hour when I'm able.

patientx commented 10 months ago

All right, after trying a bit more. First: `webui.bat --backend diffusers --use-directml --medvram --debug`. With medvram it can't even complete loading the model:

```
2023-10-22 19:49:43,029 | sd | ERROR | sd_models | Failed to load diffusers model
2023-10-22 19:49:43,033 | sd | ERROR | errors | loading Diffusers model: RuntimeError
```

So for the next run I used lowvram; with that it is at least able to load the model:

```
sd STATUS sd_models Select: model="copaxTimelessxlSDXL1_v7 [55818ae18a]"
sd DEBUG sd_models Load model weights: existing=False target=D:\stable-diffusion-webui-directml\models\Stable-diffusion\copaxTimelessxlSDXL1_v7.safetensors info=None
sd STATUS devices Torch override dtype: no-half set
sd STATUS devices Torch override VAE dtype: no-half set
sd DEBUG devices Desired Torch parameters: dtype=FP32 no-half=True no-half-vae=True upscast=False
sd STATUS devices Setting Torch parameters: device=privateuseone:0 dtype=torch.float32 vae=torch.float32 unet=torch.float32 context=no_grad fp16=False bf16=False
sd STATUS sd_models Autodetect: model="Stable Diffusion XL" class=StableDiffusionXLPipeline file="D:\stable-diffusion-webui-directml\models\Stable-diffusion\copaxTimelessxlSDXL1_v7.safetensors" size=6617MB
sd DEBUG sd_models Setting model: pipeline=StableDiffusionXLPipeline config={'low_cpu_mem_usage': True, 'torch_dtype': torch.float32, 'load_connected_pipeline': True, 'extract_ema': True, 'force_zeros_for_empty_prompt': True, 'requires_aesthetics_score': False, 'use_safetensors': True}
```

After this, with the lowvram option, I am able to create one image, but after that, no matter what, "RuntimeError: Could not allocate tensor with 1073741824 bytes. There is not enough GPU video memory available!" crops up at the end. I restart the app, but I have to restart the PC or reset the GPU with an app if I want to generate another image. So it is possible to use SDXL with an 8 GB AMD GPU with DirectML and the lowvram option enabled, but not practical since it only works for one image.

But even ignoring the long step time, I wasn't able to create a second image.
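For reference, if --lowvram maps to diffusers' sequential CPU offload (as the medvram option's naming suggests), a hedged sketch of that mode plus an explicit cleanup between runs looks like the following; whether SD.Next releases memory exactly this way on DirectML is an assumption:

```python
# Sketch of sequential CPU offload (the --lowvram behaviour) plus manual
# cleanup between generations; illustrative, not SD.Next's actual code.
import gc
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "copaxTimelessxlSDXL1_v7.safetensors", torch_dtype=torch.float32
)
# Weights move to the GPU one module at a time: minimal peak VRAM,
# but far slower per step than model offload (--medvram).
pipe.enable_sequential_cpu_offload()

image = pipe("a red car", num_inference_steps=20).images[0]

# Drop references and collect before the next run so allocations do not
# stack; DirectML exposes no empty_cache() equivalent, which may explain
# why a full GPU reset was needed to recover here.
del image
gc.collect()
```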

vladmandic commented 10 months ago

> So it is possible to use SDXL with an 8 GB AMD GPU with DirectML and the lowvram option enabled, but not practical since it only works for one image.

don't jump to conclusions. plenty of people are using sdxl with directml and 8gb and just medvram, which is much faster than lowvram.

and since you've posted a very short log section, i can't even see whether directml is installed and used.
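for reference, a quick way to confirm that directml is actually installed and can see the gpu - a minimal sketch using the torch-directml package that --use-directml relies on:

```python
# Minimal check that torch-directml is installed and detects the GPU
# (illustrative; sd.next performs its own device selection on startup).
import torch_directml

print("device count:", torch_directml.device_count())
for i in range(torch_directml.device_count()):
    print(f"  {i}: {torch_directml.device_name(i)}")

device = torch_directml.device()  # default DML device, shown as privateuseone:0
```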

btw, when quoting logs, wrap them in triple-tick (```) markers at the start and the end.

patientx commented 10 months ago

attaching full logs with medvram and lowvram separately.

medvram: sdnext.log

Tried "move model from VRAM to RAM" for CodeFormer (face restoration after generation); that let me generate at least a second image in the same run so far.

lowvram.log

vladmandic commented 10 months ago

i don't have the exact hw at hand to say what the best settings are, but to start with, i'd stop using --medvram and --lowvram completely, set model autoload to disabled (so you can start the server without automatically loading a model), and then set settings -> diffusers as desired.

you can start with this: [screenshot of suggested diffusers settings]

and for more, it's best to ask for best practices on discord, as there are plenty of users with amd gpus using directml.

patientx commented 10 months ago

thanks a lot, gonna play with those settings now. the results are better than sd 1.5, except for the long generation times ofc :)

ThePixelDiffusionPirate commented 10 months ago

For me it has never worked with SDXL under Win 11, AMD 5700 XT 8 GB. No matter how much I read on the Wiki and Discord, it just doesn't work^^ I'm either too stupid or unlucky :-) even with 512 x 512 I get "There is not enough GPU video memory available"...

patientx commented 10 months ago

> For me it has never worked with SDXL under Win 11, AMD 5700 XT 8 GB. No matter how much I read on the Wiki and Discord, it just doesn't work^^ I'm either too stupid or unlucky :-) even with 512 x 512 I get "There is not enough GPU video memory available"...

yes, this. but everyone keeps saying "it works"; no, it doesn't. maybe for people who have 8 GB NVIDIA cards, but not AMD. For the record, it gave memory errors after 3 gens this time, and after 1 the next...

vladmandic commented 10 months ago

closing the issue as this is not a product issue, but a tuning thing. like i said, use discussions or discord (recommended). and if you want better feedback, you may want to adjust the tone, otherwise ppl are not inclined to help as much - after all, this is free and open source.