vladmandic / automatic

SD.Next: Advanced Implementation of Stable Diffusion and other Diffusion-based generative image models
https://github.com/vladmandic/automatic
GNU Affero General Public License v3.0

[Issue]: stable-fast errors with SDXL #2991

Closed MysticDaedra closed 4 months ago

MysticDaedra commented 8 months ago

Issue Description

Trying stable-fast for the first time (apparently I hadn't had it installed properly before), on a fresh --reinstall. Latest dev, info below. With SDXL, it returns a bunch of errors and either freezes or generates a blank image (not even black, just... non-existent).

It seems to work fine with SD 1.5.

Version Platform Description

Python 3.10.6 (I know... need to update)
Windows 11 Professional
Dev a0fd8210
RTX 3070 8GB
Torch 2.2.1, CUDA 12.1, cuDNN 8801
Diffusers 0.27.0, Gradio 3.43.2
Mozilla Firefox

Relevant log output

Posted in discord: https://discord.com/channels/1101998836328697867/1130536562422186044/1220256125686120468

Backend

Diffusers

Branch

Dev

Model

SD-XL

Acknowledgements

vladmandic commented 7 months ago

please upload the actual log here, it's hard to follow link-to-link-to-log and then download it.

MysticDaedra commented 7 months ago

I totally forgot I could just upload the file, apologies. sdnext.log

vladmandic commented 7 months ago

i cannot reproduce the problem, see my log below:

09:54:34-225971 INFO     Autodetect: model="Stable Diffusion XL" class=StableDiffusionXLPipeline file="/mnt/models/Stable-diffusion/sdxl/miaanimeSFWNSFWSDXL_v40.safetensors" size=6617MB
09:54:39-637988 DEBUG    Setting model: pipeline=StableDiffusionXLPipeline config={'low_cpu_mem_usage': True, 'torch_dtype': torch.float16, 'load_connected_pipeline': True, 'extract_ema': True, 'original_config_file': 'configs/sd_xl_base.yaml', 'use_safetensors': True}
09:54:39-639071 DEBUG    Setting model: enable VAE slicing
09:54:42-886140 INFO     Model compile: pipeline=StableDiffusionXLPipeline mode=reduce-overhead backend=stable-fast fullgraph=True compile=['Model', 'VAE']
09:54:42-988677 INFO     Model compile: task='Stable-fast' config={'memory_format': torch.contiguous_format, 'enable_jit': True, 'enable_jit_freeze': True, 'preserve_parameters': True, 'enable_cnn_optimization': True, 'enable_fused_linear_geglu': True, 'prefer_lowp_gemm': True, 'enable_xformers': False, 'enable_cuda_graph': True,
                         'enable_triton': True, 'trace_scheduler': False} time=0.02
09:54:43-908654 DEBUG    GC: collected=143 device=cuda {'ram': {'used': 1.38, 'total': 47.05}, 'gpu': {'used': 8.59, 'total': 23.99}, 'retries': 0, 'oom': 0} time=0.26
09:54:43-914407 INFO     Load model: time=9.43 load=9.43 native=1024 {'ram': {'used': 1.38, 'total': 47.05}, 'gpu': {'used': 8.59, 'total': 23.99}, 'retries': 0, 'oom': 0}
09:55:19-933228 INFO     Applying hypertile: unet=320
09:55:19-951866 INFO     Base: class=StableDiffusionXLPipeline
09:55:20-319214 DEBUG    Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE set={'prompt_embeds': torch.Size([1, 77, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]), 'negative_prompt_embeds': torch.Size([1, 77, 2048]), 'negative_pooled_prompt_embeds': torch.Size([1, 1280]), 'guidance_scale': 6,
                         'generator': device(type='cuda'), 'num_inference_steps': 10, 'eta': 1.0, 'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1280, 'height': 720, 'parser': 'Full parser'}
09:55:20-328863 DEBUG    Sampler: sampler="UniPC" config={'num_train_timesteps': 1000, 'beta_start': 0.00085, 'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon', 'solver_order': 2, 'thresholding': False, 'sample_max_value': 1.0, 'predict_x0': 'bh1', 'lower_order_final': True}
Progress  3.75it/s █████████████████████████████████ 100% 10/10 00:02 00:00 Base
09:55:30-702229 INFO     Saving: image="outputs/text/02031-miaanimeSFWNSFWSDXL_v40-mad max young woman character dancing and wearing.jpg" type=JPEG resolution=1280x720 size=0
09:55:30-712135 INFO     Processed: images=1 time=10.77 its=0.93 memory={'ram': {'used': 3.26, 'total': 47.05}, 'gpu': {'used': 13.96, 'total': 23.99}, 'retries': 0, 'oom': 0}
09:56:00-496385 DEBUG    Server: alive=True jobs=1 requests=76 uptime=925 memory=3.26/47.05 backend=Backend.DIFFUSERS state=idle

first, try to reduce variables - don't use hypertile at the same time as stable-fast (in my case it does work, but the rule of troubleshooting is always to reduce variables) and try to come up with as simple a reproducible scenario as possible.
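
For reference, the compile options in the log above map onto stable-fast's own CompilationConfig. A minimal sketch of compiling a diffusers SDXL pipeline with the upstream sfast package (the import path, option values, and checkpoint path are assumptions based on the stable-fast project, not SD.Next's internal code):

```python
# Hedged sketch: compiling a diffusers SDXL pipeline with stable-fast.
# The import path is taken from the upstream sfast package and may differ between versions.
import torch
from diffusers import StableDiffusionXLPipeline
from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig

pipe = StableDiffusionXLPipeline.from_single_file(
    "miaanimeSFWNSFWSDXL_v40.safetensors",  # placeholder checkpoint path
    torch_dtype=torch.float16,
).to("cuda")

config = CompilationConfig.Default()
config.enable_jit = True          # TorchScript-trace the UNet/VAE
config.enable_cuda_graph = True   # capture CUDA graphs for repeated static shapes
config.enable_triton = False      # triton wheels are not published for native Windows
config.enable_xformers = False

pipe = compile(pipe, config)      # the first generation triggers the (slow) warm-up compile
```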

MysticDaedra commented 7 months ago

The freezing doesn't seem to be happening anymore, not sure how that was fixed, but the console errors remain. I disabled hypertile, removed all loras from the prompt, and turned off adetailer. sdnext.log

vladmandic commented 7 months ago

i cannot reproduce. try setting inference mode to default no_grad?

vladmandic commented 7 months ago

any updates?

MysticDaedra commented 7 months ago

Here's with no_grad, seems to be the same error: sdnext.log

Sorry for taking so long on this, been pretty busy and didn't want to deal with it :/

MysticDaedra commented 7 months ago

I remembered that you said to disable hypertile, so here's another run with hypertile disabled. sdnext.log

vladmandic commented 7 months ago

Model compile: task='Stable-fast' config={'memory_format': torch.contiguous_format, 'enable_jit': True, 'enable_jit_freeze': True, 'preserve_parameters': True, 'enable_cnn_optimization': True, 'enable_fused_linear_geglu': True, 'prefer_lowp_gemm': True, 'enable_xformers': False, 'enable_cuda_graph': True, 'enable_triton': False, 'trace_scheduler': False} time=0.02

i just noticed that triton is not available - stable-fast works without triton in theory only; i never actually waited long enough for the compile to finish as it's incredibly slow without it.

can you try pip install triton from your venv?

also, you have model offload enabled, which means the model is on the cpu at the time of the compile. can you try with model offloading disabled?

and pls check if the same error occurs with different sdxl models?

a bit of background: torch inference mode or no-grad are supposed to set all params to no-grad, but they can only do that for known and initialized params. it seems that the model you're loading includes some params that are not known, so they are left as-is, and then later the compile fails because compile requires that all params are in no-grad mode.
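
In plain torch terms, a rough sketch of the pre-compile conditions being described (hypothetical checkpoint path, not SD.Next's actual code): weights resident on the GPU rather than offloaded, and every parameter explicitly frozen before anything is traced:

```python
# Hedged sketch of the pre-compile state described above.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_single_file(
    "some_sdxl_checkpoint.safetensors",  # hypothetical path, for illustration only
    torch_dtype=torch.float16,
)
pipe.to("cuda")  # keep weights on the GPU, i.e. no cpu/model offload while compiling

# freeze every parameter explicitly, including any that a no-grad/inference-mode
# context would not cover because the checkpoint carries params the pipeline does not expect
for module in (pipe.unet, pipe.vae, pipe.text_encoder, pipe.text_encoder_2):
    for p in module.parameters():
        p.requires_grad_(False)

with torch.no_grad():  # or torch.inference_mode()
    image = pipe(prompt="test", num_inference_steps=10).images[0]
```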

MysticDaedra commented 7 months ago

pip install triton returns two errors: Could not find a version that satisfies the requirement triton (from versions: none), and No matching distribution found for triton.

My understanding is that triton only works on Linux, and I'm using Windows 11 professional. Perhaps it's time to install WSL2? Looking into it, it seems a bunch of torch optimizations only work on linux as well, mainly due to triton.
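
A quick way to double-check what pip is reporting, i.e. whether triton is importable at all in the active venv:

```python
# quick sanity check for triton availability in the current environment
try:
    import triton
    print("triton available:", triton.__version__)
except ImportError:
    print("triton is not importable (expected on native Windows)")
```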

Here's the log when disabling medvram: sdnext.log

Note that this was also with Juggernaut. Here's a log with medvram re-enabled but Juggernaut loaded: sdnext.log

vladmandic commented 7 months ago

My understanding is that triton only works on Linux, and I'm using Windows 11 professional. Perhaps it's time to install WSL2? Looking into it, it seems a bunch of torch optimizations only work on linux as well, mainly due to triton.

true. i suggested triton and forgot for a sec that you're on windows. but yes, in general, i have had zero downsides with wsl2, it's my daily environment. the only issue is that you do need to be somewhat familiar with linux in general. not much, but still.

Here's the log when disabling medvram

aaa, finally something different :) but not that helpful, this is a generic error stating that something is wrong between torch and the gpu. i typically run into those problems if i update the device driver but don't reboot, and stuff like that.

Silanda commented 7 months ago

i just noticed that triton is not available - stable-fast works without triton in theory only; i never actually waited long enough for the compile to finish as it's incredibly slow without it.

FWIW, I use stable-fast in Windows without Triton. The initial compile is a bit slow, but it does work and offer a decent speed boost. However, I find it a bit awkward in general; it crashes if the output resolution is changed too much, it crashes with some (all?) Lora, etc.

On one occasion I ran into a similar problem as the OP, but unfortunately I can't for the life of me remember what it was that was causing the trouble. I'll reply again if I remember.

vladmandic commented 5 months ago

this one kinda fell through the cracks, what's the current status? regarding crashing when the output resolution or lora changes - well, that's a limitation of pretty much all actual compile methods - except that the expected behavior is that it recompiles a new model execution path given the changed parameters. but if your compile is not stable to start with, frequent recompiles are only going to make it worse.

sure, it should not crash on recompile. but anytime you need to change resolution or lora, compile is probably not the best option. this applies to tensorrt, torch compile, etc. - pretty much all of them.

MysticDaedra commented 5 months ago

It's been a long while since I worked with stable-fast; I want to return to it at some point, but I got too busy.

What would be nice is if the compile could be saved somehow and then if the correct models/loras/resolutions are detected, it just grabs that saved compile. I don't know if that's even possible, but it is something I've been thinking about.

vladmandic commented 4 months ago

What would be nice is if the compile could be saved somehow and then if the correct models/loras/resolutions are detected, it just grabs that saved compile.

some can. torch-trace results can. zluda does compile and saves the result. stable-fast cannot. even worse, it seems like stable-fast has been abandoned by its author. too bad, as it was really promising.
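
As an aside on the "torch-trace results can [be saved]" point, that refers to TorchScript tracing in general; a generic example (not SD.Next's actual persistence code) of saving and reloading a traced module:

```python
# Generic TorchScript example: a traced module can be serialized and reloaded later,
# which is the kind of "saved compile" being discussed; stable-fast has no equivalent.
import torch

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU()).eval()
example = torch.randn(1, 8)

traced = torch.jit.trace(model, example)   # record the execution path for this input shape
traced.save("traced_model.pt")             # persist the compiled graph to disk

reloaded = torch.jit.load("traced_model.pt")
print(reloaded(example).shape)             # torch.Size([1, 8])
```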

this is the very first sentence on the stable-fast repo:

Active development on stable-fast has been paused.

based on that alone, i cannot proceed much here.