nod-ai / SHARK

SHARK - High Performance Machine Learning Distribution
Apache License 2.0

[Feature Request]: Support SDXL Model + Refiner Pipeline #1713

Open NeedsMoar opened 1 year ago

NeedsMoar commented 1 year ago

Currently the only way to run this model on Windows, short of writing your own Python to use an ONNX version (no good ONNX UIs exist), is the fairly poor DirectML support in ComfyUI or one of the Automatic1111 branches that support it. The problem there is that neither manages GPU memory properly, so although the base model + refiner should be runnable within 24GB, most of the time they aren't. For example, setting up a Comfy pipeline to run the base model at 1024x1024, upscale the latents by, say, 1.5x, and then run the refiner on them OOMs unpredictably on about a quarter of runs. Aside from that, DirectML is slow. From scattered reports of the model's speed on CUDA, it isn't really any slower than you'd expect from the increase in size, and possibly less affected by it than expected.
SDXL support could get some people who don't want complicated configuration to switch over, at least for that use case.
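For reference, the requested flow is roughly what Hugging Face diffusers already exposes on CUDA, so something equivalent is what's being asked of SHARK. A minimal sketch of the base-to-refiner latent handoff (the model IDs and the 0.8 denoising split are illustrative, not a SHARK API):

```python
# Sketch of the requested SDXL base + refiner pipeline, using Hugging Face
# diffusers as the reference implementation. Parameters are illustrative.
import torch
from diffusers import (
    StableDiffusionXLPipeline,
    StableDiffusionXLImg2ImgPipeline,
)

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,  # share weights to save VRAM
    vae=base.vae,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a photo of a great white shark"
# The base model handles the first 80% of denoising and emits latents...
latents = base(
    prompt, height=1024, width=1024, denoising_end=0.8, output_type="latent"
).images
# ...and the refiner finishes the remaining 20% on those latents.
image = refiner(prompt, image=latents, denoising_start=0.8).images[0]
image.save("sdxl_refined.png")
```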

Since ROCm + HIP drivers have been released for Windows, and the other UIs are just waiting on MIOpen (an upstream requirement of Torch) to be ported before they reach feature parity with Linux (probably wider support, since tons of consumer GPUs are supported on ROCm for Windows), speed is about the only thing SHARK has going for it. If it keeps its current restrictions (tuned sizes only, no proper SD 2.1 768x768 support, a single LoRA with a fixed weight, and a recompile on every size / model / LoRA change; see the sketch below), there won't be much reason left to use it. I don't have any stake in that, but you might want to look into some of those things as well. Arbitrary sizes working at full speed (or at least the expected untuned speed) would be another fantastic item. For example, trying to generate a 16:9 image for later upscaling (768x432) drops iteration speed to about 1it/s, which is far slower than non-optimized DirectML. This isn't intuitive or hinted at anywhere in the UI, but it's always been like that. 768x512 untuned runs at around 10it/s; tuned is more like 13-14 depending on the sampler. 512x512 can hit up to 28it/s depending on the model, which is faster than the benchmarks I've seen for the module version of the A100.
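To make the recompile point concrete: judging by the filenames in the logs below, each compiled vmfb is keyed on batch size, max token length, resolution, precision, tuning flag, base model, and backend, so changing any one of those misses the cache and triggers a full recompile. A rough reconstruction of the naming scheme, inferred from the log filenames only (the real SHARK code may differ):

```python
# Rough reconstruction of the vmfb cache key visible in the logs below.
# Inferred from filenames only; the actual SHARK code may differ.
def vmfb_name(submodel, batch, max_len, height, width,
              precision, tuned, base_model, device):
    tuned_tag = "tuned_" if tuned else ""
    return (f"{submodel}_{batch}_{max_len}_{height}_{width}_"
            f"{precision}_{tuned_tag}{base_model}_{device}.vmfb")

print(vmfb_name("unet", 1, 64, 512, 512, "fp16", True,
                "stable-diffusion-2-1-base", "vulkan"))
# -> unet_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb
```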

stephen-dahl commented 1 year ago
Loading module D:\Programs\Shark Latest\clip_1_64_512_512_fp16_stable-diffusion-xl-base-1_vulkan.vmfb...
Downloading (…)ain/unet/config.json: 100%|████████████████████████████████████████| 1.68k/1.68k [00:00<?, ?B/s]
Downloading (…)ch_model.safetensors:  11%|████▍                                   | 1.14G/10.3G [00:18<05:25, 28.0MB/s]
Saved vmfb in D:\Programs\Shark Latest\unet_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb.
Downloading (…)ch_model.safetensors:  15%|█████▊                                  | 1.50G/10.3G [00:23<02:07, 68.8MB/s]
Loading module D:\Programs\Shark Latest\unet_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb...
50it [00:08,  5.79it/s]
Downloading (…)ch_model.safetensors:  21%|████████▍                               | 2.17G/10.3G [00:40<01:48, 74.3MB/s]
Downloading (…)ch_model.safetensors:  26%|██████████▌                             | 2.72G/10.3G [00:52<04:48, 26.2MB/s]
No vmfb found. Compiling and saving to D:\Programs\Shark Latest\vae_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb
Configuring for device:vulkan://00000000-0b00-0000-0000-000000000000
Using target triple -iree-vulkan-target-triple=rdna2-unknown-windows from command line args
Downloading (…)ch_model.safetensors:  31%|████████████▎                           | 3.16G/10.3G [01:01<01:30, 79.0MB/s]
Saved vmfb in D:\Programs\Shark Latest\vae_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb.
Downloading (…)ch_model.safetensors:  31%|████████████▎                           | 3.17G/10.3G [01:01<01:27, 81.0MB/s]
Loading module D:\Programs\Shark Latest\vae_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb...
::: Detailed report (took longer than 2.5s):
  +1.0023117065429688ms: get_iree_runtime_config
  +4.001617431640625ms: mmap D:\Programs\Shark Latest\vae_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb
  +4.001617431640625ms: ireert.SystemContext created
  +8807.002067565918ms: module initialized
Downloading (…)ch_model.safetensors: 100%|████████████████████████████████████████| 10.3G/10.3G [02:49<00:00, 60.7MB/s]
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
Traceback (most recent call last):
  File "gradio\routes.py", line 488, in run_predict
  File "gradio\blocks.py", line 1431, in process_api
  File "gradio\blocks.py", line 1123, in call_function
  File "gradio\utils.py", line 349, in async_iteration
  File "gradio\utils.py", line 342, in __anext__
  File "anyio\to_thread.py", line 33, in run_sync
  File "anyio\_backends\_asyncio.py", line 2101, in run_sync_in_worker_thread
  File "anyio\_backends\_asyncio.py", line 828, in run
  File "gradio\utils.py", line 325, in run_sync_iterator_async
  File "gradio\utils.py", line 694, in gen_wrapper
  File "ui\txt2img_ui.py", line 188, in txt2img_inf
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_txt2img.py", line 134, in generate_images
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_utils.py", line 235, in produce_img_latents
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_utils.py", line 114, in load_unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 855, in unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 850, in unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 63, in check_compilation
SystemExit: Could not compile Unet. Please create an issue with the detailed log at https://github.com/nod-ai/SHARK/issues
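For reference: "argument of type 'NoneType' is not iterable" is the TypeError Python raises when an `in` membership test runs against None, so presumably the base-model configuration lookup for the SDXL variant returned None before check_compilation ran. A minimal illustration of the error class (hypothetical names, not SHARK's actual code):

```python
# Minimal reproduction of the error class above: a membership test
# against a config lookup that returned None. Names are hypothetical.
config = None  # e.g. no entry matched "stable-diffusion-xl-base-1"
try:
    "unet" in config
except TypeError as e:
    print(e)  # argument of type 'NoneType' is not iterable
```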
stephen-dahl commented 1 year ago
Found device AMD Radeon RX 6700 XT. Using target triple rdna2-unknown-windows.
Tuned models are currently not supported for this setting.
Downloading (…)cheduler_config.json: 100%|████████████████████████████████████████| 479/479 [00:00<?, ?B/s]
No vmfb found. Compiling and saving to D:\Programs\Shark Latest\euler_scale_model_input_1_768_768_vulkan_fp16.vmfb
Configuring for device:vulkan://00000000-0b00-0000-0000-000000000000
Using target triple -iree-vulkan-target-triple=rdna2-unknown-windows from command line args
Saved vmfb in D:\Programs\Shark Latest\euler_scale_model_input_1_768_768_vulkan_fp16.vmfb.
Loading module D:\Programs\Shark Latest\euler_scale_model_input_1_768_768_vulkan_fp16.vmfb...
No vmfb found. Compiling and saving to D:\Programs\Shark Latest\euler_step_1_768_768_vulkan_fp16.vmfb
Configuring for device:vulkan://00000000-0b00-0000-0000-000000000000
Using target triple -iree-vulkan-target-triple=rdna2-unknown-windows from command line args
Saved vmfb in D:\Programs\Shark Latest\euler_step_1_768_768_vulkan_fp16.vmfb.
Loading module D:\Programs\Shark Latest\euler_step_1_768_768_vulkan_fp16.vmfb...
use_tuned? sharkify: False
_1_64_768_768_fp16_stable-diffusion-xl-refiner-1
Downloading (…)ain/unet/config.json: 100%|████████████████████████████████████████| 1.71k/1.71k [00:00<?, ?B/s]
Downloading (…)ch_model.safetensors: 100%|████████████████████████████████████████| 9.04G/9.04G [02:07<00:00, 70.9MB/s]
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
Traceback (most recent call last):
  File "gradio\routes.py", line 488, in run_predict
  File "gradio\blocks.py", line 1431, in process_api
  File "gradio\blocks.py", line 1123, in call_function
  File "gradio\utils.py", line 349, in async_iteration
  File "gradio\utils.py", line 342, in __anext__
  File "anyio\to_thread.py", line 33, in run_sync
  File "anyio\_backends\_asyncio.py", line 2101, in run_sync_in_worker_thread
  File "anyio\_backends\_asyncio.py", line 828, in run
  File "gradio\utils.py", line 325, in run_sync_iterator_async
  File "gradio\utils.py", line 694, in gen_wrapper
  File "ui\txt2img_ui.py", line 156, in txt2img_inf
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_utils.py", line 389, in from_pretrained
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_txt2img.py", line 51, in __init__
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_utils.py", line 82, in __init__
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_utils.py", line 114, in load_unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 855, in unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 850, in unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 63, in check_compilation
SystemExit: Could not compile Unet. Please create an issue with the detailed log at https://github.com/nod-ai/SHARK/issues
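Same failure with the refiner. If the cause is the None config lookup suggested above, a guard along these lines in check_compilation would surface a real error instead of retrying into SystemExit; this is a hypothetical sketch, not SHARK's actual code:

```python
# Hypothetical guard for the failure above: fail loudly when the base-model
# configuration lookup returns None instead of letting `in` raise TypeError.
from typing import Optional


def check_compilation(config: Optional[dict], model_name: str) -> None:
    if config is None:
        raise ValueError(
            f"No base model configuration matched {model_name!r}; "
            "SDXL base/refiner variants may not be mapped yet."
        )
    # ... existing compilation checks would continue here ...
```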
mergmann commented 1 year ago

there's not going to be much reason left to use it.

I think SHARK will still be of use. Even on Linux I have to use SHARK, since my CPU apparently does not support the PCIe atomics required for ROCm. SDXL support would be really nice, since I can't get Stable Diffusion to run with ONNX.

NeedsMoar commented 1 year ago

It's also still faster than everything else on AMD; as far as I can tell, the only reason anybody is passing it up is that it lacks the level of functionality other UIs have. I started messing around with the QR-code-generator ControlNets, putting up with the lower speed of Comfy (on Windows, where it's WAY lower), and got hooked on being able to plug in arbitrary sampler chains, use multiple LoRAs, and get LoRAs I thought were broken to work, because the weight is actually adjustable, etc.
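For context, the adjustable LoRA weight those UIs expose is just a runtime scale in diffusers-based pipelines; a minimal sketch (the model ID and LoRA path are placeholders):

```python
# Sketch of the adjustable LoRA weight other UIs expose, via Hugging Face
# diffusers. The model ID and LoRA path are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/lora.safetensors")  # placeholder path
# The LoRA contribution is scaled per call rather than baked in at compile
# time, which is what makes per-image weight tweaking possible.
image = pipe(
    "a mural hiding a qr code", cross_attention_kwargs={"scale": 0.7}
).images[0]
```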

Off topic, but just FYI: the atomics thing is a PCIe 3.0 feature that has been around for nearly 10 years, so if you have that, it should be incredibly rare for it to be missing: https://rocm.docs.amd.com/en/latest/understand/More-about-how-ROCm-uses-PCIe-Atomics.html

From the Linux kernel patch (2015) linked there: "We've been testing this prior to upstreaming the client code and ran into a problem. When the client driver (amdgpu) is running within a virtual machine on the physical PCI function (not SR-IOV), the hypervisor virtualizes the PCI configuration space and blocks writes to DEVCTL2.ATOMICOP_REQUESTER_ENABLE."

That's an old thread, so you might need to dig around to find a way to enable this unconditionally if you have virtualization turned on (modern Linux runs its hypervisor layer when that functionality is enabled), enable SR-IOV in the BIOS if it isn't on, or just turn virtualization off if you don't use it for anything, assuming Linux handles switching between those BIOS settings without screwing anything up. I haven't tried any of this, so I'd search around first. :D
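If you want to check what your link actually advertises before flipping BIOS switches, lspci exposes the AtomicOps capability and control bits (needs root and a reasonably recent pciutils); a quick illustrative helper:

```python
# Quick check of the PCIe AtomicOps bits via `lspci -vv` (needs root and a
# recent pciutils). Purely illustrative; grep works just as well.
import subprocess

out = subprocess.run(
    ["lspci", "-vv"], capture_output=True, text=True, check=True
).stdout
for line in out.splitlines():
    if "AtomicOpsCap" in line or "AtomicOpsCtl" in line:
        print(line.strip())
```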

cccyberwolke commented 10 months ago

Is there an overview of what needs to be done to properly support SDXL?

silvia95guy commented 10 months ago

Is there an overview of what needs to be done to properly support SDXL?

Yes, I've been looking for a way to do it as well. After messing around with other SD platforms, SHARK is really all I use, because of the speed.