NeedsMoar opened 1 year ago
```
Loading module D:\Programs\Shark Latest\clip_1_64_512_512_fp16_stable-diffusion-xl-base-1_vulkan.vmfb...
Downloading (…)ain/unet/config.json: 100%|████████████████████████████████████████████████| 1.68k/1.68k [00:00<?, ?B/s]
Downloading (…)ch_model.safetensors: 11%|████▍ | 1.14G/10.3G [00:18<05:25, 28.0MB/s]Saved vmfb in D:\Programs\Shark Latest\unet_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb.
Downloading (…)ch_model.safetensors: 15%|█████▊ | 1.50G/10.3G [00:23<02:07, 68.8MB/s]Loading module D:\Programs\Shark Latest\unet_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb...
50it [00:08, 5.79it/s].safetensors: 21%|████████▍ | 2.17G/10.3G [00:40<01:48, 74.3MB/s]
Downloading (…)ch_model.safetensors: 26%|██████████▌ | 2.72G/10.3G [00:52<04:48, 26.2MB/s]No vmfb found. Compiling and saving to D:\Programs\Shark Latest\vae_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb
Configuring for device:vulkan://00000000-0b00-0000-0000-000000000000
Using target triple -iree-vulkan-target-triple=rdna2-unknown-windows from command line args
Downloading (…)ch_model.safetensors: 31%|████████████▎ | 3.16G/10.3G [01:01<01:30, 79.0MB/s]Saved vmfb in D:\Programs\Shark Latest\vae_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb.
Downloading (…)ch_model.safetensors: 31%|████████████▎ | 3.17G/10.3G [01:01<01:27, 81.0MB/s]Loading module D:\Programs\Shark Latest\vae_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb...
::: Detailed report (took longer than 2.5s):
  +1.0023117065429688ms: get_iree_runtime_config
  +4.001617431640625ms: mmap D:\Programs\Shark Latest\vae_1_64_512_512_fp16_tuned_stable-diffusion-2-1-base_vulkan.vmfb
  +4.001617431640625ms: ireert.SystemContext created
  +8807.002067565918ms: module initialized
Downloading (…)ch_model.safetensors: 100%|████████████████████████████████████████| 10.3G/10.3G [02:49<00:00, 60.7MB/s]
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
Traceback (most recent call last):
  File "gradio\routes.py", line 488, in run_predict
  File "gradio\blocks.py", line 1431, in process_api
  File "gradio\blocks.py", line 1123, in call_function
  File "gradio\utils.py", line 349, in async_iteration
  File "gradio\utils.py", line 342, in __anext__
  File "anyio\to_thread.py", line 33, in run_sync
  File "anyio\_backends\_asyncio.py", line 2101, in run_sync_in_worker_thread
  File "anyio\_backends\_asyncio.py", line 828, in run
  File "gradio\utils.py", line 325, in run_sync_iterator_async
  File "gradio\utils.py", line 694, in gen_wrapper
  File "ui\txt2img_ui.py", line 188, in txt2img_inf
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_txt2img.py", line 134, in generate_images
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_utils.py", line 235, in produce_img_latents
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_utils.py", line 114, in load_unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 855, in unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 850, in unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 63, in check_compilation
SystemExit: Could not compile Unet. Please create an issue with the detailed log at https://github.com/nod-ai/SHARK/issues
```
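For context on the repeated `argument of type 'NoneType' is not iterable` line: Python raises that exact `TypeError` when an `in` membership test is run against `None`, e.g. when a config lookup for an unrecognized base model returns nothing. The sketch below is hypothetical illustration of that failure mode (the function and dict names are made up, not SHARK's actual `model_wrappers.py` code):

```python
# Hypothetical sketch of how "argument of type 'NoneType' is not iterable"
# arises: an `in` test against a lookup result that is None.

def find_base_model(model_id, configs):
    # configs.get(...) returns None when model_id is unknown
    config = configs.get(model_id)
    # BUG: if config is None, `"unet" in config` raises
    # TypeError: argument of type 'NoneType' is not iterable
    return "unet" in config

def find_base_model_safe(model_id, configs):
    config = configs.get(model_id)
    # Guard against the missing-config case instead of retrying blindly
    if config is None:
        return False
    return "unet" in config

configs = {"stable-diffusion-2-1-base": {"unet": {}}}
print(find_base_model_safe("stable-diffusion-xl-base-1", configs))  # False
```

This would explain why the "Retrying with a different base model configuration" loop never succeeds here: if no known base-model config matches the SDXL checkpoint, every retry hits the same `None`.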
```
Found device AMD Radeon RX 6700 XT. Using target triple rdna2-unknown-windows.
Tuned models are currently not supported for this setting.
Downloading (…)cheduler_config.json: 100%|█████████████████████████████████████████████████| 479/479 [00:00<?, ?B/s]
No vmfb found. Compiling and saving to D:\Programs\Shark Latest\euler_scale_model_input_1_768_768_vulkan_fp16.vmfb
Configuring for device:vulkan://00000000-0b00-0000-0000-000000000000
Using target triple -iree-vulkan-target-triple=rdna2-unknown-windows from command line args
Saved vmfb in D:\Programs\Shark Latest\euler_scale_model_input_1_768_768_vulkan_fp16.vmfb.
Loading module D:\Programs\Shark Latest\euler_scale_model_input_1_768_768_vulkan_fp16.vmfb...
No vmfb found. Compiling and saving to D:\Programs\Shark Latest\euler_step_1_768_768_vulkan_fp16.vmfb
Configuring for device:vulkan://00000000-0b00-0000-0000-000000000000
Using target triple -iree-vulkan-target-triple=rdna2-unknown-windows from command line args
Saved vmfb in D:\Programs\Shark Latest\euler_step_1_768_768_vulkan_fp16.vmfb.
Loading module D:\Programs\Shark Latest\euler_step_1_768_768_vulkan_fp16.vmfb...
use_tuned? sharkify: False
_1_64_768_768_fp16_stable-diffusion-xl-refiner-1
Downloading (…)ain/unet/config.json: 100%|█████████████████████████████████████████| 1.71k/1.71k [00:00<?, ?B/s]
Downloading (…)ch_model.safetensors: 100%|█████████████████████████████████████████| 9.04G/9.04G [02:07<00:00, 70.9MB/s]
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
argument of type 'NoneType' is not iterable
Retrying with a different base model configuration
Traceback (most recent call last):
  File "gradio\routes.py", line 488, in run_predict
  File "gradio\blocks.py", line 1431, in process_api
  File "gradio\blocks.py", line 1123, in call_function
  File "gradio\utils.py", line 349, in async_iteration
  File "gradio\utils.py", line 342, in __anext__
  File "anyio\to_thread.py", line 33, in run_sync
  File "anyio\_backends\_asyncio.py", line 2101, in run_sync_in_worker_thread
  File "anyio\_backends\_asyncio.py", line 828, in run
  File "gradio\utils.py", line 325, in run_sync_iterator_async
  File "gradio\utils.py", line 694, in gen_wrapper
  File "ui\txt2img_ui.py", line 156, in txt2img_inf
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_utils.py", line 389, in from_pretrained
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_txt2img.py", line 51, in __init__
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_utils.py", line 82, in __init__
  File "apps\stable_diffusion\src\pipelines\pipeline_shark_stable_diffusion_utils.py", line 114, in load_unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 855, in unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 850, in unet
  File "apps\stable_diffusion\src\models\model_wrappers.py", line 63, in check_compilation
SystemExit: Could not compile Unet. Please create an issue with the detailed log at https://github.com/nod-ai/SHARK/issues
```
> there's not going to be much reason left to use it.
I think SHARK will still be of use. Even on Linux I have to use SHARK, since my CPU apparently doesn't support the PCIe atomics required for ROCm. SDXL support would be really nice, since I can't get Stable Diffusion to run with ONNX.
It's also still faster than everything else on AMD; AFAICT the only reason anybody passes it up is that it lacks the functionality other UIs have. I started messing around with the QR-code generator ControlNets, putting up with the lower speed of Comfy (on Windows, where the speed is far lower), and got hooked on being able to plug in arbitrary sampler chains, use multiple LoRAs, get LoRAs I thought were broken to work because the weight is actually adjustable, etc.
Off topic, but just FYI: the atomics requirement is a PCIe 3.0 feature that has been around for nearly 10 years, so if you have PCIe 3.0 it should be incredibly rare for atomics to be missing: https://rocm.docs.amd.com/en/latest/understand/More-about-how-ROCm-uses-PCIe-Atomics.html
From the Linux kernel patch (2015) linked there: "We've been testing this prior to upstreaming the client code and ran into a problem. When the client driver (amdgpu) is running within a virtual machine on the physical PCI function (not SR-IOV), the hypervisor virtualizes the PCI configuration space and blocks writes to DEVCTL2.ATOMICOP_REQUESTER_ENABLE."
That's an old thread, so you might need to dig around for a way to enable this unconditionally if you have virtualization turned on (pretty sure modern Linux runs its hypervisor if the functionality is enabled), enable SR-IOV in the BIOS if it isn't on, or just turn virtualization off if you don't use it for anything, assuming Linux handles switching between those BIOS settings without screwing anything up. I haven't tried it, so I'd Google it first. :D
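If you want to see whether the PCIe link actually advertises atomics, `sudo lspci -vvv` prints an `AtomicOpsCap` field in the device's DevCap2 block. Below is a small sketch of parsing that field; the sample text is illustrative (not from my machine), and on a real box you'd feed in the output of `lspci -vvv -s <gpu-bdf>` instead:

```python
import re

# Parse the AtomicOpsCap field from `lspci -vvv` output to see which
# PCIe atomic operation widths the device advertises. The sample below
# is illustrative; a real check would read lspci output for your GPU.
sample_lspci = """
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ LTR+
         AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
"""

match = re.search(r"AtomicOpsCap:\s*(.+)", sample_lspci)
# Each flag ends in '+' (supported) or '-' (not supported)
caps = {flag[:-1]: flag[-1] == "+" for flag in match.group(1).split()}
print(caps)  # {'32bit': True, '64bit': True, '128bitCAS': False}
```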
Is there an overview of what needs to be done to properly support sdxl?
> Is there an overview of what needs to be done to properly support sdxl?
Yes, I've been looking for a way to do it as well. After messing around with other SD platforms, SHARK is really all I use because of the speed.
Currently the only way to run this model on Windows, short of writing your own Python to use an ONNX version (no good ONNX UIs exist), is via the fairly poor DirectML support in ComfyUI or one of the Automatic1111 branches that supports it. The problem is that neither can manage GPU memory properly, so although the base model + refiner should be runnable within 24GB, they usually aren't. For example, a Comfy pipeline that runs the base model at 1024x1024, upscales the latents by, say, 1.5x, then runs the refiner on them unpredictably OOMs on about 1/4 of runs. Aside from that, DirectML is slow. From the scattered reports I've seen of the model's speed on CUDA, it isn't really any slower than you'd expect from the increase in size, and possibly less affected by that than expected.
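As a rough sanity check on the "should fit in 24GB" claim, the checkpoint sizes from the logs earlier in this issue (10.3G for the base UNet, 9.04G for the refiner UNet) can be summed; this ignores the VAE, text encoders, and activation memory, so it's only a lower bound on what a naive keep-both-loaded setup would need:

```python
# Back-of-envelope VRAM estimate using the checkpoint sizes from the
# download logs above. Activations, VAE, and text encoders are ignored,
# so this is a lower bound, not a real memory profile.
base_unet_gb = 10.3     # from the 10.3G download in the first log
refiner_unet_gb = 9.04  # from the 9.04G download in the second log
vram_gb = 24

total_gb = base_unet_gb + refiner_unet_gb
print(f"weights alone: {total_gb:.2f} GB, headroom: {vram_gb - total_gb:.2f} GB")
```

The weights alone come to about 19.3 GB, leaving under 5 GB of headroom, which is consistent with both "should fit" and "OOMs when memory isn't managed carefully".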
It could get some people who don't want a complicated setup to switch over, at least for that.
Since ROCm + HIP drivers have been released for Windows, and the other UIs are just waiting on MIOpen (an upstream requirement of Torch) to port over and reach feature parity with Linux (probably wider support, since tons of consumer GPUs are supported on ROCm for Windows), if SHARK has nothing going for it but speed and keeps its current restrictions (tuned sizes only, no proper SD 2.1 768x768 support, a single LoRA with fixed weight, and a recompile on every size / model / LoRA change), there's not going to be much reason left to use it. I don't have any stake in that, but you might want to look into some of those things as well. Arbitrary sizes working at full speed (or at least the expected untuned speed) would be another fantastic item. For example, trying to generate a 16:9 image for later upscale (768x432) drops iteration speed to about 1 it/s, which is far slower than non-optimized DirectML. This isn't intuitive or hinted at anywhere in the UI, but it's always been like that. 768x512 untuned runs at around 10 it/s; tuned is more like 13-14 depending on the sampler. 512x512 can hit up to 28 it/s depending on the model, which is faster than the benchmarks I've seen for the module version of the A100.
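To put those it/s figures in perspective, here's a quick conversion to wall-clock time per image at the 50 steps shown in the log earlier in the thread (the speeds are the ones quoted in this comment, not new measurements):

```python
# Convert the iteration speeds quoted above into seconds per 50-step image.
# 50 steps matches the "50it" run in the log at the top of the issue.
steps = 50
speeds_its = {
    "768x432 untuned": 1.0,
    "768x512 untuned": 10.0,
    "768x512 tuned": 13.0,
    "512x512 best case": 28.0,
}

for name, its in speeds_its.items():
    print(f"{name}: {steps / its:.1f} s per image")
```

The 768x432 case takes roughly 50 s per image versus about 2 s at the 512x512 best case, which is why the off-tuned-size slowdown is such a big deal in practice.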