nod-ai / SHARK-Studio

SHARK Studio -- Web UI for SHARK+IREE High Performance Machine Learning Distribution
Apache License 2.0

[SD] (RDNA3) SD1.4 768x768 generates noise output. #1508

Open monorimet opened 1 year ago

monorimet commented 1 year ago

Here is the terminal output for this run on W7900:

(shark.venv) PS SHARK\apps\stable_diffusion\web> python index.py --clear_all --share --ui=web
shark_tank local cache is located at Users\ean\.local/shark_tank/ . You may change this by setting the --local_tank_cache= flag
CLEARING ALL, EXPECT SEVERAL MINUTES TO RECOMPILE
vulkan devices are available.
cuda devices are available.

Running on local URL:  http://0.0.0.0:8080
Running on public URL: https://6e423d2611e49dfaa5.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Found device AMD Radeon PRO W7900. Using target triple rdna3-w7900-windows.
Tuned models are currently not supported for this setting.
No vmfb found. Compiling and saving to SHARK\apps\stable_diffusion\web\euler_scale_model_input_1_768_768_vulkan_fp16.vmfb
Configuring for device:vulkan://00000000-6300-0000-0000-000000000000
Using target triple -iree-vulkan-target-triple=rdna3-w7900-windows from command line args
Saved vmfb in SHARK\apps\stable_diffusion\web\euler_scale_model_input_1_768_768_vulkan_fp16.vmfb.
No vmfb found. Compiling and saving to SHARK\apps\stable_diffusion\web\euler_step_1_768_768_vulkan_fp16.vmfb
Configuring for device:vulkan://00000000-6300-0000-0000-000000000000
Using target triple -iree-vulkan-target-triple=rdna3-w7900-windows from command line args
Saved vmfb in SHARK\apps\stable_diffusion\web\euler_step_1_768_768_vulkan_fp16.vmfb.
use_tuned? sharkify: False
_1_64_768_768_fp16_stable-diffusion-v1-4
No vmfb found. Compiling and saving to SHARK\apps\stable_diffusion\web\clip_1_64_768_768_fp16_stable-diffusion-v1-4_vulkan.vmfb
Configuring for device:vulkan://00000000-6300-0000-0000-000000000000
Using target triple -iree-vulkan-target-triple=rdna3-w7900-windows from command line args
Saved vmfb in SHARK\apps\stable_diffusion\web\clip_1_64_768_768_fp16_stable-diffusion-v1-4_vulkan.vmfb.
Downloading (…)ch_model.safetensors: 100%|██████| 3.44G/3.44G [00:31<00:00, 108MB/s]
mat1 and mat2 shapes cannot be multiplied (128x1024 and 768x320)
Retrying with a different base model configuration
No vmfb found. Compiling and saving to SHARK\apps\stable_diffusion\web\unet_1_64_768_768_fp16_stable-diffusion-v1-4_vulkan.vmfb
Configuring for device:vulkan://00000000-6300-0000-0000-000000000000
Using target triple -iree-vulkan-target-triple=rdna3-w7900-windows from command line args
Saved vmfb in SHARK\apps\stable_diffusion\web\unet_1_64_768_768_fp16_stable-diffusion-v1-4_vulkan.vmfb.
Downloading (…)main/vae/config.json: 100%|█████████████| 551/551 [00:00<?, ?B/s]
Downloading (…)ch_model.safetensors: 100%|█████████████| 335M/335M [00:03<00:00, 110MB/s]
No vmfb found. Compiling and saving to SHARK\apps\stable_diffusion\web\vae_1_64_768_768_fp16_stable-diffusion-v1-4_vulkan.vmfb
Configuring for device:vulkan://00000000-6300-0000-0000-000000000000
Using target triple -iree-vulkan-target-triple=rdna3-w7900-windows from command line args
Saved vmfb in SHARK\apps\stable_diffusion\web\vae_1_64_768_768_fp16_stable-diffusion-v1-4_vulkan.vmfb.
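The `mat1 and mat2 shapes cannot be multiplied (128x1024 and 768x320)` line in the log above is the usual symptom of a 1024-wide text-embedding matrix (an SD2-style encoder output) hitting a projection weight that expects 768 features (SD 1.x-style cross-attention); SHARK then retries with a different base model configuration. A rough sketch of the mismatch (illustrative only, not SHARK's code):

```python
import numpy as np

# (batch * tokens, hidden) text embeddings vs. an SD1.x-style projection
# weight: matmul needs the inner dimensions to agree.
text_embeds = np.zeros((128, 1024))  # 1024-wide encoder output (SD2-style)
sd1_proj = np.zeros((768, 320))      # weight expecting 768-wide context

def can_matmul(a, b):
    """Matrix multiplication requires a.shape[1] == b.shape[0]."""
    return a.shape[1] == b.shape[0]

print(can_matmul(text_embeds, sd1_proj))           # False -> the error above
print(can_matmul(np.zeros((128, 768)), sd1_proj))  # True with a 768-d encoder
```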

And a screenshot of the webui with the generated noise image:

7900noise768x2

NeedsMoar commented 1 year ago

I get the same thing on a 7900 XTX. I'm pretty sure it was happening with other models as well. If you're seeing it on a W7900, I guess it has nothing to do with VRAM (I didn't really think that was it, but still).

NeedsMoar commented 1 year ago

From what I've read, some of the VAEs these models include are bugged in FP16 mode. I've found a few that will produce random all-black results. AFAICT most of the anime-specific ones are broken and end up producing terribly low-contrast images that aren't recoverable, since SHARK outputs 8-bpc PNG.
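A minimal sketch of catching the two failure modes described above (all-black and low-contrast frames) before they get flattened into an 8-bpc PNG; the thresholds and function name are made up for illustration:

```python
import numpy as np

def looks_broken(img, black_thresh=2, contrast_thresh=16):
    """img: HxWx3 uint8 array. Returns a reason string, or None if it looks OK."""
    if img.max() <= black_thresh:
        return "all-black output"
    if int(img.max()) - int(img.min()) < contrast_thresh:
        return "very low contrast output"
    return None

black = np.zeros((64, 64, 3), dtype=np.uint8)       # simulated black frame
flat = np.full((64, 64, 3), 120, dtype=np.uint8)    # simulated flat frame
print(looks_broken(black))  # all-black output
print(looks_broken(flat))   # very low contrast output
```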

It happens at other resolutions too. I think I was trying to run something between 512 and 768 on one side and 768 on the long side. When you do that, performance absolutely tanks (non-tuned files aren't too bad anymore, but I'm talking a couple of seconds per iteration) and you get stretched noise. 640x640 will do it, too. Nearly every other way of running these models handles fairly arbitrary dimensions as long as they're divisible by 8.

When I played with Comfy's image-composition example, it ran SD 1.4-based checkpoints on a starting latent at 1280x708, which initializes a background into which three characters (also partly generated latents at 256x512) are placed, so that they all progress toward the prompts they were given. Some more iterations are run on the combined version, then the latents themselves are upscaled to a target final resolution of 1920x1080 and the last 9 iterations of the checkpoint are run on those latents and sent through a normal VAE. It takes a long time, both because DirectML is much slower to begin with at a given resolution and because it uses all 24 GB of VRAM, at least partly due to an outstanding DirectML bug on AMD that makes it non-trivial to track in-use memory and alloc/dealloc properly. But it's proof that there's nothing model-side preventing this from working.
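The divisible-by-8 constraint mentioned above comes from the 8x downscale into latent space; a quick sketch (the function is illustrative, not SHARK's API):

```python
def latent_shape(width, height, factor=8):
    """Pixel dimensions map to latents at 1/8 scale, so each side
    must be divisible by the downscale factor."""
    if width % factor or height % factor:
        raise ValueError(f"{width}x{height}: sides must be divisible by {factor}")
    return width // factor, height // factor

print(latent_shape(768, 768))    # (96, 96)
print(latent_shape(640, 640))    # (80, 80)
print(latent_shape(1920, 1080))  # (240, 135)
```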

AFAIK the tiled-generation plugin bundled with something like AUTOMATIC1111 exists so people with low VRAM can still generate large images, not because large latents couldn't be processed as-is. I think diffusers handles the tiling itself, but on the card, so it's much faster if you have enough memory to hold everything at once.
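The tiling idea above can be sketched as splitting one axis into overlapping tiles so peak memory is bounded by the tile size rather than the full image (a toy illustration, not the A1111 plugin's or diffusers' actual implementation):

```python
def tile_ranges(size, tile, overlap):
    """Start/stop pairs covering [0, size) where consecutive tiles
    overlap by `overlap` pixels (overlaps get blended on decode)."""
    step = tile - overlap
    starts = range(0, max(size - overlap, 1), step)
    return [(s, min(s + tile, size)) for s in starts]

# A 1024-wide axis decoded in 512-wide tiles with a 64px overlap:
print(tile_ranges(1024, 512, 64))  # [(0, 512), (448, 960), (896, 1024)]
```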

Anyway, if I try to use a .safetensors for one of the rare models trained from stable-diffusion-2, or just the base 768-v-ema.ckpt, as a checkpoint at 768x768, something goes wrong when SHARK determines which CLIP encoder it should pull, and the process errors out with:

Some weights of the model checkpoint at laion/CLIP-ViT-bigG-14-laion2B-39B-b160k were not used when initializing CLIPTextModelWithProjection {huge list of layers snipped}, then RuntimeError: Error(s) in loading state_dict for CLIPTextModelWithProjection: {huge list of missing keys snipped}, and finally a giant mix of size mismatches (snipped) that all center on the model wanting 1280 as a dimension while the CLIP model uses 1024...
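The 1280-vs-1024 mismatch above lines up with the hidden sizes of the text encoders the different SD families ship: pulling the bigG encoder (1280-wide) for an SD2-based checkpoint (1024-wide context) can't load. A sketch of that lookup (the widths come from the public model configs; the selection function is hypothetical):

```python
# Hidden sizes of the text encoders used by each SD family.
TEXT_ENCODER_WIDTH = {
    "openai/clip-vit-large-patch14": 768,              # SD 1.x
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K": 1024,     # SD 2.x
    "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k": 1280,  # SDXL's second encoder
}

def encoder_fits(encoder_id, unet_context_dim):
    """True only if the encoder's output width matches the UNet's
    cross-attention context dimension."""
    return TEXT_ENCODER_WIDTH.get(encoder_id) == unet_context_dim

# An SD2-based UNet wants 1024-d context:
print(encoder_fits("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k", 1024))  # False
print(encoder_fits("laion/CLIP-ViT-H-14-laion2B-s32B-b79K", 1024))     # True
```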

Oddly, when loading the model straight from Hugging Face that doesn't occur, but running the untuned model I get noise that looks suspiciously like there's some image buried in it. From what I could tell this is due to the original stable-diffusion-2 VAE being fp32-only. Supposedly those need to run in fp32 to avoid black images in normal UIs (CUDA, primarily), so there's a chance that whatever causes black images there causes noise in SHARK. I don't know. I'm going to experiment with some different VAEs and see if I can find one that works at higher resolutions. There's an fp16 VAE for 2.0 and 2.1 now, and the SDXL version was released.
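One plausible mechanism for the "fp32-only VAE" behavior above (an illustration of how fp16 can break a model, not a diagnosis of this exact bug): float16 overflows to inf above roughly 65504, and a single inf/NaN inside the decoder cascades into a garbage image.

```python
import numpy as np

# A decoder activation that is fine in float32 but overflows in float16:
big_activation = np.float32(70000.0)
print(np.float16(big_activation))        # inf (float16 max is ~65504)
print(np.float16(big_activation) * 0.0)  # nan -> propagates through the image
```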