monorimet opened 1 year ago
I get the same thing on a 7900 XTX. I'm pretty sure it was happening on other models as well. If you're running on a W7900, I guess it has nothing to do with VRAM (I didn't really think that was it, but still).
From what I've read, some of the VAEs bundled with models are bugged in FP16 mode. I've found a few that will produce random all-black results. AFAICT most of the anime-specific ones are messed up and end up producing terribly low-contrast images that aren't recoverable, since SHARK outputs 8-bpc PNG.
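For what it's worth, a quick way to check whether a given VAE is fp16-broken is to decode some latents and look for NaN/Inf, which is typically what ends up clamped to black when writing an 8-bpc PNG. A minimal sketch with diffusers (the model ID is just a public example, not what SHARK actually loads):

```python
# Probe a VAE for fp16 overflow; NaN/Inf in the decoded image is what
# usually gets rendered as an all-black result.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16  # example VAE
).to("cuda")

# Random stand-in latents for a 768x768 image (latents are 1/8 scale, 4ch).
latents = torch.randn(1, 4, 96, 96, dtype=torch.float16, device="cuda")

with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample

print("NaNs:", torch.isnan(image).any().item(),
      "Infs:", torch.isinf(image).any().item())
```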
It'll happen at other resolutions too. I think I was attempting to run something between 512 and 768 on one side and 768 on the long side. When you do that, performance absolutely tanks (non-tuned files aren't too bad anymore, but I'm talking a couple of seconds per iteration) and you get stretched noise. 640x640 will do it, too. Nearly every other way of running these models can handle somewhat arbitrary dimensions as long as they're divisible by 8.

When I messed with Comfy's image composition example, it ran SD-1.4-based checkpoints on a starting latent at 1280x708: that initializes a background, into which three characters (also partly generated latents @ 256x512) are placed so they all progress toward the prompts they were given. Some more iterations are run on the combined version, then the latents themselves are upscaled to target a 1920x1080 final resolution, the last 9 iterations of the checkpoint are run on those latents, and the result is sent through a normal VAE (roughly the flow in the sketch below). It takes a long time, both because DirectML is much slower to begin with at a given resolution and because it uses all 24 GB of VRAM, at least partly due to an outstanding DirectML bug on AMD that makes it non-trivial to track in-use memory and alloc/dealloc properly, but it's proof that there's nothing model-side preventing this from working.
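To make that flow concrete, here's a rough sketch of the latent bookkeeping in plain torch, assuming SD-1.x latents (4 channels at 1/8 pixel resolution). The placement coordinates and the rounding of 708 down to 704 are my assumptions, and the actual denoising steps are elided:

```python
# Sketch of the latent-composition flow: background latent, pasted
# character latents, then a latent upscale toward the final resolution.
import torch
import torch.nn.functional as F

def px_to_latent(w, h):
    assert w % 8 == 0 and h % 8 == 0, "SD latents need dims divisible by 8"
    return w // 8, h // 8

# 1280x704 background latent (708 isn't divisible by 8, so I round down).
bw, bh = px_to_latent(1280, 704)
background = torch.randn(1, 4, bh, bw)

# A partially denoised 256x512 character latent pasted onto the background
# (top-left placement here just for brevity).
cw, ch = px_to_latent(256, 512)
character = torch.randn(1, 4, ch, cw)
background[:, :, 0:ch, 0:cw] = character

# Upscale the combined latent toward the 1920x1080 target, then the last
# few denoising iterations would run on `upscaled` before VAE decode.
uw, uh = px_to_latent(1920, 1080)
upscaled = F.interpolate(background, size=(uh, uw), mode="bilinear")
print(upscaled.shape)  # torch.Size([1, 4, 135, 240])
```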
AFAIK in something like automatic1111, the included tiled-generation plugin exists so people with low VRAM can still generate large images, not because large latents couldn't be processed as-is. I think diffusers just handles the tiling itself, but on the card, so it'll be much faster if you have enough memory to hold everything at once.
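For comparison, diffusers exposes this directly on its pipelines; a minimal sketch (the model ID is just an example):

```python
# Enable tiled VAE decode so large latents fit in limited VRAM; the
# overlapping tiles add some decode overhead but cap peak memory.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_vae_tiling()

image = pipe("a test prompt", width=1280, height=704).images[0]
```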
Anyway, if I try to use a .safetensors for one of the rare models trained starting from stable-diffusion-2, or just the base 768-v-ema.ckpt as a checkpoint @ 768x768, something goes wrong when SHARK determines which CLIP encoder it should pull, and the process errors out with:
```
Some weights of the model checkpoint at laion/CLIP-ViT-bigG-14-laion2B-39B-b160k were not used when initializing CLIPTextModelWithProjection
{huge list of layers snipped}
RuntimeError: Error(s) in loading state_dict for CLIPTextModelWithProjection:
{huge list of missing keys snipped}
```

followed by a giant mix of size mismatches (snipped) that all center on the model wanting 1280 as a dimension while the CLIP model uses 1024...
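Those numbers line up with the text encoders of different SD generations: SD-2.x uses OpenCLIP ViT-H (hidden size 1024), while SDXL's second encoder is OpenCLIP bigG (hidden size 1280), which is exactly the laion checkpoint named in the error. A quick check with transformers (the repo IDs are the public HF ones, not whatever SHARK resolves internally):

```python
# Compare the hidden sizes of the SD-2 and SDXL text encoders to show
# where the 1024-vs-1280 mismatch comes from.
from transformers import CLIPTextModel, CLIPTextModelWithProjection

sd2 = CLIPTextModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="text_encoder"
)
sdxl = CLIPTextModelWithProjection.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder_2"
)
print(sd2.config.hidden_size)   # 1024 (OpenCLIP ViT-H)
print(sdxl.config.hidden_size)  # 1280 (OpenCLIP bigG)
```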
Now, oddly, that doesn't occur when loading the model straight from Hugging Face, but running the untuned model I get noise that looks suspiciously like there's an image buried in it. From what I could tell, this is due to the original stable-diffusion-2 VAE being fp32-only. Supposedly it needs to run in fp32 to avoid black images in normal UIs (CUDA, primarily), so there's a chance whatever causes black images there is causing noise in SHARK. I don't know. I'm going to experiment with some different VAEs and see if I can find one that works at higher resolutions. There's an fp16 VAE for 2.0 and 2.1 now, and the SDXL version was released.
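For reference, swapping VAEs in diffusers is just a matter of passing a replacement into the pipeline; a minimal sketch, assuming the replacement VAE is architecture-compatible with the checkpoint (model IDs are public examples):

```python
# Swap in a replacement VAE to test whether the stock SD-2 VAE is the
# source of the fp16 noise/black-image behavior.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16  # fp16-tolerant retrain
)
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", vae=vae, torch_dtype=torch.float16
).to("cuda")

image = pipe("a test prompt", width=768, height=768).images[0]
```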
Here is the terminal output for this run on W7900:
And a screenshot of the webui with the generated noise image: