huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Come on, come on, let's adapt the conversion script to SD 2.0 #1388

Closed piEsposito closed 1 year ago

piEsposito commented 1 year ago

Is your feature request related to a problem? Please describe.
It would be great if we could run SD 2 with cpu_offload, attention slicing, xformers, etc...

Describe the solution you'd like
Adapt the conversion script to SD 2.0

Describe alternatives you've considered
Stability AI's repo is not as flexible.
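
For reference, the cpu_offload / attention-slicing / xformers switches mentioned in the request are exposed on diffusers pipelines roughly as sketched below. This is illustrative only: the model id is an assumption, it presumes a diffusers version that can already load SD 2, and enable_sequential_cpu_offload additionally requires accelerate.

import torch
from diffusers import StableDiffusionPipeline

# Illustrative model id; assumes a diffusers release that supports SD 2.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

pipe.enable_attention_slicing()                    # lower VRAM usage at a small speed cost
pipe.enable_xformers_memory_efficient_attention()  # requires xformers to be installed
# pipe.enable_sequential_cpu_offload()             # requires accelerate; skip .to("cuda") when using this

image = pipe("a photo of an astronaut riding a horse on mars").images[0]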

averad commented 1 year ago

🤗 Diffusers with Stable Diffusion 2 is live!

anton-l commented (https://github.com/huggingface/diffusers/issues/1388#issuecomment-1327731012): diffusers==0.9.0 with Stable Diffusion 2 is live!

Installation: pip install diffusers[torch]==0.9 transformers

Release Information: https://github.com/huggingface/diffusers/releases/tag/v0.9.0

Contributors: @kashif @pcuenca @patrickvonplaten @anton-l @patil-suraj

💭 User Story (Prior to Huggingface Diffusers 0.9.0 Release)

Stability-AI has released the Stable Diffusion 2.0 models/workflow. When you run convert_original_stable_diffusion_to_diffusers.py on the new Stability-AI/stablediffusion models, the following errors occur.

convert_original_stable_diffusion_to_diffusers.py --checkpoint_path="./512-inpainting-ema.ckpt" --dump_path="./512-inpainting-ema_diffusers"

Output:

Traceback (most recent call last):
File "convert_original_stable_diffusion_to_diffusers.py", line 720, in <module> 
        unet.load_state_dict(converted_unet_checkpoint)
File "lib\site-packages\torch\nn\modules\module.py", line 1667, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for UNet2DConditionModel:
        size mismatch for down_blocks.0.attentions.0.proj_in.weight: copying a param with shape torch.Size([320, 320]) from checkpoint, the shape in current model is torch.Size([320, 320, 1, 1]).
        size mismatch for down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([320, 768]).
        size mismatch for down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight: copying a param with shape torch.Size([320, 1024]) from checkpoint, the shape in current model is torch.Size([320, 768]).
        size mismatch for down_blocks.0.attentions.0.proj_out.weight: copying a param with shape torch.Size([320, 320]) from checkpoint, the shape in current model is torch.Size([320, 320, 1, 1]).
.... blocks.1.attentions blocks.2.attentions etc. etc.

devilismyfriend commented 1 year ago

Trying to, but I likely won't be able to do it lol

0xdevalias commented 1 year ago

Semi-Related:

devilismyfriend commented 1 year ago

After looking at it, I'm not sure it has anything to do with the script; it seems like the unet in diffusers needs the tensors to have 4 dimensions.

AugmentedRealityCat commented 1 year ago

needs to have 4 dimensions

So I guess this will take time...

devilismyfriend commented 1 year ago

needs to have 4 dimensions

So I guess this will take time...

Maybe not. I'm not that knowledgeable on the subject, but I assume a unet2D needs to be 4D, or maybe you can just artificially add the dimension, idk
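
For what it's worth, a minimal sketch of what the shape part of the mismatch is about, going by the traceback above (illustrative only, not the fix that eventually landed):

import torch

# The [320, 320] vs [320, 320, 1, 1] mismatches: SD 2.0 stores proj_in / proj_out as
# nn.Linear weights (2-D), while the older diffusers UNet used 1x1 convs (4-D).
# A 2-D weight can be viewed as a 1x1 conv by adding two trailing dimensions:
linear_weight = torch.randn(320, 320)          # shape as found in the SD 2.0 checkpoint
conv_weight = linear_weight[:, :, None, None]  # shape the old UNet2DConditionModel expects
print(conv_weight.shape)                       # torch.Size([320, 320, 1, 1])

# The [320, 1024] vs [320, 768] mismatches are a separate change: SD 2.0 swaps the 768-d
# CLIP ViT-L text encoder for the 1024-d OpenCLIP ViT-H one, so the unet's cross-attention
# dimension has to change too; reshaping alone can't fix that.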

0xdevalias commented 1 year ago

rudimentary support for stable diffusion 2.0

https://github.com/MrCheeze/stable-diffusion-webui/commit/069591b06bbbdb21624d489f3723b5f19468888d

Originally posted by @152334H in https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/5011#issuecomment-1325971596

hafriedlander commented 1 year ago

https://github.com/hafriedlander/diffusers/blob/stable_diffusion_2/scripts/convert_original_stable_diffusion_to_diffusers.py

Notes:

  • Only tested on the two txt2img models, not inpaint / depth2img / upscaling
  • You will need to change your text embedding to use the penultimate layer too
  • It spits out a bunch of warnings about vision_model, but that's fine
  • I have no idea if this is right or not. It generates images, no guarantee beyond that. (Hence no PR - if you're patient, I'm sure the Diffusers team will do a better job than I have)

hafriedlander commented 1 year ago

Here's an example of accessing the penultimate text embedding layer https://github.com/hafriedlander/stable-diffusion-grpcserver/blob/b34bb27cf30940f6a6a41f4b77c5b77bea11fd76/sdgrpcserver/pipeline/text_embedding/basic_text_embedding.py#L33

devilismyfriend commented 1 year ago

https://github.com/hafriedlander/diffusers/blob/stable_diffusion_2/scripts/convert_original_stable_diffusion_to_diffusers.py

Notes:

  • Only tested on the two txt2img models, not inpaint / depth2img / upscaling
  • You will need to change your text embedding to use the penultimate layer too
  • It spits out a bunch of warnings about vision_model, but that's fine
  • I have no idea if this is right or not. It generates images, no guarantee beyond that. (Hence no PR - if you're patient, I'm sure the Diffusers team will do a better job than I have)

doesn't seem to work for me on the 768-v model using the v2 config for v

TypeError: EulerDiscreteScheduler.__init__() got an unexpected keyword argument 'prediction_type'

CoffeeVampir3 commented 1 year ago

Appears I'm also getting an unexpected keyword argument error, but for a different arg:

Command:

python convert.py --checkpoint_path="models/512-base-ema.ckpt" --dump_path="outputs/" --original_config_file="v2-inference.yaml"

Result:

736 │ unet = UNet2DConditionModel(**unet_config)
737 │ unet.load_state_dict(converted_unet_checkpoint)
TypeError: __init__() got an unexpected keyword argument 'use_linear_projection'

I can't seem to find a resolution to this one.

hafriedlander commented 1 year ago

You need to use the absolute latest Diffusers and merge this PR (or use my branch which has it in it) https://github.com/huggingface/diffusers/pull/1386

hafriedlander commented 1 year ago

(My branch is at https://github.com/hafriedlander/diffusers/tree/stable_diffusion_2)
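
(For anyone wanting to try it, a standard way to install straight from that branch, assuming pip with git support:)

pip install --upgrade git+https://github.com/hafriedlander/diffusers.git@stable_diffusion_2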

patrickvonplaten commented 1 year ago

Amazing to see the excitement here! We'll merge #1386 in a bit :-)

hafriedlander commented 1 year ago

@patrickvonplaten the problems I've run into so far:

patrickvonplaten commented 1 year ago

That's super helpful @hafriedlander - thanks!

BTW, weights for the 512x512 are up:

Looking into the 768x768 model now

hafriedlander commented 1 year ago

Nice. Do you have a solution in mind for how to flag to the pipeline to use the penultimate layer in the CLIP model? (I just pass it in as an option at the moment)

patrickvonplaten commented 1 year ago

Can you send me a link? Does the pipeline not work out of the box? cc @anton-l @patil-suraj

hafriedlander commented 1 year ago

It works but I don't think it's correct. The Stability configuration files explicitly say to use the penultimate CLIP layer https://github.com/Stability-AI/stablediffusion/blob/33910c386eaba78b7247ce84f313de0f2c314f61/configs/stable-diffusion/v2-inference-v.yaml#L68

hafriedlander commented 1 year ago

It's relatively easy to get access to the penultimate layer. I do it in my custom pipeline like this:

https://github.com/hafriedlander/stable-diffusion-grpcserver/blob/b34bb27cf30940f6a6a41f4b77c5b77bea11fd76/sdgrpcserver/pipeline/text_embedding/basic_text_embedding.py#L33

The problem is knowing when to do it and when not to.
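
For readers without the link handy, here's a minimal sketch of the penultimate-layer idea against the full OpenCLIP ViT-H text model. It is not necessarily identical to the linked implementation, the prompt is illustrative, and loading the text model from the full CLIP checkpoint will print the vision_model warnings mentioned earlier:

import torch
from transformers import AutoTokenizer, CLIPTextModel

model_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
tokenizer = AutoTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)  # warns about unused vision_model weights

tokens = tokenizer(
    "a photo of an astronaut riding a horse on mars",
    padding="max_length", max_length=77, truncation=True, return_tensors="pt",
)
with torch.no_grad():
    output = text_encoder(tokens.input_ids, output_hidden_states=True)

# hidden_states[-1] is the final layer; [-2] is the penultimate layer the SD 2 configs ask for.
penultimate = output.hidden_states[-2]
# The final layer norm is still applied before the embeddings go to the unet.
text_embeddings = text_encoder.text_model.final_layer_norm(penultimate)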

patrickvonplaten commented 1 year ago

I see! Thanks for the links - so they do this for both the 512x512 SD 2 and 768x768 SD 2 model?

hafriedlander commented 1 year ago

Both

hafriedlander commented 1 year ago

It's a technique NovelAI discovered FYI (https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac)

patrickvonplaten commented 1 year ago

Actually @patil-suraj solved it pretty cleanly by just removing the last layer: https://huggingface.co/stabilityai/stable-diffusion-2-inpainting/blob/main/text_encoder/config.json#L19

So this works out of the box

patrickvonplaten commented 1 year ago

Notice the difference between: https://huggingface.co/stabilityai/stable-diffusion-2-inpainting/blob/main/text_encoder/config.json#L19 and https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/blob/main/config.json#L54
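
(For anyone following along, a rough sketch of what that config difference amounts to; this is illustrative rather than the exact code used:)

from transformers import CLIPTextModel

# The SD 2 repos ship a text_encoder config with num_hidden_layers = 23 instead of the
# 24 layers of the full OpenCLIP ViT-H text model, so the encoder's "last" hidden state
# is effectively the penultimate layer of the original model - no pipeline changes needed.
text_encoder = CLIPTextModel.from_pretrained(
    "stabilityai/stable-diffusion-2", subfolder="text_encoder"
)
print(text_encoder.config.num_hidden_layers)  # 23 per the published config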

hafriedlander commented 1 year ago

Ah, nice. Yeah, that's cleaner.

averad commented 1 year ago

768x768 weights released:

fp16 and other versions of the models appear to be in the process of being prepared and uploaded.

0xdevalias commented 1 year ago

Testing is in progress on the horde (https://github.com/Sygil-Dev/nataili/tree/v2). Try out Stable Diffusion 2.0 on our UIs:

https://tinybots.net/artbot https://aqualxx.github.io/stable-ui/ https://dbzer0.itch.io/lucid-creations

https://sigmoid.social/@stablehorde/109398715339480426

SD 2.0

  • [x] Initial implementation ready for testing
  • [ ] img2img
  • [ ] inpainting
  • [ ] k_diffusers support

Originally posted by @AlRlC in https://github.com/Sygil-Dev/nataili/issues/67#issuecomment-1326385645

0xdevalias commented 1 year ago

Originally posted by @0xdevalias in https://github.com/TheLastBen/fast-stable-diffusion/issues/599#issuecomment-1326446674

0xdevalias commented 1 year ago

Should work now, make sure you check the box "redownload original model" when choosing V2

https://colab.research.google.com/github/TheLastBen/fast-stable-diffusion/blob/main/fast_stable_diffusion_AUTOMATIC1111.ipynb

Requires more than 12GB of RAM for now, so free colab probably won't suffice.

Originally posted by @TheLastBen in https://github.com/TheLastBen/fast-stable-diffusion/issues/599#issuecomment-1326461962

hamzafar commented 1 year ago

Yes, stable_diffusion2 is working now. And the few lines of code to get inference is in here: https://colab.research.google.com/drive/1Na9x7w7RSbk2UFbcnrnuurg7kFGeqBsa?usp=sharing

devilismyfriend commented 1 year ago

I assume the convert diffusers to SD ckpt will need an update as well?

TheLastBen commented 1 year ago

I assume the convert diffusers to SD ckpt will need an update as well?

Nope

hafriedlander commented 1 year ago

@patrickvonplaten how sure are you that your conversion is correct? I'm trying to diagnose a difference I get between your 768 weights and my conversion script. There's a big difference, and in general I much prefer the results from my conversion. It seems specific to the unet - if I replace my unet with yours I get the same results.

hafriedlander commented 1 year ago

OK, differential diagnostic done, it's the Tokenizer. How did you create the Tokenizer at https://huggingface.co/stabilityai/stable-diffusion-2/tree/main/tokenizer? I just built a Tokenizer using AutoTokenizer.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K") - it seems to give much better results.

0xdevalias commented 1 year ago

Yes, stable_diffusion2 is working now. And the few lines of code to get inference is in here: colab.research.google.com/drive/1Na9x7w7RSbk2UFbcnrnuurg7kFGeqBsa?usp=sharing

@hamzafar In one of the last cells (that sets up EulerDiscreteScheduler) the following warning is shown. I wonder if things would work differently/better if ftfy or spacy was installed alongside the other requirements?

ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.

0xdevalias commented 1 year ago

From @pcuenca on the HF discord:

We are busy preparing a new release of diffusers to fully support Stable Diffusion 2. We are still ironing things out, but the basics already work from the main branch in github. Here's how to do it:

  • Install diffusers from github alongside its dependencies:
pip install --upgrade git+https://github.com/huggingface/diffusers.git transformers accelerate scipy
  • Use the code in this script to run your predictions:
from diffusers import DiffusionPipeline, EulerDiscreteScheduler
import torch

repo_id = "stabilityai/stable-diffusion-2"
device = "cuda"

scheduler = EulerDiscreteScheduler.from_pretrained(repo_id, subfolder="scheduler", prediction_type="v_prediction")
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16", scheduler=scheduler)
pipe = pipe.to(device)

prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, width=768, height=768, guidance_scale=9).images[0]
image.save("astronaut.png")

Originally posted by @vvvm23 in https://github.com/huggingface/diffusers/issues/1392#issuecomment-1326747275

hafriedlander commented 1 year ago

I've put "my" version of the Tokenizer at https://huggingface.co/halffried/sd2-laion-clipH14-tokenizer/tree/main. You can just replace the tokenizer in any pipeline to test it if you're interested.
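
(A minimal sketch of doing that swap for a side-by-side comparison; the pipeline setup here is illustrative:)

import torch
from diffusers import DiffusionPipeline
from transformers import AutoTokenizer

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

# Swap in the alternative tokenizer posted above and compare generations for the same seed.
pipe.tokenizer = AutoTokenizer.from_pretrained("halffried/sd2-laion-clipH14-tokenizer")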

0xdevalias commented 1 year ago

How did you create the Tokenizer at huggingface.co/stabilityai/stable-diffusion-2/tree/main/tokenizer?

@hafriedlander Given that it is the official stabilityai repo, presumably no one here in huggingface/diffusers made it, and it was just what was released with SDv2?

hafriedlander commented 1 year ago

@0xdevalias not sure. @patrickvonplaten said that the penultimate layer fix was invented by @patil-suraj, who's a HuggingFace person, not a Stability person. Anyway, I'm not saying mine is correct or anything, just that, in the limited testing I've done, I like the result way more, and that's weird.

patil-suraj commented 1 year ago

OK, differential diagnostic done, it's the Tokenizer. How did you create the Tokenizer at https://huggingface.co/stabilityai/stable-diffusion-2/tree/main/tokenizer? I just built a Tokenizer using AutoTokenizer.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K") - it seems to give much better results.

Thanks, will take a look. Also, could you post some results here so we can see the differences? I compared the results with the original repo and they seemed to match, but I'll take a look again.

patil-suraj commented 1 year ago

Also, could you post the prompts that gave you bad results?

hafriedlander commented 1 year ago

The whole model seems very sensitive to style shifts.

https://imgur.com/a/dUb93fD is three images with the standard tokenizer. The prompt for the first is

"A full portrait of a teenage smiling, beautiful post apocalyptic female princess, intricate, elegant, highly detailed, digital painting, artstation, smooth, sharp focus, illustration, art by krenz cushart and artem demura and alphonse mucha"

The prompt for the second is exactly the same, but with the addition of a negative prompt "bad teeth, missing teeth"

The third is the first prompt, but without the word smiling

Here is the same with my version of the tokenizer https://imgur.com/a/Wr5Sw9P

The second version with the original tokenizer is great. But I would not normally expect to see a big shift in quality from the addition of a negative prompt like that.

I'll track down another of my recent prompts where I much preferred my tokenizer, and see if adding a negative prompt helps.

patil-suraj commented 1 year ago

Thank you! Will also compare using these prompts.

patil-suraj commented 1 year ago

I noticed one difference: the original open_clip tokenizer used to train SD2 uses 0 as the pad_token_id, while the AutoTokenizer you posted uses 49407. So the current tokenizer matches the original implementation; we can verify it using the code below.

import torch
from transformers import CLIPTokenizer, AutoTokenizer
from open_clip import tokenize

tok = CLIPTokenizer.from_pretrained("stabilityai/stable-diffusion-2", subfolder="tokenizer")
tok2 = AutoTokenizer.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

prompt = "A full portrait of a teenage smiling, beautiful post apocalyptic female princess, intricate, elegant, highly detailed, digital painting, artstation, smooth, sharp focus, illustration, art by krenz cushart and artem demura and alphonse mucha"

tok_orig = tokenize(prompt)
tok_current = tok(prompt, padding="max_length", max_length=77, return_tensors="pt").input_ids
tok_auto = tok2(prompt, padding="max_length", max_length=77, return_tensors="pt", truncation=True).input_ids

assert torch.all(tok_orig == tok_current)  # passes: matches the open_clip tokenization
assert torch.all(tok_orig == tok_auto)     # fails: the padding token ids differ
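
(As a quick sanity check of the pad-token difference, continuing from the snippet above; the values are those quoted in this comment:)

print(tok.pad_token_id)   # 0, matching open_clip
print(tok2.pad_token_id)  # 49407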

cc @patrickvonplaten

anton-l commented 1 year ago

diffusers==0.9.0 with Stable Diffusion 2 is live! https://github.com/huggingface/diffusers/releases/tag/v0.9.0

hamzafar commented 1 year ago

Yes, stable_diffusion2 is working now. And the few lines of code to get inference is in here: colab.research.google.com/drive/1Na9x7w7RSbk2UFbcnrnuurg7kFGeqBsa?usp=sharing

@hamzafar In one of the last cells (that sets up EulerDiscreteScheduler) the following warning is shown. I wonder if things would work differently/better if ftfy or spacy was installed alongside the other requirements?

ftfy or spacy is not installed using BERT BasicTokenizer instead of ftfy.

@0xdevalias I have generated images with and without ftfy. I can't observe any difference in the results: https://colab.research.google.com/drive/1Na9x7w7RSbk2UFbcnrnuurg7kFGeqBsa?usp=sharing

patrickvonplaten commented 1 year ago

Sorry, the warning is misleading and comes from transformers - you can safely ignore it. I'll try to fix it in Transformers.

0xdevalias commented 1 year ago

when will Dreambooth support sd2

While it's not dreambooth, this repo seems to have support for finetuning SDv2:

Originally posted by @0xdevalias in https://github.com/JoePenna/Dreambooth-Stable-Diffusion/issues/112#issuecomment-1327993709


And looking at the huggingface/diffusers repo, there are a few issues that seem to imply people may be getting dreambooth working with it (or at least trying to), e.g.:

Originally posted by @0xdevalias in https://github.com/JoePenna/Dreambooth-Stable-Diffusion/issues/112#issuecomment-1327998619

vvsotnikov commented 1 year ago

UPDATE: the issue is gone with the newer build of xformers

Hi, I'm using diffusers==0.9.0 and xformers==0.0.15.dev0+1515f77.d20221129, and for me SD 2.0 runs roughly 1.5x slower with xformers than without it (while it indeed saves some VRAM). At the same time, SD 1.5 runs about 1.5x faster with xformers, so it's unlikely that there's something wrong with my setup :) Is it a known issue? Here are some code samples to reproduce the issue:

# SD2, xformers disabled -> 5.02it/s
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
repo_id = "stabilityai/stable-diffusion-2"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
pipe.disable_xformers_memory_efficient_attention()
prompt = "An oil painting of white De Tomaso Pantera parked in the forest by Ivan Shishkin"
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]  # warmup
image = pipe(prompt, guidance_scale=9, num_inference_steps=250, width=1024, height=576).images[0]
Fetching 12 files: 100%|##########| 12/12 [00:00<00:00, 52648.17it/s]
100%|##########| 25/25 [00:05<00:00,  4.70it/s]
100%|##########| 250/250 [00:49<00:00,  5.02it/s]
# SD2, xformers enabled ->  2.93it/s
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
repo_id = "stabilityai/stable-diffusion-2"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention()  # explicitly enable xformers just in case
prompt = "An oil painting of white De Tomaso Pantera parked in the forest by Ivan Shishkin"
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]  # warmup
image = pipe(prompt, guidance_scale=9, num_inference_steps=250, width=1024, height=576).images[0]
Fetching 12 files: 100%|##########| 12/12 [00:00<00:00, 43804.74it/s]
100%|##########| 25/25 [00:08<00:00,  2.90it/s]
100%|##########| 250/250 [01:25<00:00,  2.93it/s]
# SD1.5, xformers disabled -> 5.66it/s
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, revision="fp16")
pipe = pipe.to("cuda")
pipe.disable_xformers_memory_efficient_attention() 
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]  
image = pipe(prompt, width=880, num_inference_steps=150).images[0]  
Fetching 15 files: 100%|##########| 15/15 [00:00<00:00, 56987.83it/s]
100%|##########| 51/51 [00:04<00:00, 10.85it/s]
100%|##########| 151/151 [00:26<00:00,  5.66it/s]

# SD1.5, xformers enabled -> 7.94it/s
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, revision="fp16")
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention()
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]  
image = pipe(prompt, width=880, num_inference_steps=150).images[0]  
Fetching 15 files: 100%|##########| 15/15 [00:00<00:00, 54660.78it/s]
100%|##########| 51/51 [00:04<00:00, 12.42it/s]
100%|##########| 151/151 [00:19<00:00,  7.94it/s]