huggingface / diffusers

šŸ¤— Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[Community] Implement `prompt-to-prompt` pipelines #2121

Open apolinario opened 1 year ago

apolinario commented 1 year ago

Describe the solution you'd like
Now that we have an official way to tweak cross attention (https://github.com/huggingface/diffusers/pull/1639), it would be great to have a pipeline (be it official or community) for prompt-to-prompt and further implementations of the technique (such as EDICT).

Describe alternatives you've considered
@amirhertz's official Prompt-to-Prompt implementation is built on top of diffusers 0.3.0 with its own cross attention manipulation function. @bloc97's community prompt-to-prompt implementation already uses diffusers, but it is pinned to version 0.4.1, also with a cross attention control of its own. @bram-w / Salesforce's EDICT, which adds inversion to prompt-to-prompt (allowing you to edit real images), also uses the above as a base, with some modifications for double precision during inversion.

So while alternatives exist, they require users to pin old versions of diffusers and miss out on the latest advancements. Given how useful this technique is, having it as a pipeline within diffusers would be really valuable. It could also potentially bring the technique to other models (Karlo, IF, etc.).

Additional context
InstructPix2Pix and Imagic have shown that editing real and generated images is a trend. Prompt-to-prompt is a nice tool to have on that belt for practitioners, artists and professionals.

patrickvonplaten commented 1 year ago

+100 - just lacking the time at the moment. I wonder whether we should do a community sprint in a week or so trying to add the most important "tweak your text prompts" pipelines.

patrickvonplaten commented 1 year ago

Actually taking this as an opportunity to turn the feature request into a more precise explanation of how it can be added.

In short, we now have all the necessary tools to add a pipeline like Prompt-2-prompt in a nice & clean way. Prompt-2-prompt has an official implementation with a paper release and 1k+ stars, so IMO we should put it in src/diffusers/pipelines

What you'll need to do:

Very keen on guiding someone from the community through a PR, but I currently can't find the time to do it myself.
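
For orientation, here is a minimal sketch of the hook from the cross attention PR above that such a pipeline would build on. The CrossAttnProcessor import path is an assumption based on the diffusers 0.12-era module layout used elsewhere in this thread; a prompt-to-prompt pipeline would install its own processors instead of the default one.

from diffusers import StableDiffusionPipeline
from diffusers.models.cross_attention import CrossAttnProcessor  # assumed 0.12-era path

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# Every attention layer is addressable by name; a prompt-to-prompt pipeline would
# register a processor that records or swaps attention maps instead of the default.
custom_processors = {name: CrossAttnProcessor() for name in pipe.unet.attn_processors}
pipe.unet.set_attn_processor(custom_processors)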

keturn commented 1 year ago

You may also reference InvokeAI's update for the diffusers 0.12 attention API: https://github.com/invoke-ai/InvokeAI/pull/2385

A few caveats:

kashif commented 1 year ago

So I had the following attention processors in mind for this variant of prompt-to-prompt: https://github.com/cccntu/efficient-prompt-to-prompt

import torch
import xformers.ops
from diffusers.models.cross_attention import CrossAttention  # diffusers 0.12-era module path


# Plain attention, except that keys/values may come from a different hidden state
# (e.g. the source prompt) than the queries.
class CrossAttnKVProcessor:
    def __call__(
        self, attn: CrossAttention, hidden_states, key_hidden_states=None, value_hidden_state=None, attention_mask=None
    ):
        _, sequence_length, _ = hidden_states.shape
        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)

        query = attn.to_q(hidden_states)
        query = attn.head_to_batch_dim(query)

        key_hidden_states = key_hidden_states if key_hidden_states is not None else hidden_states
        value_hidden_state = value_hidden_state if value_hidden_state is not None else hidden_states
        key = attn.to_k(key_hidden_states)
        value = attn.to_v(value_hidden_state)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)

        return hidden_states

class XFormersCrossAttnKVProcessor:
    def __call__(
        self, attn: CrossAttention, hidden_states, key_hidden_states=None, value_hidden_state=None, attention_mask=None
    ):
        _, sequence_length, _ = hidden_states.shape

        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)

        query = attn.to_q(hidden_states)

        key_hidden_states = key_hidden_states if key_hidden_states is not None else hidden_states
        value_hidden_state = value_hidden_state if value_hidden_state is not None else hidden_states
        key = attn.to_k(key_hidden_states)
        value = attn.to_v(value_hidden_state)

        query = attn.head_to_batch_dim(query).contiguous()
        key = attn.head_to_batch_dim(key).contiguous()
        value = attn.head_to_batch_dim(value).contiguous()

        hidden_states = xformers.ops.memory_efficient_attention(query, key, value, attn_bias=attention_mask)
        hidden_states = hidden_states.to(query.dtype)
        hidden_states = attn.batch_to_head_dim(hidden_states)

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)

        return hidden_states

class SlicedAttnKVProcessor:
    def __init__(self, slice_size):
        self.slice_size = slice_size

    def __call__(
        self, attn: CrossAttention, hidden_states, key_hidden_states=None, value_hidden_state=None, attention_mask=None
    ):
        _, sequence_length, _ = hidden_states.shape

        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)

        query = attn.to_q(hidden_states)
        dim = query.shape[-1]
        query = attn.head_to_batch_dim(query)

        key_hidden_states = key_hidden_states if key_hidden_states is not None else hidden_states
        value_hidden_state = value_hidden_state if value_hidden_state is not None else hidden_states
        key = attn.to_k(key_hidden_states)
        value = attn.to_v(value_hidden_state)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        batch_size_attention = query.shape[0]
        hidden_states = torch.zeros(
            (batch_size_attention, sequence_length, dim // attn.heads), device=query.device, dtype=query.dtype
        )

        for i in range(hidden_states.shape[0] // self.slice_size):
            start_idx = i * self.slice_size
            end_idx = (i + 1) * self.slice_size

            query_slice = query[start_idx:end_idx]
            key_slice = key[start_idx:end_idx]
            attn_mask_slice = attention_mask[start_idx:end_idx] if attention_mask is not None else None

            attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)

            attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx])

            hidden_states[start_idx:end_idx] = attn_slice

        hidden_states = attn.batch_to_head_dim(hidden_states)

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)

        return hidden_states
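
A quick, self-contained smoke test of the call contract above. This is an untested sketch; the CrossAttention import path and constructor arguments assume the diffusers 0.12-era API these processors are written against.

import torch
from diffusers.models.cross_attention import CrossAttention

attn = CrossAttention(query_dim=320, cross_attention_dim=768, heads=8, dim_head=40)
processor = CrossAttnKVProcessor()

image_feats = torch.randn(2, 4096, 320)  # UNet hidden states: (batch, pixels, channels)
source_text = torch.randn(2, 77, 768)    # text embeddings of the *source* prompt

# Queries come from the current sample, keys/values from the source prompt:
out = processor(attn, image_feats, key_hidden_states=source_text, value_hidden_state=source_text)
print(out.shape)  # torch.Size([2, 4096, 320])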

patrickvonplaten commented 1 year ago

Sure, this seems reasonable; I guess it would be great to see it in a pipeline class directly :-)

unography commented 1 year ago

Is this open? Would be happy to take it up!

kashif commented 1 year ago

@unography yes it's open, please feel free to contribute!

unography commented 1 year ago

@kashif sure, will add a draft PR soon

evinpinar commented 1 year ago

This looks plausible, thanks! Furthermore, with the xformers implementation, how can we retrieve the softmaxed q*k attention map (before it is applied to the values)? See here: https://github.com/facebookresearch/xformers/blob/5df1f0b682a5b246577f0cf40dd3b15c1a04ce50/xformers/ops/fmha/__init__.py#L149
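
As far as I can tell, memory_efficient_attention never materializes softmax(QK^T), so for the layers where the map is needed it would have to be computed explicitly. An untested sketch, reusing the head-batched query/key tensors exactly as the processors above produce them:

import torch

def explicit_attention_probs(attn, query, key, attention_mask=None):
    # Reuses the same helper the non-xformers path uses; query/key are already
    # head-batched via attn.head_to_batch_dim(...).
    return attn.get_attention_scores(query, key, attention_mask)

def explicit_attention_probs_manual(query, key, scale, attention_mask=None):
    # Equivalent manual computation: softmax(Q K^T * scale), optionally masked.
    scores = torch.bmm(query, key.transpose(-1, -2)) * scale
    if attention_mask is not None:
        scores = scores + attention_mask
    return scores.softmax(dim=-1)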

Alchete commented 1 year ago

Taking a step back -- I question the actual usefulness of "prompt-to-prompt". Why would someone generate an image with the wrong prompt in the first place?? If I wanted a "box of cookies", why did I type "box of apples"?

Plus, there are now more powerful and flexible techniques available. The paper below requires no input prompt, just a raw image, from which it extracts various features from the diffusion layers and applies them to a new prompt. This seems much more in line with a normal image workflow than prompt-to-prompt. Cheers.

https://arxiv.org/pdf/2211.12572.pdf

evinpinar commented 1 year ago

In case it's useful for anyone, I've implemented Attend-and-Excite with the AttentionProcessors; an example is here: https://github.com/evinpinar/Attend-and-Excite-diffusers/blob/72fa567a1e3bb3cc1b63cb53a1d9db5fc10b241f/utils/ptp_utils.py#L57


import torch
from diffusers.models.cross_attention import CrossAttention  # diffusers 0.12-era module path


# Behaves like the default processor but hands every attention map to an external
# "attnstore" controller, tagged with whether it is cross-attention and where in the UNet it sits.
class AttendExciteCrossAttnProcessor:

    def __init__(self, attnstore, place_in_unet):
        super().__init__()
        self.attnstore = attnstore
        self.place_in_unet = place_in_unet

    def __call__(self, attn: CrossAttention, hidden_states, encoder_hidden_states=None, attention_mask=None):
        batch_size, sequence_length, _ = hidden_states.shape
        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)

        query = attn.to_q(hidden_states)

        is_cross = encoder_hidden_states is not None
        encoder_hidden_states = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        query = attn.head_to_batch_dim(query)
        key = attn.head_to_batch_dim(key)
        value = attn.head_to_batch_dim(value)

        attention_probs = attn.get_attention_scores(query, key, attention_mask)

        self.attnstore(attention_probs, is_cross, self.place_in_unet)

        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)

        # linear proj
        hidden_states = attn.to_out[0](hidden_states)
        # dropout
        hidden_states = attn.to_out[1](hidden_states)

        return hidden_states

def register_attention_control(model, controller):

    attn_procs = {}
    cross_att_count = 0
    for name in model.unet.attn_processors.keys():
        cross_attention_dim = None if name.endswith("attn1.processor") else model.unet.config.cross_attention_dim
        if name.startswith("mid_block"):
            hidden_size = model.unet.config.block_out_channels[-1]
            place_in_unet = "mid"
        elif name.startswith("up_blocks"):
            block_id = int(name[len("up_blocks.")])
            hidden_size = list(reversed(model.unet.config.block_out_channels))[block_id]
            place_in_unet = "up"
        elif name.startswith("down_blocks"):
            block_id = int(name[len("down_blocks.")])
            hidden_size = model.unet.config.block_out_channels[block_id]
            place_in_unet = "down"
        else:
            continue

        cross_att_count += 1
        attn_procs[name] = AttendExciteCrossAttnProcessor(
            attnstore=controller, place_in_unet=place_in_unet
        )

    model.unet.set_attn_processor(attn_procs)
    controller.num_att_layers = cross_att_count
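
A hypothetical end-to-end usage of the snippet above. AttentionStore and its module path are assumptions taken from the linked ptp_utils.py, not part of diffusers:

from diffusers import StableDiffusionPipeline
from utils.ptp_utils import AttentionStore  # path as in the linked repo

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
controller = AttentionStore()                 # collects per-layer attention maps
register_attention_control(pipe, controller)  # installs AttendExciteCrossAttnProcessor everywhere
image = pipe("A cat and a dog playing with a ball").images[0]
# The controller now holds the recorded cross-/self-attention maps for this generation.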

patrickvonplaten commented 1 year ago

Super cool! @evinpinar feel free to open a PR to add this as a new pipeline. Maybe this PR is a good example of how to add a new simple pipeline: https://github.com/huggingface/diffusers/pull/2223

Amazing work :heart:

isamu-isozaki commented 1 year ago

@evinpinar Looks awesome!

isamu-isozaki commented 1 year ago

Btw, is there a PR like this for prompt-to-prompt? I just want to check out the implementation for research. If not, I'm happy to make one based on @evinpinar's code.

Weifeng-Chen commented 1 year ago

Hi everyone, I just implemented a pipeline here: https://github.com/Weifeng-Chen/prompt2prompt, based on @evinpinar's code and borrowing from Google's prompt-to-prompt. It uses a 'controller' to replace, refine, or reweight attention. The controller currently lives outside the pipeline; I'm not sure whether it should be moved into the pipeline. Here's some reference code for now, any advice on the API?

from pipeline_prompt2prompt import Prompt2PromptPipeline
from ptp_utils import AttentionStore, AttentionReplace, LocalBlend, AttentionRefine, AttentionReweight, view_images, get_equalizer
import torch
import numpy as np

g_cpu = torch.Generator().manual_seed(2333)
device = "cuda"

pipe = Prompt2PromptPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)

prompts = ["A painting of a squirrel eating a burger",
           "A painting of a cat eating a burger"]

NUM_DIFFUSION_STEPS = 20
lb = LocalBlend(prompts, ("squirrel", "cat"), tokenizer=pipe.tokenizer, device=pipe.device)
controller = AttentionReplace(prompts, NUM_DIFFUSION_STEPS, cross_replace_steps=.4, self_replace_steps=0.4, tokenizer=pipe.tokenizer, device=pipe.device, local_blend=lb)
outputs = pipe(prompt=prompts, height=512, width=512, num_inference_steps=NUM_DIFFUSION_STEPS,
                controller=controller, generator=g_cpu,)
view_images([np.array(img) for img in outputs.images])

pipe.show_cross_attention(prompts, controller, res=16, from_where=("up", "down"), select=0)
pipe.show_cross_attention(prompts, controller, res=16, from_where=("up", "down"), select=1)

For more operations, have a look at https://github.com/Weifeng-Chen/prompt2prompt/blob/main/p2p_test.ipynb

isamu-isozaki commented 1 year ago

@Weifeng-Chen Thanks and awesome!

asadm commented 1 year ago

Thanks @Weifeng-Chen

I have a dumb question: when doing a refinement, what do self_replace_steps and cross_replace_steps actually mean?

Say I want to switch between two prompts: "A painting of a squirrel eating a burger" and "A real photo of a squirrel eating a burger" at 0.7. What values do I set for these two arguments in AttentionReplace()?

Weifeng-Chen commented 1 year ago

> Thanks @Weifeng-Chen
>
> I have a dumb question: when doing a refinement, what do self_replace_steps and cross_replace_steps actually mean?
>
> Say I want to switch between two prompts: "A painting of a squirrel eating a burger" and "A real photo of a squirrel eating a burger" at 0.7. What values do I set for these two arguments in AttentionReplace()?

You can experiment with it: cross_replace_steps=0., self_replace_steps=0. means no replacement, i.e. a completely new image is generated from scratch. As I understand it, at inference time the new prompt generates new cross-attention and self-attention maps, which are replaced with the original ones. Larger values keep the result more similar to the original but may restrict the editing. I didn't fully test it, so please point it out if I'm wrong.

asadm commented 1 year ago

But then where does 0.7 go? šŸ¤”

Weifeng-Chen commented 1 year ago

> But then where does 0.7 go? šŸ¤”

You can try changing it. 0.7 means the first 70% of the steps use the original prompt's attention and the remaining 30% use the new one.
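
In other words (a tiny illustration of that interpretation, using the 20-step example above):

num_inference_steps = 20
cross_replace_steps = 0.7
replace_until_step = int(cross_replace_steps * num_inference_steps)  # = 14
# steps 0..13: inject the original prompt's attention maps
# steps 14..19: let the new prompt's attention take over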

asadm commented 1 year ago

Yes, that's what I'm trying to achieve. So does that mean I set both to 0.7, i.e. cross_replace_steps=0.7, self_replace_steps=0.7?

Thank you!

Weifeng-Chen commented 1 year ago

> Yes, that's what I'm trying to achieve. So does that mean I set both to 0.7, i.e. cross_replace_steps=0.7, self_replace_steps=0.7?
>
> Thank you!

They don't necessarily need to be the same; self-attention doesn't interact with the text.

Joqsan commented 1 year ago

Hi, I'd like to take on the EDICT implementation, if no one has started it yet.

xvjiarui commented 1 year ago

Any updates in this thread? Looking forward to it!

patrickvonplaten commented 1 year ago

Note that we've already added the pix2pix-zero pipeline, which is an improved version of prompt2prompt: https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/pix2pix_zero

I'm not sure how much sense prompt2prompt makes given that an improved version has already been added.

ryan-caesar-ramos commented 1 year ago

I'm not necessarily pushing for it, but I will say that what methods like Prompt-to-Prompt and EDICT have over pix2pix zero is the lack of a need to generate source and target embeddings. In the case of editing real images, pix2pix zero would require you to not only undergo inversion steps, but also generate the source and target embeddings and get their difference before you can generate new images. With (the original) Prompt-to-Prompt paper as well as EDICT, you'd only need to undergo the inversion steps before generating the final images.

apolinario commented 1 year ago

I agree with @ryan-caesar-ramos , I think those serve different purposes and both could be part of a toolbox on diffusers. I think we would love a community contributed PR on p2p and EDICT!

unography commented 1 year ago

I was planning to work on this, but ended up using the pix2pix pipeline instead.

But like @apolinario and @ryan-caesar-ramos mentioned, it would be cool to have this. I'll work on p2p this week and raise a PR

apolinario commented 1 year ago

With the release soon(tm) of p2p-video, this gets even more relevant imo: https://video-p2p.github.io

bonlime commented 1 year ago

Do I understand correctly that adding prompt-to-prompt re-weighting is not that difficult now, but it's impossible to have it and xformers together, since we need to modify the attention weights and xformers doesn't explicitly expose them?
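
For context, the re-weighting itself only needs the explicit softmax(QK^T) map, which the xformers kernel never materializes, so the edited layers would have to fall back to the plain attention path. An untested sketch (hypothetical class, built on the CrossAttention helpers shown earlier in this thread):

import torch

class ReweightCrossAttnProcessor:
    def __init__(self, token_indices, scale):
        self.token_indices = token_indices  # positions of prompt tokens to reweight
        self.scale = scale                  # e.g. 2.0 to amplify, 0.5 to attenuate

    def __call__(self, attn, hidden_states, encoder_hidden_states=None, attention_mask=None):
        _, sequence_length, _ = hidden_states.shape
        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)

        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        context = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))

        # Explicit softmax(QK^T): this is the map that prompt-to-prompt edits.
        attention_probs = attn.get_attention_scores(query, key, attention_mask)
        if encoder_hidden_states is not None:  # only reweight cross-attention
            attention_probs[:, :, self.token_indices] = attention_probs[:, :, self.token_indices] * self.scale

        hidden_states = torch.bmm(attention_probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)
        hidden_states = attn.to_out[0](hidden_states)  # linear proj
        hidden_states = attn.to_out[1](hidden_states)  # dropout
        return hidden_states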

patrickvonplaten commented 1 year ago

Anybody interested in picking up this feature request? Also related: https://github.com/huggingface/diffusers/issues/2979

init-22 commented 1 year ago

@Weifeng-Chen Nice implementation, I wonder why you haven't raised a PR yet!

Weifeng-Chen commented 1 year ago

> @Weifeng-Chen Nice implementation, I wonder why you haven't raised a PR yet!

Yeah, it's functional but not so elegant. I'm currently very busy and have no time to do it; maybe I'll do it if I get some time.

init-22 commented 1 year ago

Alright! Let me know if you have any ideas on making it more elegant. I didn't run the code, just looked at it, but maybe instead of creating the controller outside, you could pass its settings into Prompt2PromptPipeline's call function, along with one more parameter for the controller class type (a string: AttentionReweight, AttentionRefine, etc.), and then initialize the controller class inside the Prompt2PromptPipeline; that could make it slightly more elegant (roughly as sketched below).
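
A rough sketch of what that call signature could look like; the controller_type argument and the flattened keyword arguments are hypothetical, not an existing API:

from pipeline_prompt2prompt import Prompt2PromptPipeline

pipe = Prompt2PromptPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
outputs = pipe(
    prompt=["A painting of a squirrel eating a burger",
            "A painting of a cat eating a burger"],
    controller_type="AttentionReplace",  # or "AttentionRefine", "AttentionReweight"
    cross_replace_steps=0.4,
    self_replace_steps=0.4,
    num_inference_steps=20,
)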

niclui commented 1 year ago

@Weifeng-Chen Thank you for your work! Will your pipeline be able to support batch_size > 1 (i.e. can I generate variants of more than one image at the same time)?

UmerHA commented 1 year ago

> Anybody interested in picking up this feature request? Also related: #2979

@patrickvonplaten If this is not urgent, I'd like to give it a try and would do it by the end of July / start of August. I'm doing the fastai part 2 course and have made several contributions to other OSS projects (LangChain, gpt-engineer, ...). Solving this issue seems like a very cool learning goal. :)

patrickvonplaten commented 1 year ago

That would be great!

UmerHA commented 1 year ago

Quick update: I have started working on this, should be done in about a week

anvilarth commented 1 year ago

@UmerHA How can I contact you to help with this pipeline?

UmerHA commented 1 year ago

> @UmerHA How can I contact you to help with this pipeline?

@UmerHAdil on Twitter or umerha on Discord. I'm getting close to being done. Will upload something into my branch today, so you can have a look.

UmerHA commented 1 year ago

I want to give an update: I'm getting close to being done.

The usage of this pipeline would look like this:

pipe = Prompt2PromptPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

prompts = ["A turtle playing with a ball",  "A monkey playing with a ball"]

edit_kwargs = {
    "cross_replace_steps": 0.4,
    "self_replace_steps": 0.4,
    "local_blend_words": ["turtle", "monkey"]
}

outputs = pipe(prompt=prompts, height=512, width=512, num_inference_steps=50, edit_type='replace', edit_kwargs=edit_kwargs)

See this file for examples of all edit types.

@patrickvonplaten I have three questions:

  1. Currently, the source and target images are produced simultaneously. In real-life usage, I would imagine they'd be created sequentially. I'd first create a source image, iterate through that until I'm happy with it, and only then edit it. This implies (a) saving the source image's attention maps to the pipe output object, and (b) accepting it as an optional parameter in __call__. Should we do that?

  2. Is integration testing with stable-diffusion-v1-4 and size 512 x 512 okay, or would you choose a smaller model / shape?

  3. Should we add code to visualize the averaged attention masks? Would that be used in practice? For reference, the paper and @Weifeng-Chen's code include that.
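
For question 3, here is a rough sketch of what such a visualization helper could look like (helper names are hypothetical and not part of the PR): average the stored cross-attention maps over layers and heads, then upsample one token's slice into a heatmap.

import torch
import torch.nn.functional as F

def average_cross_attention(maps, res=16):
    # maps: list of (batch*heads, pixels, tokens) tensors collected by a controller
    maps = [m.reshape(-1, res, res, m.shape[-1]) for m in maps if m.shape[1] == res ** 2]
    return torch.cat(maps, dim=0).mean(dim=0)  # (res, res, tokens)

def token_heatmap(avg_map, token_idx, out_size=512):
    heat = avg_map[:, :, token_idx]  # (res, res)
    heat = heat / heat.max()
    return F.interpolate(heat[None, None], size=out_size, mode="bilinear")[0, 0]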


The current code can be found in this repo.

What's left to do:


Thanks for giving me the opportunity to do this! Have learned a lot. Appreciate it :)

yuanzhi-zhu commented 1 year ago

@evinpinar Thanks, your register_attention_control is great for diffusers!!