Open apolinario opened 1 year ago
+100 - just lacking the time at the moment. I wonder whether we should do a community sprint in a week or so trying to add the most important "tweak your text prompts" pipelines.
Actually taking this as an opportunity to turn the feature request into a more precise explanation of how it can be added.
In short we have now all the necessary tools to add a Pipeline like Prompt-2-prompt in a nice & clean way.
Prompt-2-prompt is an official pipeline with paper release and 1k+ stars, so IMO we should put it in src/diffusers/pipelines
What you'll need to do:
Prompt2PromptPipeline
that can be more or less copied from StableDiffusionPipeline
__call__
function. Those can be passed via cross_attention_kwargs={"swapped_out_embeds"=..., swapped_in_embeds=...}
to: https://github.com/huggingface/diffusers/blob/7436e30c720547c97b602155b8f5690976efdbc4/src/diffusers/models/unet_2d_condition.py#L403 which will then passed to your customized prompt-2-prompt attention processor.Very keen on guiding someone from the community through a PR, but currently don't find the time to do it
You may also reference InvokeAI's update for the diffusers 0.12 attention API: https://github.com/invoke-ai/InvokeAI/pull/2385
A few caveats:
So I had the following attention processors in mind for this variant of the prompt-to-prompt: https://github.com/cccntu/efficient-prompt-to-prompt
class CrossAttnKVProcessor:
def __call__(
self, attn: CrossAttention, hidden_states, key_hidden_states=None, value_hidden_state=None, attention_mask=None
):
_, sequence_length, _ = hidden_states.shape
attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)
query = attn.to_q(hidden_states)
query = attn.head_to_batch_dim(query)
key_hidden_states = key_hidden_states if key_hidden_states is not None else hidden_states
value_hidden_state = value_hidden_state if value_hidden_state is not None else hidden_states
key = attn.to_k(key_hidden_states)
value = attn.to_v(value_hidden_state)
key = attn.head_to_batch_dim(key)
value = attn.head_to_batch_dim(value)
attention_probs = attn.get_attention_scores(query, key, attention_mask)
hidden_states = torch.bmm(attention_probs, value)
hidden_states = attn.batch_to_head_dim(hidden_states)
# linear proj
hidden_states = attn.to_out[0](hidden_states)
# dropout
hidden_states = attn.to_out[1](hidden_states)
return hidden_states
class XFormersCrossAttnKVProcessor:
def __call__(
self, attn: CrossAttention, hidden_states, key_hidden_states=None, value_hidden_state=None, attention_mask=None
):
_, sequence_length, _ = hidden_states.shape
attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)
query = attn.to_q(hidden_states)
key_hidden_states = key_hidden_states if key_hidden_states is not None else hidden_states
value_hidden_state = value_hidden_state if value_hidden_state is not None else hidden_states
key = attn.to_k(key_hidden_states)
value = attn.to_v(value_hidden_state)
query = attn.head_to_batch_dim(query).contiguous()
key = attn.head_to_batch_dim(key).contiguous()
value = attn.head_to_batch_dim(value).contiguous()
hidden_states = xformers.ops.memory_efficient_attention(query, key, value, attn_bias=attention_mask)
hidden_states = hidden_states.to(query.dtype)
hidden_states = attn.batch_to_head_dim(hidden_states)
# linear proj
hidden_states = attn.to_out[0](hidden_states)
# dropout
hidden_states = attn.to_out[1](hidden_states)
return hidden_states
class SlicedAttnKVProcessor:
def __init__(self, slice_size):
self.slice_size = slice_size
def __call__(
self, attn: CrossAttention, hidden_states, key_hidden_states=None, value_hidden_state=None, attention_mask=None
):
_, sequence_length, _ = hidden_states.shape
attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)
query = attn.to_q(hidden_states)
dim = query.shape[-1]
query = attn.head_to_batch_dim(query)
key_hidden_states = key_hidden_states if key_hidden_states is not None else hidden_states
value_hidden_state = value_hidden_state if value_hidden_state is not None else hidden_states
key = attn.to_k(key_hidden_states)
value = attn.to_v(value_hidden_state)
key = attn.head_to_batch_dim(key)
value = attn.head_to_batch_dim(value)
batch_size_attention = query.shape[0]
hidden_states = torch.zeros(
(batch_size_attention, sequence_length, dim // attn.heads), device=query.device, dtype=query.dtype
)
for i in range(hidden_states.shape[0] // self.slice_size):
start_idx = i * self.slice_size
end_idx = (i + 1) * self.slice_size
query_slice = query[start_idx:end_idx]
key_slice = key[start_idx:end_idx]
attn_mask_slice = attention_mask[start_idx:end_idx] if attention_mask is not None else None
attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx])
hidden_states[start_idx:end_idx] = attn_slice
hidden_states = attn.batch_to_head_dim(hidden_states)
# linear proj
hidden_states = attn.to_out[0](hidden_states)
# dropout
hidden_states = attn.to_out[1](hidden_states)
return hidden_states
self, attn: CrossAttention, hidden_states, key_hidden_states=None, value_hidd
Sure, this seems reasonable, guess would be great to see it in a pipeline class directly :-)
Is this open? Would be happy to take it up!
@unography yes it's open, please feel free to contribute!
@kashif sure, will add a draft PR soon
This looks plausible thanks! Furthermore, with the xformers implementation, how can we retrieve softmaxed k*q attention map (before applying to values)? See here: https://github.com/facebookresearch/xformers/blob/5df1f0b682a5b246577f0cf40dd3b15c1a04ce50/xformers/ops/fmha/__init__.py#L149
class XFormersCrossAttnKVProcessor: def __call__( self, attn: CrossAttention, hidden_states, key_hidden_states=None, value_hidden_state=None, attention_mask=None ): _, sequence_length, _ = hidden_states.shape attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length) query = attn.to_q(hidden_states) key_hidden_states = key_hidden_states if key_hidden_states is not None else hidden_states value_hidden_state = value_hidden_state if value_hidden_state is not None else hidden_states key = attn.to_k(key_hidden_states) value = attn.to_v(value_hidden_state) query = attn.head_to_batch_dim(query).contiguous() key = attn.head_to_batch_dim(key).contiguous() value = attn.head_to_batch_dim(value).contiguous() hidden_states = xformers.ops.memory_efficient_attention(query, key, value, attn_bias=attention_mask) hidden_states = hidden_states.to(query.dtype) hidden_states = attn.batch_to_head_dim(hidden_states)
Taking a step back -- I question the actual usefulness of "prompt-to-prompt". Why would someone generate an image with the wrong prompt in the first place?? If I wanted a "box of cookies", why did I type "box of apples"?
Plus, there are now more powerful and flexible techniques available. The paper below requires no input prompt, just a raw image, from which it extracts various features from the diffusion layers and applies them to a new prompt. This seems much more in line with a normal image workflow than prompt-to-prompt. Cheers.
If useful for anyone, I've implemented an Attend-to-Excite with the AttentionProcessors, an example is here: https://github.com/evinpinar/Attend-and-Excite-diffusers/blob/72fa567a1e3bb3cc1b63cb53a1d9db5fc10b241f/utils/ptp_utils.py#L57
class AttendExciteCrossAttnProcessor:
def __init__(self, attnstore, place_in_unet):
super().__init__()
self.attnstore = attnstore
self.place_in_unet = place_in_unet
def __call__(self, attn: CrossAttention, hidden_states, encoder_hidden_states=None, attention_mask=None):
batch_size, sequence_length, _ = hidden_states.shape
attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length)
query = attn.to_q(hidden_states)
is_cross = encoder_hidden_states is not None
encoder_hidden_states = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
key = attn.to_k(encoder_hidden_states)
value = attn.to_v(encoder_hidden_states)
query = attn.head_to_batch_dim(query)
key = attn.head_to_batch_dim(key)
value = attn.head_to_batch_dim(value)
attention_probs = attn.get_attention_scores(query, key, attention_mask)
self.attnstore(attention_probs, is_cross, self.place_in_unet)
hidden_states = torch.bmm(attention_probs, value)
hidden_states = attn.batch_to_head_dim(hidden_states)
# linear proj
hidden_states = attn.to_out[0](hidden_states)
# dropout
hidden_states = attn.to_out[1](hidden_states)
return hidden_states
def register_attention_control(model, controller):
attn_procs = {}
cross_att_count = 0
for name in model.unet.attn_processors.keys():
cross_attention_dim = None if name.endswith("attn1.processor") else model.unet.config.cross_attention_dim
if name.startswith("mid_block"):
hidden_size = model.unet.config.block_out_channels[-1]
place_in_unet = "mid"
elif name.startswith("up_blocks"):
block_id = int(name[len("up_blocks.")])
hidden_size = list(reversed(model.unet.config.block_out_channels))[block_id]
place_in_unet = "up"
elif name.startswith("down_blocks"):
block_id = int(name[len("down_blocks.")])
hidden_size = model.unet.config.block_out_channels[block_id]
place_in_unet = "down"
else:
continue
cross_att_count += 1
attn_procs[name] = AttendExciteCrossAttnProcessor(
attnstore=controller, place_in_unet=place_in_unet
)
model.unet.set_attn_processor(attn_procs)
controller.num_att_layers = cross_att_count
Super cool! @evinpinar feel free to open a PR to add this as a new pipeline. Maybe this PR is a good example of how to add a new simple pipeline: https://github.com/huggingface/diffusers/pull/2223
Amazing work :heart:
@evinpinar Looks awesome!
Btw, is there a pr like this for the prompt to prompt? I just want to check out the implementation for research. If not happy to make one based on @evinpinar code
Hi, everyone, I just implement a pipeline here. https://github.com/Weifeng-Chen/prompt2prompt base on @evinpinar code. Borrow from google's prompt-to-prompt. using a 'controller' to replace, refine, or reweight. the controller now is outside the pipelin, don't know whether to put it to the pipeline. here's a reference code now, any advice for the api?
from pipeline_prompt2prompt import Prompt2PromptPipeline
from ptp_utils import AttentionStore, AttentionReplace, LocalBlend, AttentionRefine, AttentionReweight, view_images, get_equalizer
import torch
import numpy as np
g_cpu = torch.Generator().manual_seed(2333)
device = "cuda"
pipe = Prompt2PromptPipeline.from_pretrained("CompVis/stable-diffusion-v1-4" ).to(device)
prompts = ["A painting of a squirrel eating a burger",
"A painting of a cat eating a burger"]
NUM_DIFFUSION_STEPS = 20
lb = LocalBlend(prompts, ("squirrel", "cat"), tokenizer=pipe.tokenizer, device=pipe.device)
controller = AttentionReplace(prompts, NUM_DIFFUSION_STEPS, cross_replace_steps=.4, self_replace_steps=0.4, tokenizer=pipe.tokenizer, device=pipe.device, local_blend=lb)
outputs = pipe(prompt=prompts, height=512, width=512, num_inference_steps=NUM_DIFFUSION_STEPS,
controller=controller, generator=g_cpu,)
view_images( [np.array(img) for img in outputs.images] )
pipe.show_cross_attention(prompts, controller, res=16, from_where=("up", "down"), select=0)
pipe.show_cross_attention(prompts, controller, res=16, from_where=("up", "down"), select=1)
for more operation, have a look at https://github.com/Weifeng-Chen/prompt2prompt/blob/main/p2p_test.ipynb
@Weifeng-Chen Thanks and awesome!
Thanks @Weifeng-Chen
I have a dumb question: when doing a refinement, what does self_replace_steps
and cross_replace_steps
mean actually?
Say I want to switch between two prompts: "A painting of a squirrel eating a burger" and "A real photo of a squirrel eating a burger" at 0.7. What values do I set to these two arguments in AttentionReplace(
?
Thanks @Weifeng-Chen
I have a dumb question: when doing a refinement, what does
self_replace_steps
andcross_replace_steps
mean actually?Say I want to switch between two prompts: "A painting of a squirrel eating a burger" and "A real photo of a squirrel eating a burger" at 0.7. What values do I set to these two arguments in
AttentionReplace(
?
you can try: cross_replace_steps=0., self_replace_steps=0. means no replacement and totally generate a new image from scratch. I think, when inference, the new prompt will generate new cross-attn and self-attn maps, and replace it with the origin one. larger steps can let it more similar to the origin one but may restrict the editing. I didn't fully test it and point me out if I'm wrong.
But then where does 0.7
go? š¤
But then where does
0.7
go? š¤
you can try to change it. 0.7 means the first 70% steps using the origin prompt's attention and the rest 30% use the new one.
Yes that's what I am trying to achieve.
So does that mean I set both to .7? cross_replace_steps=0.7, self_replace_steps=0.7
Thank you!
Yes that's what I am trying to achieve. So does that mean I set both to .7?
cross_replace_steps=0.7, self_replace_steps=0.7
Thank you!
not necessary to be same. self-attn don't interact with the text.
Hi, I'd like to take on the EDICT implementation, if someone hasn't started it.
Any updates in this thread? Looking forward to it!
Note that we've already added the pix2pix0 pipeline which is an improved version of prompt2prompt: https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/pix2pix_zero
I'm not sure how much sense prompt2prompt makes given that an improved version has already been added.
I'm not necessarily pushing for it, but I will say that what methods like Prompt-to-Prompt and EDICT have over pix2pix zero is the lack of a need to generate source and target embeddings. In the case of editing real images, pix2pix zero would require you to not only undergo inversion steps, but also generate the source and target embeddings and get their difference before you can generate new images. With (the original) Prompt-to-Prompt paper as well as EDICT, you'd only need to undergo the inversion steps before generating the final images.
I agree with @ryan-caesar-ramos , I think those serve different purposes and both could be part of a toolbox on diffusers. I think we would love a community contributed PR on p2p and EDICT!
I was planning to work on this, but ended up using the pix2pix pipeline instead.
But like @apolinario and @ryan-caesar-ramos mentioned, it would be cool to have this. I'll work on p2p this week and raise a PR
With the release soon(tm) of p2p-video, this gets even more relevant imo: https://video-p2p.github.io
Do in understand correctly that adding prompt-to-prompt re-weightning is not that difficult now, but it's impossible to have it and xformers together, since we need to modify self-attention and xformers doesn't explicitly expose it?
Anybody interested in picking up this feature request? Also related: https://github.com/huggingface/diffusers/issues/2979
@Weifeng-Chen Nice implementation, I wonder why you didn't raised a PR yet!
@Weifeng-Chen Nice implementation, I wonder why you didn't raised a PR yet!
yeap, it is functional but not so elegant. I'm currently very busy and have no time to do it. maybe I'll do it if I got some time.
Alright! Let me know if you have any idea on making it more elegant, I didnt run the code, just looked at it but maybe instead of creating the controller outside you can pass them in Prompt2PromptPipeline's call function along with one more parameter to pass the controller class type (string: AttentionReweight/AttentionRefine etc.) and then initialiIzing the Controller class inside it (in Prompt2PromptPipeline class)can make it slightly elegant
@Weifeng-Chen Thank you for your work! Will your pipeline be able to support batch_size > 1 (i.e. I can generate variants of >1 images at the same time)
Anybody interested in picking up this feature request? Also related: #2979
@patrickvonplaten If this is not urgent, I'd like to give it a try & would do it until end of July / start of August. I'm doing the fastai part 2 course and have done several contributions to other OSS projects (LangChain, gpt-engineer, ...). Solving this issues seems like a very cool learning goal. :)
That would be great!
Quick update: I have started working on this, should be done in about a week
@UmerHA How can I contact to you to help with this pipeline?
@UmerHA How can I contact to you to help with this pipeline?
@UmerHAdil on Twitter or umerha on Discord. I'm getting close to being done. Will upload something into my branch today, so you can have a look.
I want to give an update: I'm getting close to being done.
The usage of this pipeline would look like this:
pipe = Prompt2PromptPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")
prompts = ["A turtle playing with a ball", "A monkey playing with a ball"]
edit_kwargs = {
"cross_replace_steps": 0.4,
"self_replace_steps": 0.4,
"local_blend_words": ["turtle", "monkey"]
}
outputs = pipe(prompt=prompts, height=512, width=512, num_inference_steps=50, edit_type='replace', edit_kwargs=edit_kwargs)
See this file for examples of all edit types.
@patrickvonplaten I have three questions:
Currently, the source and target images are produced simultaneously. In real-life usage, I would imagine they'd be created sequentially. I'd first create a source image, iterate through that until I'm happy with it, and only then edit it. This implies (a) saving the source image's attention maps to the pipe output object, and (b) accepting it as an optional parameter in __call__
. Should we do that?
Is integration testing with stable-diffusion-v1-4
and size 512 x 512
okay, or would you choose a smaller model / shape?
Should we add code to visualize the averaged attention masks? Would that be used in practice? For reference, the paper and @Weifeng-Chen's code include that.
The current code can be found in this repo.
What's left to do:
prompt_embeds
, negative_prompt_embeds
, text_encoder_lora_scale
)Thanks for giving me the opportunity to do this! Have learned a lot. Appreciate it :)
@evinpinar Thanks, your register_attention_control
is great for diffusers!!
Describe the solution you'd like Now that we have an official way to tweak cross attention https://github.com/huggingface/diffusers/pull/1639 , would be great to have a pipeline (be it official or community) for
prompt-to-prompt
and further implementations of the technique (such as EDICT).Describe alternatives you've considered @amirhertz official Prompt-to-Prompt implementation is built on top of diffusers 0.3.0 with their own cross attention manipulation function. @bloc97 community prompt-to-prompt implementation already uses diffusers, but it is pinned to version 0.4.1, also with a cross attention control of their own. @bram-w / Salesforce EDICT , that adds inversion to prompt-to-prompt (allowing you to edit real images) also uses the above as a base with some modifications for double precision for inversion.
So while alternatives exist, they require users to pin old versions of diffusers and not enjoy the latest advancements. Given this technique is very useful, having it on a pipeline within diffusers could be really great. Also could potentially leverage the technique to other models (Karlo, IF, etc.)
Additional context InstructPix2Pix and Imagic have shown how editing real and edited images is a trend. Prompt-to-prompt is a nice tool to have on that belt for practitioners, artists and professionals.