Hey @NormXU, you could also just do the following, no?
# we skip one layer of the encoder
text_encoder = CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder", num_hidden_layers=11, torch_dtype=torch.float16)
controlnet = ControlNetModel.from_pretrained(checkpoint, torch_dtype=torch.float16)
By loading the text encoder only with 11 layers you are skipping the final layer.
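For completeness, a sketch of wiring both components into the pipeline (assuming StableDiffusionControlNetPipeline, since a ControlNet is loaded above):
from diffusers import StableDiffusionControlNetPipeline

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    text_encoder=text_encoder,  # the 11-layer text encoder from above
    controlnet=controlnet,
    torch_dtype=torch.float16,
)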
Hey @patrickvonplaten, thank you for your quick reply.
Setting num_hidden_layers while initializing the text encoder is a good choice. However, I think it would be easier to use if we could set clip_skip as an editable parameter for diffusion forward inference, such as:
@torch.no_grad()
def __call__(
    self,
    clip_skip: Optional[int] = 1,
):
Otherwise, we need to initialize a new text encoder every time we want to change the value of clip_skip.
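To make the proposal concrete, here is a rough sketch (a hypothetical helper, not existing diffusers code) of how such a clip_skip value could be consumed when encoding the prompt, using the convention in this thread where clip_skip = 1 keeps the last layer and clip_skip = 2 uses the penultimate one:
import torch

def encode_prompt_with_clip_skip(pipe, prompt: str, clip_skip: int = 1) -> torch.Tensor:
    # tokenize with the pipeline's own tokenizer
    text_inputs = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    input_ids = text_inputs.input_ids.to(pipe.text_encoder.device)
    with torch.no_grad():
        outputs = pipe.text_encoder(input_ids, output_hidden_states=True)
    if clip_skip <= 1:
        return outputs.last_hidden_state
    # hidden_states[-k] is the k-th layer from the end, before the final layer norm,
    # so the norm still needs to be applied to match the usual output
    hidden = outputs.hidden_states[-clip_skip]
    return pipe.text_encoder.text_model.final_layer_norm(hidden)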
@NormXU, can you explain a bit more when people would want to change clip_skip during runtime? Usually a model performs best with one specific layer output, not with multiple, IMO, so I don't think one would really change this during runtime. Can you explain when this is necessary?
@patrickvonplaten I usually change clip_skip to see whether I can get a better-looking generation, especially when I download a new model and want to see which layer output from the text encoder best fits my text description. I also like to compare how the CLIP text encoder influences the generation by setting clip_skip from 1 to 3.
But you are right; in my use cases it seems it isn't necessary to introduce a new parameter into the runtime.
CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder", num_hidden_layers=11, torch_dtype=torch.float16)
is good and convenient enough to use.
@NormXU, can you explain a bit more when people would want to change clip_skip during runtime? Usually a model performs best with one specific layer output, not with multiple, IMO, so I don't think one would really change this during runtime. Can you explain when this is necessary?
clip_skip 2 is generally used with models like Anything; using clip_skip 2 has shown positive results during image generation compared to clip_skip 1 for anime-related models. However, most users tend to use either clip_skip 1 or 2 to ideate, so using num_hidden_layers=11 should suffice.
Any guidance on how to use this when loading models from .safetensors files?
pipeline = StableDiffusionPipeline.from_single_file("/.safetensors", torch_dtype=torch.float16, clip_model=clip_model)
The clip_model=clip_model argument is being ignored.
Where do you see clip_model as being an input? I think it should be text_encoder=clip_model, no?
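In case it helps, a hedged sketch of that suggestion (the path is a placeholder; this assumes from_single_file forwards a text_encoder argument, as suggested above):
import torch
from transformers import CLIPTextModel
from diffusers import StableDiffusionPipeline

# load the text encoder with the final layer dropped, then hand it to the pipeline
clip_model = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    subfolder="text_encoder",
    num_hidden_layers=11,
    torch_dtype=torch.float16,
)
pipeline = StableDiffusionPipeline.from_single_file(
    "/path/to/model.safetensors",  # placeholder path
    text_encoder=clip_model,
    torch_dtype=torch.float16,
)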
Oh yeah, that seems to have worked. I've got a follow-up question on this though: when combining this with compel (with prompt truncation disabled), I'm running into an error on this line:
negative_conditioning = compel.build_conditioning_tensor(negative_prompt)
'index out of range in self'
@patrickvonplaten I usually change clip_skip to see whether I can get a better-looking generation, especially when I download a new model and want to see which layer output from the text encoder best fits my text description. I also like to compare how the CLIP text encoder influences the generation by setting clip_skip from 1 to 3. But you are right; in my use cases it seems it isn't necessary to introduce a new parameter into the runtime.
CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder", num_hidden_layers=11, torch_dtype=torch.float16)
is good and convenient enough to use.
Hello. Given the recent introduction of new functionality, our existing approach may no longer be viable. Consider the implementation of LoraLoaderMixin:
class LoraLoaderMixin:
    ...
    def load_lora_weights(...):
        ...
Should there be a mismatch between the keys of the LoRA weights and the CLIPTextModel, the loading process could be disrupted. For instance, if the LoRA weights have 12 layers, but the text encoder only encompasses 11, an error similar to the following would be encountered:
...
  File "/opt/conda/envs/backend-new/lib/python3.7/site-packages/diffusers/loaders.py", line 882, in load_lora_weights
    lora_scale=self.lora_scale,
  File "/opt/conda/envs/backend-new/lib/python3.7/site-packages/diffusers/loaders.py", line 1148, in load_lora_into_text_encoder
    f"failed to load text encoder state dict, unexpected keys: {load_state_dict_results.unexpected_keys}"
ValueError: failed to load text encoder state dict, unexpected keys: ['text_model.encoder.layers.11.mlp.fc1.lora_linear_layer.down.weight', ...
Given this scenario, I propose two potential solutions:
- Incorporate clip_skip as an argument in __call__ and modify the inference flow accordingly.
- Implement a failsafe loading mechanism for LoRA-like mixin extensions.
How can I define CLIPTextModel?
Hmm, is clip_skip really that important a feature? I still haven't seen a use case where a model produces better results with a clip_skip different from what it was trained with. Would love to see some concrete examples.
Also cc @sayakpaul and @yiyixuxu FYI
Almost every SD1.5 anime model is based off the NAI model, which was trained for a huge number of steps at CLIP Skip 2. If you run them at CLIP Skip 1 (which AIUI is a bit of a misnomer since it doesn't skip any layers, but whatever), you'll often get weird glitches in fine details or strange fractals in the background; it's been the source of quite some annoyance with some merges/finetunes.
A couple of not particularly great examples:
The second one is a better example; at CLIP skip 1 the model failed to pick up on the "arms behind back" in the prompt, but it followed it correctly at CLIP skip 2.
Is this specific to anime models only? How does this approach generalize to other kinds of models?
The solution shown in the OP seems like the best approach, as we cannot really change things at the end of transformers.
If we can see some diverse and varied examples where using CLIP Skip has been truly crucial, I think the original solution proposed in the OP could be directly incorporated.
Do you have any reproducible code @neggles ?
Sorry for the slow reply, have a lot of irons in the fire.
Is this specific to anime models only? How does this approach generalize to other kinds of models?
It's most effective for anime models, since they were largely trained using the penultimate layer states, but the same approach works just fine on base SD1.5 (albeit with rather mixed results) as well as on models like OpenJourney v4:
portrait of a girl, long blonde hair, white off shoulder dress, gold trim, digital art, chiaroscuro, light particles, cinematic light, dramatic, greg rutkowski, looking at viewer, masterpiece, intricate, best quality, fine detail, artgerm, outdoors, forest
Negative: bad artist, bad quality, monochrome, blurry, bad hands, low quality, jpeg artifacts
576x640, DPMSolverSinglestepScheduler with karras sigmas, 21 steps, CFG at 7.75, seed of 3134342485.
(This was generated using A1111, but the Diffusers outputs should be similar)
It's also worth noting that SD2.1 was trained entirely using CLIP penultimate layer states (on an unfortunately undertrained text encoder, but that's beside the point).
Do you have any reproducible code @neggles ?
My implementation of the solution OP proposed is used in my AnimateDiff fork - seems to work fine. It's almost a straight copy, but I tried to stick a little closer to the existing Diffusers code style. I just instantiate the pipeline by loading the CLIPSkipTextModel manually, then passing it to StableDiffusionPipeline() (or in that particular case the AnimateDiff version, but the concept is the same).
tl;dr yeah, it's primarily an anime model thing, but it just comes down to how the model was trained, and an awful lot of models - even realism-focused ones - are based off anime models / trained with CLIP penultimate layer states. It's really just another tweakable knob 🤷
@neggles what about the other inference-time parameters, etc.?
Very interested in this - I'm currently struggling to get this to work with single-file checkpoints.
Hey @NormXU, you could also just do the following, no?
# we skip one layer of the encoder
text_encoder = CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder", num_hidden_layers=11, torch_dtype=torch.float16)
controlnet = ControlNetModel.from_pretrained(checkpoint, torch_dtype=torch.float16)
By loading the text encoder only with 11 layers you are skipping the final layer.
While this seems to work, the problem with this approach is that one has to guess how many layers there are, and as far as my Python knowledge goes (not that much), I couldn't find a proper way to do that.
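For what it's worth, one way to avoid guessing (a sketch relying on the model's config rather than any diffusers-specific API) is to read the layer count from the text encoder's config first:
import torch
from transformers import CLIPTextConfig, CLIPTextModel

repo = "runwayml/stable-diffusion-v1-5"  # or any checkpoint with the same layout
config = CLIPTextConfig.from_pretrained(repo, subfolder="text_encoder")
print(config.num_hidden_layers)  # 12 for SD1.x

# drop the final layer, i.e. the clip_skip = 2 behaviour discussed above
text_encoder = CLIPTextModel.from_pretrained(
    repo,
    subfolder="text_encoder",
    num_hidden_layers=config.num_hidden_layers - 1,
    torch_dtype=torch.float16,
)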
@sayakpaul see last line of code block - 576x640, DPMSolverSinglestepScheduler with karras sigmas, 21 steps, CFG at 7.75, seed of 3134342485.
This feature does seem to be requested quite a bit now - design-wise we have three options:
1. Add clip_skip to from_pretrained(...) => :no_entry_sign: because pipeline from_pretrained should only be for components
2. Add clip_skip to __call__ => don't think that's a good idea because clip_skip is very rarely switched for inference
3. Add a setter method:
def set_clip_output_layer(output_layer_idx: int):
    pass
which would allow the user to do:
pipe = DiffusionPipeline.from_pretrained("...")
pipe.set_clip_output_layer(...)
=> This would be a relatively simple PR where we only have to add the function to one SD pipeline and can then copy it to all other pipelines.
Wdyt? @neggles @sayakpaul ?
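For what it's worth, a rough sketch of what such a setter could do (a hypothetical helper, not the API that was eventually merged) is to truncate the text encoder's layer list in place while keeping a reference to the full stack so the setting stays reversible:
import torch
from diffusers import DiffusionPipeline

def set_clip_output_layer(pipe, output_layer_idx: int):
    # remember the full layer stack so the setting can be changed again later
    if not hasattr(pipe, "_full_clip_layers"):
        pipe._full_clip_layers = pipe.text_encoder.text_model.encoder.layers
    # keep only the first `output_layer_idx` transformer blocks of the CLIP text encoder
    pipe.text_encoder.text_model.encoder.layers = pipe._full_clip_layers[:output_layer_idx]

pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
set_clip_output_layer(pipe, 11)  # roughly clip_skip = 2 on a 12-layer encoder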
@patrickvonplaten I appreciate the approach with the setter method, but I have a couple of suggestions that might further optimize this process:
I suggest we target pipe.text_encoder directly instead of the broader pipe. By doing so, we can benefit from modifying text_encoder independently. This would be especially handy for tasks like retrieving emphasized embeds with tools like compel.
Secondly, it's crucial to ensure the robustness of the load_lora_weights method's compatibility with this setter. For instance, if only 10 out of the 12 blocks are present in the LoRA weights, the weights should still load correctly when the setter is configured with skip == 2.
Good points! Note however that the text_encoder lives in transformers, not in diffusers, so we cannot directly modify it. I'm also still a bit unsure about the importance of such a method here - is there really that much demand/need for it?
@patrickvonplaten I think this feature is fun for play and valuable for research.
For playing purposes: @neggles has demonstrated how clip_skip affects the generation. It can help SD adhere to the text instructions more strictly. I think this is because the CLIP text encoder is trained by contrastive learning between images and their captions. These captions can hardly preserve rich information, such as lighting, color, body gesture, and relationships, but only capture the small portion of information necessary for discriminating an image from the other negative samples in a training batch. Empirically, using an earlier text encoder output can somehow mitigate this problem.
For research purposes: SDXL has incorporated this trick into its architecture:
We opt for a more powerful pre-trained text encoder that we use for text conditioning. Specifically, we use OpenCLIP ViT-bigG [19] in combination with CLIP ViT-L [34], where we concatenate the penultimate text encoder outputs along the channel axis.
It looks like they also noticed the problem with the text encoder and solved it by concatenating two penultimate text encoder outputs along the channel axis. This is also a 'clip_skip' trick. SDXL paper link
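For reference, a small sketch of the concatenation described in that excerpt (model IDs and dimensions are those of the public SDXL base checkpoint, assumed here for illustration):
import torch
from transformers import CLIPTextModel, CLIPTextModelWithProjection, CLIPTokenizer

repo = "stabilityai/stable-diffusion-xl-base-1.0"
tok_1 = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
tok_2 = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer_2")
enc_1 = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")                  # CLIP ViT-L
enc_2 = CLIPTextModelWithProjection.from_pretrained(repo, subfolder="text_encoder_2")  # OpenCLIP bigG

prompt = "a photo of a cat"
with torch.no_grad():
    ids_1 = tok_1(prompt, padding="max_length", max_length=77, return_tensors="pt").input_ids
    ids_2 = tok_2(prompt, padding="max_length", max_length=77, return_tensors="pt").input_ids
    emb_1 = enc_1(ids_1, output_hidden_states=True).hidden_states[-2]   # penultimate layer, 768-dim
    emb_2 = enc_2(ids_2, output_hidden_states=True).hidden_states[-2]   # penultimate layer, 1280-dim

# concatenate along the channel (feature) axis, as described in the paper
prompt_embeds = torch.cat([emb_1, emb_2], dim=-1)  # 2048-dim per token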
Notably, recent VLMs (Visual-Language Models), such as LLaVA and BLIVA, also use this trick to align the penultimate image features with the LLM, which they claim gives better results.
We selected the ViT-G/14 from EVA-CLIP (Sun et al. 2023) as our visual encoder. The pre-trained weights are initialized and remain frozen during training. We removed the last layer from ViT (Dosovitskiy et al. 2020) and opted to use the output features of the second last layer, which yielded slightly better performance
-- from BLIVA paper
Good points! Note however that the text_encoder lives in transformers, not in diffusers, so we cannot directly modify it. I'm also still a bit unsure about the importance of such a method here - is there really that much demand/need for it?
Certainly; while clip_skip is a noteworthy feature, imo it seems that matters related to SDXL and its subsequent enhancements might take precedence in the priority list.
[snip] Add a setter method:
def set_clip_output_layer(output_layer_idx: int):
    pass
which would allow the user to do:
pipe = DiffusionPipeline.from_pretrained("...")
pipe.set_clip_output_layer(...)
=> This would be a relatively simple PR where we only have to add the function to one SD pipeline and can then copy it to all other pipelines.
Wdyt? @neggles @sayakpaul ?
This seems like a reasonable approach to me; it keeps from having to fiddle with Transformers directly and is straightforward to use/implement. It would be nicer to have it as an argument for __call__(), but it's not necessary by any means, and while people do sometimes want to change it between inference batches, it's not exactly difficult to add an extra method call as/where needed.
Come to think of it, this could also be quite handy for training use cases (wrapping the pipeline in a trainer class is a common way to approach that) depending on how it's implemented.
I had a whole thing I was going to drop in here summarizing the argument in favour from this end, but based on #4834 it seems like I don't need to bother.
@patrickvonplaten I usually change clip_skip to see whether I can get a better-looking generation, especially when I download a new model and want to see which layer output from the text encoder best fits my text description. I also like to compare how the CLIP text encoder influences the generation by setting clip_skip from 1 to 3. But you are right; in my use cases it seems it isn't necessary to introduce a new parameter into the runtime.
CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder", num_hidden_layers=11, torch_dtype=torch.float16)
is good and convenient enough to use.
Hello. Given the recent introduction of new functionality, our existing approach may no longer be viable. Consider the implementation of LoraLoaderMixin:
class LoraLoaderMixin:
    ...
    def load_lora_weights(...):
        ...
Should there be a mismatch between the keys of the LoRA weights and the CLIPTextModel, the loading process could be disrupted. For instance, if the LoRA weights have 12 layers, but the text encoder only encompasses 11, an error similar to the following would be encountered:
...
  File "/opt/conda/envs/backend-new/lib/python3.7/site-packages/diffusers/loaders.py", line 882, in load_lora_weights
    lora_scale=self.lora_scale,
  File "/opt/conda/envs/backend-new/lib/python3.7/site-packages/diffusers/loaders.py", line 1148, in load_lora_into_text_encoder
    f"failed to load text encoder state dict, unexpected keys: {load_state_dict_results.unexpected_keys}"
ValueError: failed to load text encoder state dict, unexpected keys: ['text_model.encoder.layers.11.mlp.fc1.lora_linear_layer.down.weight', ...
Given this scenario, I propose two potential solutions:
- Incorporate clip_skip as an argument in __call__ and modify the inference flow accordingly.
- Implement a failsafe loading mechanism for LoRA-like mixin extensions.
I met a similar issue and my solution is something like this:
from safetensors.torch import load_file

ckpt_path = '/path/to/ckpt.safetensors'
state_dict_lora = load_file(ckpt_path, device='cpu')

# drop the LoRA keys that target text-encoder layers removed by clip_skip
# (clip_skip and pipe are assumed to be defined earlier in the script)
new_state_dict_lora = {}
for k_, v_ in state_dict_lora.items():
    invalid_key = any(f'text_model_encoder_layers_{11 - layer_idx_}_' in k_ for layer_idx_ in range(clip_skip))
    if not invalid_key:
        new_state_dict_lora[k_] = v_

pipe.load_lora_weights(new_state_dict_lora)
It works well with the from_pretrained method, but does not seem to work with from_single_file. The error is that download_from_original_stable_diffusion_ckpt calls convert_ldm_clip_checkpoint, which then fails in set_module_tensor_to_device from accelerate because some modules are not present in the truncated model.
@NormXU you can already do this like this:
clip_layers = pipe.text_encoder.text_model.encoder.layers
if clip_skip > 0:
    pipe.text_encoder.text_model.encoder.layers = clip_layers[:-clip_skip]
The PR has been merged: https://github.com/huggingface/diffusers/issues/3870 , please try it out.
Hmm, is clip_skip really that important a feature? I still haven't seen a use case where a model produces better results with a clip_skip different from what it was trained with. Would love to see some concrete examples.
Also cc @sayakpaul and @yiyixuxu FYI
Example with clip_skip 1 vs 2.
Introduction
clip skip is a trick to feed the early-stopped features encoded by CLIPTextModel into the cross-attention. If clip_skip = 2, it means that we want to use the features from the layer before the last of the CLIP text encoder to guide our image generation. And our current diffusion pipeline can be regarded as clip_skip = 1, which means that we just use the features from the last layer of the CLIP text encoder. Here is a brief introduction to clip skip: webui-wiki and related discussion link.
A dominant majority of models need clip_skip=2 to reach a more aesthetic generation. I think adding this feature can give people more choices to optimize their generation.
Implementation
Adding clip_skip into diffusers is both simple and difficult. The main idea of clip_skip is simple; however, since our text encoder is imported from transformers, it is not easy to hack the CLIPTextModel in diffusers.
To do so, we need to overwrite CLIPTextModel and CLIPTextTransformer. Here is my implementation:
We can then use the overwritten clip_text_encoder in any _encode_prompt function of the diffusers pipeline. For example, in pipeline_stable_diffusion.py.
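(The implementation referenced above is not reproduced in this thread. For illustration only, a minimal independent sketch of the same idea - wrapping CLIPTextModel's forward rather than fully reimplementing CLIPTextTransformer - could look like this, using the clip_skip convention from this issue:)
from transformers import CLIPTextModel

class CLIPSkipTextModel(CLIPTextModel):
    # Hypothetical wrapper: expose the hidden states of an earlier encoder layer.
    # Convention: clip_skip = 1 keeps the usual last-layer output,
    # clip_skip = 2 uses the penultimate layer, and so on.
    def forward(self, input_ids, attention_mask=None, clip_skip: int = 1, **kwargs):
        kwargs.pop("output_hidden_states", None)
        kwargs.pop("return_dict", None)
        outputs = super().forward(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
            return_dict=True,
            **kwargs,
        )
        if clip_skip > 1:
            # hidden_states[-k] is the k-th layer from the end, before the final
            # layer norm, so the norm still has to be applied here
            hidden = outputs.hidden_states[-clip_skip]
            outputs.last_hidden_state = self.text_model.final_layer_norm(hidden)
        return outputs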
Implementing CLIPTextTransformer and CLIPTextModel to support clip_skip cleanly and nicely is difficult for me. I'd like to leave this issue to the diffusers team.