huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Added the ability to set SDXL `Micro-Conditioning` embeddings as 0 #4208

Closed · budui closed this issue 1 year ago

budui commented 1 year ago

Is your feature request related to a problem? Please describe.

During the SDXL training process, it may be necessary to pass in a zero embedding as Micro-Conditioning embeddings:

https://github.com/Stability-AI/generative-models/blob/e25e4c0df1d01fb9720f62c73b4feab2e4003e3f/sgm/modules/encoders/modules.py#L151-L161


# these lines randomly zero the embedding when `ucg_rate` > 0
                if embedder.ucg_rate > 0.0 and embedder.legacy_ucg_val is None:
                    emb = (
                        expand_dims_like(
                            torch.bernoulli(
                                (1.0 - embedder.ucg_rate)
                                * torch.ones(emb.shape[0], device=emb.device)
                            ),
                            emb,
                        )
                        * emb
                    )
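The masking above can be sketched standalone. This is an illustrative reimplementation of the quoted sgm logic, not diffusers API; the function name is made up:

```python
import torch

def randomly_zero_embedding(emb: torch.Tensor, ucg_rate: float) -> torch.Tensor:
    # Draw a per-sample keep mask with keep probability (1 - ucg_rate),
    # then broadcast it over the embedding dims, as the sgm snippet does
    # with expand_dims_like.
    keep = torch.bernoulli(
        (1.0 - ucg_rate) * torch.ones(emb.shape[0], device=emb.device)
    )
    keep = keep.reshape(emb.shape[0], *([1] * (emb.dim() - 1)))
    return keep * emb
```

With `ucg_rate=0.1`, each sample's embedding in the batch is replaced by zeros with probability 0.1, independently of the other samples.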

https://github.com/Stability-AI/generative-models/blob/e25e4c0df1d01fb9720f62c73b4feab2e4003e3f/configs/example_training/txt2img-clipl-legacy-ucg-training.yaml#L65

# SDXL sets the `ucg_rate` of the `original_size_as_tuple` embedder to 0.1,
# so during training we need to pass a zero embedding as the added embedding
# for the UNet's time embedding
            ucg_rate: 0.1
            input_key: original_size_as_tuple
            target: sgm.modules.encoders.modules.ConcatTimestepEmbedderND
            params:
              outdim: 256  # multiplied by two

The current SDXL `UNet2DConditionModel` accepts `encoder_hidden_states`, `time_ids`, and `add_text_embeds` as conditions.

https://github.com/huggingface/diffusers/blob/2e53936c97d167713c9e97414160124861fa4b68/src/diffusers/models/unet_2d_condition.py#L843-L854

To correctly finetune the SDXL model, we need to randomly set the condition embeddings to 0 with a suitable probability. While it is easy to pass zero embeddings for `encoder_hidden_states` and `add_text_embeds`, there is no way to zero `time_embeds` at line 849.

The original SDXL uses separate embedders to convert the different micro-conditions into Fourier features, and during training each embedder's Fourier features are randomly zeroed independently. Therefore, `UNet2DConditionModel` needs to be able to zero each part of `time_embeds` independently.
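Concretely, each micro-condition (original size, crop coordinates, target size) gets its own embedding and its own independent drop. A minimal sketch, where `fourier_embed` is an illustrative stand-in for `ConcatTimestepEmbedderND` (not the actual sgm implementation):

```python
import math
import torch

def fourier_embed(values: torch.Tensor, dim: int = 256) -> torch.Tensor:
    # Sinusoidal features per scalar, concatenated along the last dim:
    # (B, N) -> (B, N * dim).
    half = dim // 2
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half
    )
    args = values.float()[..., None] * freqs                      # (B, N, half)
    emb = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)   # (B, N, dim)
    return emb.reshape(values.shape[0], -1)

def embed_micro_conditions(original_size, crop_coords, target_size, ucg_rate=0.1):
    # One independent Bernoulli drop per micro-condition, mirroring one
    # `ucg_rate` per embedder in the sgm config.
    parts = []
    for cond in (original_size, crop_coords, target_size):
        e = fourier_embed(cond)
        keep = torch.bernoulli((1.0 - ucg_rate) * torch.ones(e.shape[0]))[:, None]
        parts.append(keep * e)
    return torch.cat(parts, dim=-1)  # (B, 3 * 2 * 256) = (B, 1536)
```

Because the drops are independent, a training batch can have, say, only the crop embedding zeroed while the size embeddings stay intact, which a single global switch on `time_embeds` cannot express.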

Describe the solution you'd like

Added the ability to set SDXL Micro-Conditioning embeddings as 0.

Describe alternatives you've considered

Perhaps diffusers could allow users to pass in `time_embeds` directly, and if `time_embeds` is present, `time_ids` would no longer be used?

if "time_embeds" in added_cond_kwargs:
    time_embeds = added_cond_kwargs.get("time_embeds")
else:
    time_ids = added_cond_kwargs.get("time_ids")
    time_embeds = self.add_time_proj(time_ids.flatten())
time_embeds = time_embeds.reshape((text_embeds.shape[0], -1))

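A caller could then precompute (and selectively zero) the Fourier features and hand them to the UNet. This sketch assumes the proposed, not-yet-existing `time_embeds` key and default SDXL dimensions (6 micro-condition scalars × outdim 256 = 1536):

```python
import torch

batch = 2
# All-zero Fourier features, i.e. the fully dropped micro-condition
time_embeds = torch.zeros(batch, 1536)

added_cond_kwargs = {
    "text_embeds": torch.randn(batch, 1280),
    "time_embeds": time_embeds,  # proposed key; `time_ids` would then be ignored
}
# noise_pred = unet(latents, timestep, encoder_hidden_states,
#                   added_cond_kwargs=added_cond_kwargs).sample
```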
sayakpaul commented 1 year ago

Thanks for the detailed issue. Yes, we're aware of this issue.

@patrickvonplaten I suppose you were working on it?

patrickvonplaten commented 1 year ago

Actually only now noticed this - thanks for bringing it up @budui !

Do you think it's also important to provide this feature for inference or just for training?

budui commented 1 year ago

Both training and inference need this feature. For training, diffusers should be able to reproduce Stability AI's training scripts. For inference, the current SDXL pipeline has no way to specify a negative micro-condition (either as a specific value or as a zero embedding).

I did a quick experiment, specifying a negative condition:

A: The condition and the negative condition use the same micro-conditions, as the diffusers SDXL pipeline does now.


# prompt: "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# seed: 1000
# original size (1024, 1024) vs (1024, 1024)
condition=dict(
        caption=prompt,
        crop_left=0,
        crop_top=0,
        original_height=1024,
        original_width=1024,
        target_height=1024,
        target_width=1024,
),
negative_condition=dict(
        caption="",
        crop_left=0,
        crop_top=0,
        original_height=1024,
        original_width=1024,
        target_height=1024,
        target_width=1024,
),

[image: size1]

B: The negative condition uses a smaller original size, resulting in a sharper image.

# prompt: "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# seed: 1000
# original size (1024, 1024) vs (512, 512)
condition=dict(
        caption=prompt,
        crop_left=0,
        crop_top=0,
        original_height=1024,
        original_width=1024,
        target_height=1024,
        target_width=1024,
),
negative_condition=dict(
        caption="",
        crop_left=0,
        crop_top=0,
        original_height=512,
        original_width=512,
        target_height=1024,
        target_width=1024,
),

[image: size2-512]
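For reference, the condition dicts above map onto SDXL's `add_time_ids` in a fixed order: original size, crop top-left, target size, each as (height/top, width/left). A minimal sketch (the function name is made up for illustration):

```python
import torch

def micro_conditions_to_time_ids(cond: dict) -> torch.Tensor:
    # SDXL packs its micro-conditions into `time_ids` in this fixed order:
    # (original_height, original_width, crop_top, crop_left,
    #  target_height, target_width)
    return torch.tensor([[
        cond["original_height"], cond["original_width"],
        cond["crop_top"], cond["crop_left"],
        cond["target_height"], cond["target_width"],
    ]], dtype=torch.float32)

# Experiment B's negative condition as a (1, 6) time_ids tensor
negative_time_ids = micro_conditions_to_time_ids(dict(
    original_height=512, original_width=512,
    crop_top=0, crop_left=0,
    target_height=1024, target_width=1024,
))
```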

I haven't tested the effect of using a zero embedding as a negative condition, because I haven't found a quick workaround to do it. But I'd be happy to do more testing after diffusers adds a way to specify zero embeddings in the UNet.

sayakpaul commented 1 year ago

@budui sorry for the delay on our end. Would you maybe be willing to contribute this feature in a PR? We're more than happy to help out.

patrickvonplaten commented 1 year ago

@sayakpaul do you want to give this PR/issue a try?

sayakpaul commented 1 year ago

Yeah