Closed budui closed 1 year ago
Thanks for the detailed issue. Yes, we're aware of this issue.
@patrickvonplaten I suppose you were working on it?
Actually only now noticed this - thanks for bringing it up @budui !
Do you think it's also important to provide this feature for inference or just for training?
Both training and inference need this feature. For training, diffusers may need the ability to reproduce Stability AI's training scripts. For inference, the current SDXL pipeline lacks a way to specify a negative micro-condition (either as specific values or as a zero embedding).
I did a quick experiment, specifying a negative condition:
```python
# prompt: "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# seed: 1000
# original size (1024, 1024) vs (1024, 1024)
condition=dict(
    caption=prompt,
    crop_left=0,
    crop_top=0,
    original_height=1024,
    original_width=1024,
    target_height=1024,
    target_width=1024,
),
negative_condition=dict(
    caption="",
    crop_left=0,
    crop_top=0,
    original_height=1024,
    original_width=1024,
    target_height=1024,
    target_width=1024,
),
```
```python
# prompt: "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# seed: 1000
# original size (1024, 1024) vs (512, 512)
condition=dict(
    caption=prompt,
    crop_left=0,
    crop_top=0,
    original_height=1024,
    original_width=1024,
    target_height=1024,
    target_width=1024,
),
negative_condition=dict(
    caption="",
    crop_left=0,
    crop_top=0,
    original_height=512,
    original_width=512,
    target_height=1024,
    target_width=1024,
),
```
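For reference, here is a minimal sketch of how such a condition dict maps onto the 6-dimensional micro-conditioning vector SDXL consumes. `make_time_ids` is a hypothetical helper, not a diffusers API; the ordering follows the `(original_size, crops_coords_top_left, target_size)` layout the SDXL pipeline uses to build `add_time_ids`:

```python
# Sketch only: translate the condition dicts above into SDXL's 6-dim
# micro-conditioning vector. `make_time_ids` is a hypothetical helper.
import torch

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"

def make_time_ids(cond: dict) -> torch.Tensor:
    # Ordering: (orig_h, orig_w, crop_top, crop_left, target_h, target_w)
    return torch.tensor([
        cond["original_height"], cond["original_width"],
        cond["crop_top"], cond["crop_left"],
        cond["target_height"], cond["target_width"],
    ], dtype=torch.float32)

add_time_ids = make_time_ids(dict(
    caption=prompt, crop_left=0, crop_top=0,
    original_height=1024, original_width=1024,
    target_height=1024, target_width=1024,
))
negative_add_time_ids = make_time_ids(dict(
    caption="", crop_left=0, crop_top=0,
    original_height=512, original_width=512,
    target_height=1024, target_width=1024,
))
```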
I haven't tested the effect of using a zero embedding as the negative condition, because I haven't found a quick workaround to do it. But I'd be happy to do more testing after diffusers adds a way to specify a zero embedding in the UNet.
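One way to see why a quick workaround is hard (a small sketch, assuming an SDXL UNet loaded with diffusers): passing zeros as `time_ids` does not produce a zero embedding, because the Fourier projection of 0 is not the zero vector.

```python
# Sketch: zero time_ids != zero time_embeds.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)

# (orig_h, orig_w, crop_top, crop_left, target_h, target_w) all set to 0
zero_ids = torch.zeros(6)
fourier = unet.add_time_proj(zero_ids)  # sinusoidal/Fourier features of the zero ids
print(fourier.abs().sum())              # > 0: not a zero embedding
```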
@budui sorry for the delay on our end. Would you maybe be willing to contribute this feature in a PR? We're more than happy to help out.
@sayakpaul do you want to give this PR/issue a try?
Yeah
**Is your feature request related to a problem? Please describe.**

During the SDXL training process, it may be necessary to pass in a zero embedding as the Micro-Conditioning embeddings:
https://github.com/Stability-AI/generative-models/blob/e25e4c0df1d01fb9720f62c73b4feab2e4003e3f/sgm/modules/encoders/modules.py#L151-L161
https://github.com/Stability-AI/generative-models/blob/e25e4c0df1d01fb9720f62c73b4feab2e4003e3f/configs/example_training/txt2img-clipl-legacy-ucg-training.yaml#L65
The current SDXL `UNet2DConditionModel` accepts `encoder_hidden_states`, `time_ids`, and `add_text_embeds` as conditions:
https://github.com/huggingface/diffusers/blob/2e53936c97d167713c9e97414160124861fa4b68/src/diffusers/models/unet_2d_condition.py#L843-L854
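For context, the relevant branch of `UNet2DConditionModel.forward` (paraphrased from the linked lines, not a verbatim copy) looks roughly like the sketch below; `time_ids` is always pushed through the Fourier projection, so a caller cannot inject a zero `time_embeds`:

```python
# Paraphrased sketch of the "text_time" additional-embedding path in
# UNet2DConditionModel.forward; `add_time_proj` and `add_embedding` stand in
# for the UNet's submodules (a Timesteps projection and a small MLP).
import torch

def compute_aug_emb(add_time_proj, add_embedding, text_embeds, time_ids, dtype):
    # time_ids always go through the Fourier projection, so there is no way
    # to supply an all-zero time_embeds from the outside.
    time_embeds = add_time_proj(time_ids.flatten())
    time_embeds = time_embeds.reshape((text_embeds.shape[0], -1))
    add_embeds = torch.concat([text_embeds, time_embeds], dim=-1).to(dtype)
    return add_embedding(add_embeds)
```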
To correctly finetune the SDXL model, we need to randomly set the condition embeddings to 0 with a suitable probability. While it is easy to set `encoder_hidden_states` and `add_text_embeds` to zero embeddings, it is impossible to zero out `time_embeds` at line 849. The original SDXL uses different embedders to convert the different micro-conditions into Fourier features, and during training these Fourier features are independently and randomly set to 0. Therefore, `UNet2DConditionModel` needs to be able to independently zero out the `time_embeds` part.
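To illustrate what "independently set to 0" means here, a conceptual sketch of per-embedder unconditional-guidance dropout (not the actual sgm code; `ucg_rate` follows the naming in the linked config):

```python
# Conceptual sketch of per-embedder dropout, as done in Stability AI's
# conditioner (see the links above); not the actual code.
import torch

def maybe_zero(emb: torch.Tensor, ucg_rate: float) -> torch.Tensor:
    # For each sample in the batch, keep the embedding with probability
    # (1 - ucg_rate), otherwise replace it with zeros.
    keep = torch.bernoulli((1.0 - ucg_rate) * torch.ones(emb.shape[0], device=emb.device))
    return keep.view(-1, *([1] * (emb.dim() - 1))) * emb

# Each micro-condition is embedded and dropped *independently*:
# orig_size_emb = maybe_zero(fourier_embed(original_size), ucg_rate)
# crop_emb      = maybe_zero(fourier_embed(crop_coords), ucg_rate)
# target_emb    = maybe_zero(fourier_embed(target_size), ucg_rate)
```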
**Describe the solution you'd like**

Add the ability to set the SDXL Micro-Conditioning embeddings to 0.

**Describe alternatives you've considered**
Perhaps it is possible to allow diffusers users to pass in `time_embeds` directly, and if `time_embeds` is provided, `time_ids` would no longer be used?
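A rough sketch of what that alternative could look like (purely illustrative; the function and argument names are assumptions, not an existing diffusers API):

```python
# Illustrative sketch: accept a precomputed (possibly all-zero) time_embeds
# and skip the Fourier projection of time_ids when it is given.
import torch
from typing import Optional

def build_time_embeds(
    add_time_proj,
    batch_size: int,
    time_ids: torch.Tensor,
    time_embeds: Optional[torch.Tensor] = None,
) -> torch.Tensor:
    if time_embeds is not None:
        # Caller-supplied embedding wins, e.g. torch.zeros(batch_size, dim)
        # for zeroed micro-conditioning during training.
        return time_embeds
    time_embeds = add_time_proj(time_ids.flatten())
    return time_embeds.reshape((batch_size, -1))
```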