showlab / BoxDiff

[ICCV 2023] BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

What does argument "normalize_eot" imply? #10

zjysteven closed this issue 8 months ago

zjysteven commented 8 months ago

Hi,

I've recently been working on adapting BoxDiff to the latest diffusers library, including integration for both SD and SDXL. I came across the argument normalize_eot here: https://github.com/showlab/BoxDiff/blob/9e90000921be244468bcba4779e3c8b2c4dfb086/pipeline/sd_pipeline_boxdiff.py#L194-L198

It is set to True for SD2.1 and False for SD1.5. I'm not very familiar with the differences between the versions, so would you mind clarifying what the purpose of this argument is? Thank you in advance.

Sierkinhane commented 8 months ago

Hi, this argument affects the value of last_idx, which in turn determines which cross-attention maps are normalized (see the code below). In this repository, we did not take the cross-attention of the 'eot' token into account. However, with other versions of SD, you can choose whether or not to include it, depending on which yields better results.

attention_for_text = attention_maps[:, :, 1:last_idx]
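
To make the effect of last_idx concrete, here is a minimal NumPy sketch of the slicing step. It is an illustration under stated assumptions, not the repository's implementation: it assumes the token axis is laid out as [SOT, prompt tokens, EOT, padding], that last_idx defaults to -1, and that it becomes the index of the EOT token when normalize_eot is True. The function name slice_text_attention and the parameter num_prompt_tokens are hypothetical, and a plain softmax over the token axis stands in for the normalization.

```python
import numpy as np


def slice_text_attention(attention_maps, num_prompt_tokens, normalize_eot):
    """Slice and normalize cross-attention over text tokens (hypothetical helper).

    attention_maps: array of shape (H, W, T), where T is the padded token length.
    Assumed token layout: [SOT, prompt tokens..., EOT, padding...].
    """
    # Assumption: without normalize_eot, only the final position is dropped.
    last_idx = -1
    if normalize_eot:
        # Assumption: the EOT token sits right after the prompt tokens,
        # so slicing up to its index excludes EOT and all padding.
        last_idx = num_prompt_tokens + 1

    # Index 0 (SOT) is always skipped, as in the slice quoted above.
    attention_for_text = attention_maps[:, :, 1:last_idx]

    # Numerically stable softmax over the remaining token axis.
    shifted = attention_for_text - attention_for_text.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
maps = rng.random((16, 16, 77))  # 77 is the usual CLIP padded length

with_eot_cut = slice_text_attention(maps, num_prompt_tokens=5, normalize_eot=True)
without_cut = slice_text_attention(maps, num_prompt_tokens=5, normalize_eot=False)
print(with_eot_cut.shape, without_cut.shape)
```

Under these assumptions, the flag changes how many token positions share the normalization: only the prompt tokens when it is True, versus everything between SOT and the final position when it is False.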

zjysteven commented 8 months ago

I see. Another quick question: do you recommend using iterative refinement? The code comment says it is "not necessary", but one of the example commands in the README uses refinement. I would imagine that it slows down generation. How much improvement in image quality can it bring?

Sierkinhane commented 8 months ago

I haven't conducted quantitative experiments regarding the refinement, but based on my experience, refinement can make the image more realistic in some cases.

zjysteven commented 8 months ago

I see. Thank you for sharing the experience!