jiwoogit / StyleID

[CVPR 2024 Highlight] Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
MIT License

Questions about attention extraction during DDIM inversion process #6

Closed LanTianBy closed 3 months ago

LanTianBy commented 3 months ago

Thank you for your amazing work! I have some questions about code details, and I would greatly appreciate your answers.

  1. Taking t=50 as an example: when sampling at t=50 to obtain z_49, are the K and V taken from the t=49 step of the DDIM inversion (i.e., the step that produces z_50 from z_49)? Sorry if my description is not clear enough; the annotations in the first picture may express my doubt better, and I believe this matches the formulation in the paper. In the code, however, sampling seems to start at t=48 (the step producing z_47), and the K and V used during sampling come from the previous time step of the inversion, as shown in the second picture. Could you tell me which of the two pictures is correct? Thanks!

[image: q1]

[image: q2]

  2. I tried to reproduce the method using an unconditional diffusion model (https://github.com/openai/improved-diffusion) pre-trained on ImageNet as the backbone, but it did not produce the expected results. Of course, this may be related to my handling of question 1, but I also wonder whether the method only applies to the U-Net architecture of SD. The U-Net I use is very similar to the one in SD, and it also has 16 AttentionBlocks.

Thank you again for your wonderful work, and I wish you all the best in your future work!
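To make the timestep-alignment question concrete, here is a minimal, hypothetical sketch (not the repository's actual code; the "K/V" values are placeholder lists, not attention tensors) of caching K and V per timestep during DDIM inversion and then looking them up during sampling. The two interpretations in the pictures differ only in whether the sampling step at t reads the cache entry for the same t or for an adjacent step:

```python
def ddim_inversion(style_feats, timesteps):
    """Record placeholder per-timestep K/V of the style image during inversion."""
    kv_cache = {}
    for t in timesteps:                      # forward (noising) order, e.g. 1 .. T
        k = [f * t for f in style_feats]     # stand-in for a key projection
        v = [f + t for f in style_feats]     # stand-in for a value projection
        kv_cache[t] = (k, v)
    return kv_cache

def sample_with_injection(kv_cache, timesteps):
    """At each sampling step t, inject the K/V cached at the *same* t
    (the paper-figure interpretation)."""
    visited = []
    for t in reversed(timesteps):            # reverse (denoising) order, T .. 1
        k, v = kv_cache[t]                   # timestep-aligned lookup
        visited.append(t)
    return visited

timesteps = list(range(1, 51))               # T = 50, as in the question
cache = ddim_inversion([0.1, 0.2], timesteps)
order = sample_with_injection(cache, timesteps)
```

Under the code-figure interpretation, the lookup would instead read `kv_cache[t + 1]` (the previous inversion step); since adjacent-step features are similar, the two variants should behave almost identically.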

jiwoogit commented 3 months ago

Thank you for your interest!

  1. I haven't fully debugged the code regarding your issue yet, but my intention matches the paper figure (the first of your images). It could be a mistake on my side, but I believe the output difference between the first and second figures would be negligible, because features at adjacent time steps are similar.

  2. I haven't tested our method in the Improved DDPM setting, but I think additional attention analysis will be needed to adapt it to their architecture (e.g., selecting which layers receive "style injection"). Improved DDPM seems close to the DiffuseIT style transfer setting, so it might be more important to analyze the U-Net bottleneck or the skip connections there, as noted in DiffuseIT [A] (they considered both for style transfer). It could be worthwhile to try "style injection" in other attention layers (e.g., the attention layers near the bottleneck).

[A] Jeong, et al. "Training-free Content Injection using h-space in Diffusion Models." WACV. 2024.
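The layer-selection idea above can be sketched as a simple name filter over a model's attention layers. The layer names and prefixes below are purely illustrative (they are not the actual module names in StyleID or improved-diffusion):

```python
# Hypothetical attention-layer names for a U-Net: encoder ("down"),
# bottleneck ("mid"), and decoder ("up") blocks.
ATTN_LAYERS = [
    "down.2.attn", "down.3.attn",   # deeper encoder blocks
    "mid.attn",                     # bottleneck
    "up.0.attn", "up.1.attn",       # early decoder blocks
]

def select_injection_layers(layers, prefixes=("mid", "up.0", "up.1")):
    """Keep only the attention layers (near the bottleneck / early decoder,
    per the suggestion above) whose name starts with a chosen prefix."""
    return [name for name in layers if name.startswith(prefixes)]

chosen = select_injection_layers(ATTN_LAYERS)
```

In a real implementation one would iterate over the model's named modules and attach injection hooks only to the selected layers; which subset works best would need the attention analysis mentioned above.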

LanTianBy commented 3 months ago

Thank you very much for your prompt and patient reply. Wishing you all the best! Thank you again for your wonderful work!