Thank you for your interest in our work and for the effort to implement it in diffusers. The image cross-attention is applied between two sets of patch tokens (from the reference image and from the denoised image), so it doesn't involve tokens from the prompt. Let me simplify the whole process down to one conditioning block (omitting the mask):
All cross-attention operations are applied to all tokens. Only the mask is obtained from the single S* token.
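To make this concrete, here is a minimal PyTorch sketch of such a conditioning block under the setup described above. It is not the repository's implementation: names such as `ImageCrossAttention` and `mask_from_s_star` are hypothetical, and the per-image mean threshold is a simple stand-in for whatever thresholding (e.g. Otsu) the actual code uses.

```python
import torch
from torch import nn

class ImageCrossAttention(nn.Module):
    """One conditioning block: cross-attention from the denoised-image
    patch tokens (queries) to the reference-image patch tokens
    (keys/values). Prompt tokens are not involved; only the mask
    comes from the S* token."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, ref, ref_mask=None):
        # x:   (B, N, C) patch tokens of the denoised image
        # ref: (B, M, C) patch tokens of the reference image
        B, N, C = x.shape
        h, d = self.heads, C // self.heads
        q = self.to_q(x).view(B, N, h, d).transpose(1, 2)     # (B, h, N, d)
        k = self.to_k(ref).view(B, -1, h, d).transpose(1, 2)  # (B, h, M, d)
        v = self.to_v(ref).view(B, -1, h, d).transpose(1, 2)  # (B, h, M, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale         # (B, h, N, M)
        if ref_mask is not None:
            # ref_mask: (B, M) booleans; background reference tokens
            # are suppressed before the softmax
            attn = attn.masked_fill(~ref_mask[:, None, None, :],
                                    float("-inf"))
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.to_out(out)

def mask_from_s_star(text_attn, s_star_idx, threshold=None):
    # text_attn: (B, h, M, T) text cross-attention probabilities over the
    # reference patch tokens; keep only the column of the single S* token.
    s_map = text_attn[..., s_star_idx].mean(dim=1)  # average heads -> (B, M)
    if threshold is None:
        # per-image mean as a stand-in for a proper (e.g. Otsu) threshold
        threshold = s_map.mean(dim=-1, keepdim=True)
    return s_map > threshold                        # boolean mask, (B, M)
```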
Hope the above helps your work!
Thanks @haoosz, it does help! 🙌🏻
One more question. To implement this in diffusers without disrupting the whole codebase, I am running the UNet separately for the reference image condition, collecting the attention maps, and applying them while running the UNet for the denoised image condition plus the visual conditioning.
Do you see any issues with running the vanilla UNet separately for the reference image?
I think it is fine, because the reference image only goes through the vanilla UNet.
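For anyone following along, here is a rough sketch of that two-pass idea using diffusers attention processors. This is an assumption about how one might wire it up, not the thread authors' code: `AttnStoreProcessor` and the `store` dict are hypothetical names, and the processor is a simplified mirror of the default `AttnProcessor` (it omits details like attention-mask preparation, and the exact internals vary across diffusers versions).

```python
import torch

class AttnStoreProcessor:
    """Mirrors diffusers' default AttnProcessor, additionally recording
    this layer's attention probabilities into a shared dict."""

    def __init__(self, store: dict, name: str):
        self.store = store
        self.name = name

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None):
        context = (hidden_states if encoder_hidden_states is None
                   else encoder_hidden_states)
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))

        probs = attn.get_attention_scores(query, key, attention_mask)
        self.store[self.name] = probs.detach()  # kept for the second pass

        hidden_states = torch.bmm(probs, value)
        hidden_states = attn.batch_to_head_dim(hidden_states)
        hidden_states = attn.to_out[0](hidden_states)  # linear projection
        hidden_states = attn.to_out[1](hidden_states)  # dropout
        return hidden_states

# Pass 1: vanilla UNet on the (noised) reference latents, storing the maps.
# `unet` is a loaded UNet2DConditionModel; ref_latents, timestep, and
# prompt_embeds come from the surrounding training/inference loop.
store = {}
# unet.set_attn_processor(
#     {name: AttnStoreProcessor(store, name) for name in unet.attn_processors}
# )
# _ = unet(ref_latents, timestep, encoder_hidden_states=prompt_embeds)
#
# Pass 2: restore the conditioned processors and run on the denoised-image
# latents, consuming `store` inside the image cross-attention blocks.
```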
@haoosz Thank you for the amazing work and the open-source code. I have been working to implement it on huggingface/diffusers. I believe the architecture is in place, but even with regularization and masking my models don't converge in terms of loss, and the results are overfitted and distorted with respect to the subject appearance.
I have gone through the code and the paper several times, and there is one question I can't answer.
Is the image cross-attention applied to all tokens in the prompt, or is it computed only for the S* token (therefore using vanilla attention maps for the other tokens)?