Hi!
I am working on migrating this great work to SDXL, starting with AFA. However, I found that neither the direct concatenation approach nor AFA works.
I use CLIP ViT-bigG (projected to 1280) and CLIP ViT-Large (projected to 768) for the image embedding and concatenate them, so that 768 + 1280 = 2048 matches the text embedding size. That concatenated vector is treated as a single token, and I pass it to the attention processors. Inside the attention, the image embedding is repeated 77 times to match the text features.
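For clarity, here is a minimal sketch of the setup I describe above (names and shapes are illustrative assumptions, not code from this repo):

```python
import torch

# Image embeddings from the two CLIP models used by SDXL,
# already projected to 768 (ViT-L) and 1280 (ViT-bigG) respectively.
image_embeds_vit_l = torch.randn(1, 768)       # CLIP ViT-Large, projected to 768
image_embeds_vit_big_g = torch.randn(1, 1280)  # CLIP ViT-bigG, projected to 1280

# Concatenate to match SDXL's text embedding width: 768 + 1280 = 2048.
image_token = torch.cat([image_embeds_vit_l, image_embeds_vit_big_g], dim=-1)  # (1, 2048)
image_token = image_token.unsqueeze(1)  # (1, 1, 2048): treated as a single "token"

# Inside the custom attention processor, the single image token is repeated
# to 77 positions so its shape matches the text encoder_hidden_states.
def expand_to_text_length(token: torch.Tensor, text_len: int = 77) -> torch.Tensor:
    return token.repeat(1, text_len, 1)  # (1, 77, 2048)

image_hidden_states = expand_to_text_length(image_token)
print(image_hidden_states.shape)  # torch.Size([1, 77, 2048])
```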
However, even with the direct concatenation method I cannot get it to work. Any suggestions?