Hi!
I am working on migrating this great work to SDXL, starting with AFA. However, I found that neither the direct concatenation approach nor AFA works.
I use CLIP ViT-bigG (projected to 1280) and CLIP ViT-Large (projected to 768) for the image embedding and concatenate them, so that 768 + 1280 = 2048 matches the text embedding size. That concatenated vector is treated as a single token, and I pass it to the attention processors. Inside the attention, the image embedding is repeated 77 times to match the text features.
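For clarity, here is a minimal sketch of the setup I describe above (names and shapes are illustrative assumptions, not code from this repo):

```python
import torch

# Image embeddings from the two CLIP models used by SDXL,
# already projected to 768 (ViT-L) and 1280 (ViT-bigG) respectively.
image_embeds_vit_l = torch.randn(1, 768)       # CLIP ViT-Large, projected to 768
image_embeds_vit_big_g = torch.randn(1, 1280)  # CLIP ViT-bigG, projected to 1280

# Concatenate to match SDXL's text embedding width: 768 + 1280 = 2048.
image_token = torch.cat([image_embeds_vit_l, image_embeds_vit_big_g], dim=-1)  # (1, 2048)
image_token = image_token.unsqueeze(1)  # (1, 1, 2048): treated as a single "token"

# Inside the custom attention processor, the single image token is repeated
# to 77 positions so its shape matches the text encoder_hidden_states.
def expand_to_text_length(token: torch.Tensor, text_len: int = 77) -> torch.Tensor:
    return token.repeat(1, text_len, 1)  # (1, 77, 2048)

image_hidden_states = expand_to_text_length(image_token)
print(image_hidden_states.shape)  # torch.Size([1, 77, 2048])
```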
However, even with the direct concatenation method I cannot get it to work. Any suggestions?