Open heiheizwplus opened 3 weeks ago
Thanks for the extremely well-written issue. You seem to already have a handle on how this could be fixed. Would you maybe like to take a stab at opening a PR with the fix?
Cc: @yiyixuxu @asomoza
Gentle ping to keep the activity going
Describe the bug
The `prepare_ip_adapter_image_embeds` function has a bug that results in unintended feature mixing across images during batch processing. This issue causes the generated images to combine features from multiple reference images, instead of maintaining a one-to-one correspondence with each reference.

When using the pipeline in batch mode, I use `ip_adapter_image_embeds` with a shape of `(2*B, N, C)` and set `num_images_per_prompt=1`. I expect the pipeline to generate `B` images, where each generated image should correspond directly to each reference in `ip_adapter_image_embeds` (note that `2*B` includes the negative image embedding for classifier-free guidance).

https://github.com/huggingface/diffusers/blob/31058cdaef63ca660a1a045281d156239fba8192/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L950-L957
https://github.com/huggingface/diffusers/blob/9a92b8177cb3f8bf4b095fff55da3b45a3607960/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L561-L569
However, when processing `ip_adapter_image_embeds` in the pipeline, the tensor gets duplicated `num_images_per_prompt * batch_size = 1 * B` times. This leads to the `image_embeds` tensor having a shape of `(B*2*B, N, C)` instead of the expected shape of `(2*B, N, C)`.

In the `IPAdapterAttnProcessor2_0` class, the view operation is applied to the input image_embeds tensor. This prevents a shape mismatch error, but it leads to `ip_key` and `ip_value` containing mixed features from multiple reference images. As a result, the features of the generated images are a mixture of several reference images instead of having a one-to-one correspondence.

https://github.com/huggingface/diffusers/blob/9a92b8177cb3f8bf4b095fff55da3b45a3607960/src/diffusers/models/attention_processor.py#L4112-L4122
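To make the mixing concrete, here is a minimal sketch of the row bookkeeping for `B = 2`. It is not the diffusers code: it uses plain Python lists in place of tensors so that it runs without PyTorch, and each label stands for one whole `(N, C)` embedding block. The helper names (`repeat_rows`, `prepare_embeds`, `view_per_sample`) are hypothetical and only approximate the repeat-then-concatenate step and the later `view` as described above.

```python
def repeat_rows(rows, times):
    """Stand-in for torch.cat([t] * times, dim=0) on per-reference blocks."""
    return rows * times


def prepare_embeds(neg_rows, pos_rows, repeat):
    """Stand-in for the CFG path: repeat each half, then stack negative + positive."""
    return repeat_rows(neg_rows, repeat) + repeat_rows(pos_rows, repeat)


def view_per_sample(rows, batch):
    """Stand-in for tensor.view(batch, -1, ...): split into `batch` contiguous groups."""
    assert len(rows) % batch == 0, "row count must be divisible by the batch size"
    step = len(rows) // batch
    return [rows[i * step:(i + 1) * step] for i in range(batch)]


B = 2                      # number of reference images (assumed for illustration)
num_images_per_prompt = 1
neg = ["neg0", "neg1"]     # negative halves of the (2*B, N, C) input
pos = ["ref0", "ref1"]     # positive halves of the (2*B, N, C) input

# Reported behaviour: repeat factor is num_images_per_prompt * batch_size.
buggy = prepare_embeds(neg, pos, num_images_per_prompt * B)
buggy_groups = view_per_sample(buggy, 2 * B)

# Workaround from this issue: repeat factor is num_images_per_prompt only.
fixed = prepare_embeds(neg, pos, num_images_per_prompt)
fixed_groups = view_per_sample(fixed, 2 * B)

print(len(buggy), buggy_groups)
print(len(fixed), fixed_groups)
```

With the larger repeat factor there are `2*B*B = 8` blocks, and every positive sample ends up attending to both `ref0` and `ref1`. With the smaller factor each sample keeps exactly one reference. The sketch only covers the case where the caller already supplies one embedding per batch item. A caller who passes a single embedding for a batch of prompts would still need some repetition, so the workaround may not be safe in that scenario.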
**Although I temporarily resolved the issue by changing the `num_images_per_prompt * batch_size` parameter passed to the `prepare_ip_adapter_image_embeds` method to `num_images_per_prompt`, could this potentially cause issues in other scenarios?**

Reproduction
Here’s a demo script that illustrates the issue. The script loads two reference images (image1 and image2), extracts their embeddings, and uses them as input to the pipeline in batch mode.
Reference Images
Generated Images in Batch Mode
Expected Behavior
Logs
No response
System Info
Who can help?
@asomoza