More thorough guidance for multiple IP adapter images/masks and a single IP Adapter

Describe the bug

I'm trying to use a single IP adapter with multiple IP adapter images and masks. This section of the docs gives an example of how I could do that: https://huggingface.co/docs/diffusers/v0.29.0/en/using-diffusers/ip_adapter#ip-adapter-masking

The docs provide the following code:

from diffusers.image_processor import IPAdapterMaskProcessor

mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")
mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png")

output_height = 1024
output_width = 1024

processor = IPAdapterMaskProcessor()
masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width)

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"])
pipeline.set_ip_adapter_scale([[0.7, 0.7]])  # one scale for each image-mask pair

face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png")

ip_images = [[face_image1, face_image2]]

masks = [masks.reshape(1, masks.shape[0], masks.shape[2], masks.shape[3])]

generator = torch.Generator(device="cpu").manual_seed(0)
num_images = 1

image = pipeline(
    prompt="2 girls",
    ip_adapter_image=ip_images,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20,
    num_images_per_prompt=num_images,
    generator=generator,
    cross_attention_kwargs={"ip_adapter_masks": masks}
).images[0]

One important point that should be highlighted is that images/scales/masks must be lists of lists , otherwise we get the following error: Cannot assign 2 scale_configs to 1 IP-Adapter.

That error message is intuitive enough, however this gets confusing in other sections of the documentation, such as the set_ip_adapter_scale() function:

# To use original IP-Adapter
scale = 1.0
pipeline.set_ip_adapter_scale(scale)

# To use style block only
scale = {
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

# To use style+layout blocks
scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

# To use style and layout from 2 reference images
scales = [{"down": {"block_2": [0.0, 1.0]}}, {"up": {"block_0": [0.0, 1.0, 0.0]}}]
pipeline.set_ip_adapter_scale(scales)

Is it possible to use the style and layout from 2 reference images with a single IP Adapter? I tried doing something like the following, which builds on the knowledge of needing to use a list of lists:

# List of lists to support multiple images/scales/masks with a single IP Adapter
scales = [[{"down": {"block_2": [0.0, 1.0]}}, {"up": {"block_0": [0.0, 1.0, 0.0]}}]]
pipeline.set_ip_adapter_scale(scales)

# OR

# Use layout and style from InstantStyle for one image, but also use a numerical scale value for the other
scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale([[0.5, scale]])

but I get the following error:

TypeError: unsupported operand type(s) for *: 'dict' and 'Tensor'\n
At:
 /usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py(2725): __call__
/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py(549): forward
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl\n  /usr/local/lib/python3.10/dist-packages/diffusers/models/attention.py(366): forward\n  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl\n  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl\n  /usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/transformer_2d.py(440): forward\n  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl\n  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl\n  /usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_2d_blocks.py(1288): forward\n  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl\n  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl\n  /usr/local/lib/python3.10/dist-packages/diffusers/models/unets/unet_2d_condition.py(1220): forward\n  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1527): _call_impl\n  /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py(1518): _wrapped_call_impl\n  /usr/local/lib/python3.10/dist-packages/diffusers/pipelines/controlnet/pipeline_controlnet_sd_xl.py(1510): __call__\n  /usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py(115): decorate_context

Reproduction

Load single IP Adapter into pipeline
Use two IP adapter images, two masks, two scales
Try to use InstantStyle config to set IP Adapter scale

from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
import torch
import PIL

# Subject/Foreground Style/Mask
subject_style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg")
subject_mask = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")

# Background Style/Mask
background_style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
background_mask = PIL.ImageOps.invert(subject_mask)

# Load pipeline + IP Adapter
pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
generator = torch.Generator(device="cpu").manual_seed(26)

# Structure of subject, style of background
layout = {"down": {"block_2": [0.0, 1.0]}}
style = {"up": {"block_0": [0.0, 1.0, 0.0]}}
pipeline.set_ip_adapter_scale([[layout, style]])

# Preprocess mask images
processor = IPAdapterMaskProcessor()
ip_adapter_masks = processor.preprocess([subject_mask, background_mask]).cuda() # Might need to set width/height here
ip_adapter_masks = [
    ip_adapter_masks.reshape(
        1, ip_adapter_masks.shape[0], ip_adapter_masks.shape[2], ip_adapter_masks.shape[3]
    )
]

ip_adapter_images = [[subject_style_image, background_style_image]]

image = pipeline(
    prompt="a cat, masterpiece, best quality, high quality",
    ip_adapter_image=ip_adapter_images,
    negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
    guidance_scale=5,
    num_inference_steps=30,
    generator=generator,
    cross_attention_kwargs={"ip_adapter_masks": ip_adapter_masks}
).images[0]

Logs

No response

System Info

diffusers version: 0.27.2
Platform: Linux-6.5.0-1020-gcp-x86_64-with-glibc2.35
Python version: 3.10.1
PyTorch version (GPU?): 2.1.2+cu121 (True)
Huggingface_hub version: 0.21.1
Transformers version: 4.39.2
Accelerate version: 0.28.0
xFormers version: 0.0.23.post1
Using GPU in script?: yes
Using distributed or parallel set-up in script?: no

Who can help?

@sayakpaul @yiyixuxu @sayakpaul

Cc: @fabiorigano @asomoza

Is it possible to use the style and layout from 2 reference images with a single IP Adapter?

If you want to use the style of one image and the layout from the other one, you'll need to load two IP Adapters, if you pass multiple images to just one IP Adapter it will grab the features of both of them combined.

You shouldn't be able to pass a list of scales to a single IP Adapter so we're missing a check there I think.

I think there is an issue with scale function. The docs show this syntax in context of using two masks: pipeline.set_ip_adapter_scale([[0.7, 0.7]])

however as @chrismaltais notes above, and I also got the same error, if we do this: pipeline.set_ip_adapter_scale([[layout, style]])

we get the error. TypeError: unsupported operand type(s) for *: 'dict' and 'Tensor'\n

So the block specification is not allowed but scalar values are allowed?

oh yeah, you're right, the dict (block) scaling was added later with InstantStyle and this affects the IP Adapter attention layers, the list of scale values (float) was added to be able to set a different scale for each image.

I can see why this gets confusing really fast, so maybe we need to improve the docs?

You can't use one IP Adapter for two images where you want to use one as style and the second as a layout.
If you use a dict with the blocks or a list of dicts you're using InstanStyle
If you pass a list of floats you're using scales for each IP Adapter
If you pass a list of lists of floats, you need to pass multiple images and you're setting the scale for each image.
You can't pass a list of a list of dicts because what you're trying to do here is to set the scale of the attention layers for each image.

I think this is correct but can you confirm please @fabiorigano

cc: @stevhliu

Isnt that a code bug though, that scalars are possible but not a instant style specification in a nested list? From what I understood, the block specification in the default case == 1 as scalar config, but also permits finer grained spec. It does look like a instantstyle parsing issue.

Hello,

Would be great to get guidance on how to use IP adapter masks. I am getting some unpredictable results with IP Adapter. The output is sometimes just 1 person with both identities sort of blended together. Please advise if I'm doing something incorrect

Thanks in advance.

Input Images:

ip_mask_girl2 ip_mask_girl1

Result:

result

Code:

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").to(dtype=torch.float16)
image_encoder = CLIPVisionModelWithProjection.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K").to(dtype=torch.float16)
lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5"
pipeline = StableDiffusionPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    torch_dtype=torch.float16,
    vae=vae,
    image_encoder=image_encoder,
    safety_checker=None,
).to("cuda")
pipeline.load_lora_weights(lcm_lora_id)
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name=["ip-adapter-plus-face_sd15.bin"], image_encoder_folder=None)
pipeline.set_ip_adapter_scale([[0.9, 0.9]])

# Load and preprocess masks
mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")
mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png")

output_height = 512
output_width = 512

processor = IPAdapterMaskProcessor()
masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width)
masks = [masks.reshape(1, masks.shape[0], masks.shape[2], masks.shape[3])]

# face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
# face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png")

# these are same as above but resized to 512x512
face_image1 = load_image("/content/ip_mask_girl1.png")
face_image2 = load_image("/content/ip_mask_girl2.png")

ip_images = [[face_image1, face_image2]]

# Set generator
generator = torch.Generator(device="cpu").manual_seed(1480)
prompts = ["2 girls"]
negative_prompt="(deformed iris, deformed pupils, semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime:1.4), black and white, text, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"

# Run pipeline
# Run pipeline
images = pipeline(
    prompt=prompts,
    ip_adapter_image=ip_images,
    negative_prompt=[negative_prompt],
    num_inference_steps=10, num_images_per_prompt=3,
    generator=generator,
    cross_attention_kwargs={"ip_adapter_masks": masks},
    strength=0.45,
    width=512,
    height=512,
    guidance_scale=2.0,
).images

Hi, you're using scales that are to high, at most you should use 0.7, but ideally with 0.5, the higher the scale the more probability you'll get a one person blended.

The ones in the doc are examples and you should use something better to your use case, the example is with SDXL which has a higher resolution and IMO it understand better the input from IP Adapters also the masks are more precise with a 1024x1024 resolution.

Other issues I found in your code if you're interested:

The prompt is too simple and with "2girls" in a realistic model you're not giving it too much to work with.
You're using the same mask, the same prompt, higher scales and using "non-realistic images" with a model that was only trained with realistic images.
Your negative prompt has weighting too like (anime, ...., :1.4) which is just noise with the default diffusers.
On top of all, you're using LCM with a low guidance and fewer steps, so you're giving the model even less space to work with.

I think the results you got are really good if we take all this into account and I'm kind of surprised that you got them.

hi @asomoza

Thanks you for the detailed feedback. I will incorporate your suggestions going forward. To give details on some of the points:

I did have a more verbose prompt with realistic images but did not share those to preserve privacy of subjects involved and tried to reproduce the issue with the documentation examples for this report. Even then, I had to try a few times to get this result (blended identity), it was fine for initial few tests.

For my use-case I decided to use openpose controlnet for both subjects, so far I did not see this problem when I clearly segregate the subjects with controlnet.

One question on scale: does higher/lower scale impact the likeness of the result to input images?

Thanks again for taking the time to provide this feedback! :)

yeah, using controlnet really helps with this, I can even generate a group of people with each one having different characteristics or even styles.

does higher/lower scale impact the likeness of the result to input images?

Yes, the scale affects the likeness, but it all depends on the type of IP Adapter and the image. The plus ip adapters are a lot stronger so you'll need to lower the scale, and for the faces, if you're going to use a plus face IP Adapter, you can also use a separate mask for each face and can give a higher scale to each one to improve the likeness.

So I recommend using controlnet, I like a lot better to use something like mistoline with the contour of the people, a plus IP Adapter with masks for each person with lower scale and a Face IP Adapter with face masks for each one with higher scales.

huggingface / diffusers