tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images from an image prompt.
Apache License 2.0

batch of images with different weights #87

Open cubiq opened 1 year ago

cubiq commented 1 year ago

when sending a batch of images for the conditioning, how would you go about giving a different weight to each of them?

xiaohu2015 commented 1 year ago

@cubiq hi, can you describe how you implemented multiple images?

cubiq commented 1 year ago

you can send a batch of tensors with shape (4, 224, 224, 3) for 4 images (you can just stack them).

this is an example with 4 images:

[image: result generated from a 4-image batch]
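
If you want to reproduce that outside of a node graph, here's a minimal sketch of building the (4, 224, 224, 3) batch (file names are placeholders; ComfyUI keeps images channels-last as floats in [0, 1]):

import numpy as np
import torch
from PIL import Image

paths = ["img1.png", "img2.png", "img3.png", "img4.png"]  # placeholder file names

tensors = [
    torch.from_numpy(
        np.array(Image.open(p).convert("RGB").resize((224, 224)), dtype=np.float32) / 255.0
    )
    for p in paths
]
batch = torch.stack(tensors)
print(batch.shape)  # torch.Size([4, 224, 224, 3])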

xiaohu2015 commented 1 year ago

@cubiq is this achieved by concatenating the image features of the 4 images (16 x 4 = 64 tokens)? I saw another blog post, https://civitai.com/articles/2345, and I'm not sure whether your implementation is the same as that.

cubiq commented 1 year ago

yes exactly.

the blog you linked uses more or less the same code as mine.
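
To make the 16 x 4 = 64 token arithmetic concrete, here's a toy sketch with dummy tensors (the embedding dim of 768 is only an assumption for illustration):

import torch

# dummy per-image embeddings: 16 tokens each
embeds = [torch.randn(1, 16, 768) for _ in range(4)]

# concatenate along the token dimension: 16 x 4 = 64 tokens total
merged = torch.cat(embeds, dim=1)
print(merged.shape)  # torch.Size([1, 64, 768])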

jadechip commented 1 year ago

Could someone please provide an example of how to use multiple images outside of Comfy?

jadechip commented 1 year ago

Would something like this work perhaps? Is my understanding correct?

import torch
from PIL import Image
from torchvision import transforms

image_paths = ["img1", "img2", "img3", "img4"]
image_tensors = []

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

for path in image_paths:
    image = Image.open(path).convert("RGB")  # force 3 channels
    image_tensor = transform(image)
    image_tensors.append(image_tensor)

image_batch = torch.stack(image_tensors)
print(image_batch.shape)  # Should print torch.Size([4, 3, 224, 224])

images = ip_model.generate(pil_images=image_batch, num_samples=num_samples, num_inference_steps=30, seed=42)

cubiq commented 1 year ago

yes, stacking should work

jadechip commented 1 year ago

Hmm, since the get_image_embeds method is designed to work with PIL images, would something like this suffice? pil_images = [Image.open(path).resize((224, 224)) for path in image_paths]

jadechip commented 1 year ago

btw looks like Comfy is using this node type for image batching: https://github.com/comfyanonymous/ComfyUI/blob/213976f8c3ea3f45f0c692dd8aac2fd9fea433e3/nodes.py#L1490

import torch
import comfy.utils

class ImageBatch:

    @classmethod
    def INPUT_TYPES(s):
        return {"required": {"image1": ("IMAGE",), "image2": ("IMAGE",)}}

    RETURN_TYPES = ("IMAGE",)
    FUNCTION = "batch"

    CATEGORY = "image"

    def batch(self, image1, image2):
        # if the spatial dims differ, resize image2 to match image1
        # (ComfyUI images are channels-last, so movedim shuffles to NCHW and back)
        if image1.shape[1:] != image2.shape[1:]:
            image2 = comfy.utils.common_upscale(image2.movedim(-1, 1), image1.shape[2], image1.shape[1], "bilinear", "center").movedim(1, -1)
        # concatenate along the batch dimension
        s = torch.cat((image1, image2), dim=0)
        return (s,)

jadechip commented 1 year ago

So far my results are terrible 🥲

[image: sample results]

cubiq commented 1 year ago

that looks like you are sending the conditioning images to the unconditional branch as well
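
For context: in classifier-free guidance the unconditional branch should not see the image prompt. Here's a rough sketch of the intended split, with proj_model standing in for the repo's image projection module (hypothetical name):

import torch

def split_cond_uncond(clip_image_embeds, proj_model):
    # conditional branch: real image features
    cond = proj_model(clip_image_embeds)
    # unconditional branch: a zeroed input, so the image prompt does not
    # leak into the negative side of classifier-free guidance
    uncond = proj_model(torch.zeros_like(clip_image_embeds))
    return cond, uncond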

hamzaakyildiz commented 1 year ago

when I try this method, each output image is generated from a distinct image of the batch. However, as I understand it, the output images should be a synthesis of all the images in the batch. Which is it? If it's the latter, how should I pass the images to the IP-Adapter generate function?

cubiq commented 1 year ago

please note that I work mainly with ComfyUI, so I wasn't aware of the Diffusers situation.

I had a quick look at the code and it seems that only the first image is prompted, no matter how many images are sent. The image encoder works as expected and correctly encodes all the images, so the extra embeddings must be getting truncated somewhere down the line.

I'll be working on a Diffusers project soon, so I might look into this
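
If you want to verify the encoder side yourself, something like the following should work with the IPAdapter class from this repo; treat the get_image_embeds call as a sketch in case the signature differs in your version:

from PIL import Image

# placeholder paths; ip_model is an IPAdapter instance as in the snippets above
pil_images = [Image.open(p).convert("RGB") for p in ["img1.png", "img2.png"]]

image_embeds, uncond_embeds = ip_model.get_image_embeds(pil_image=pil_images)
print(image_embeds.shape)  # the first dim should equal the number of input images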

jadechip commented 1 year ago

Thanks @cubiq! If you have a github sponsor or kofi link I would be happy to help support development of this feature!

cubiq commented 1 year ago

I was able to implement image weighting in ComfyUI @xiaohu2015

In the image below you can see two different results using the same 2 images with different weights

[image: two results generated from the same 2 images with different weights]

as always the code here https://github.com/cubiq/ComfyUI_IPAdapter_plus
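
The exact weighting code is in the repo above, but the simplest form of the idea is to scale each image's tokens before concatenating them; a hedged sketch:

import torch

def weighted_merge(per_image_embeds, weights):
    # per_image_embeds: list of (1, tokens, dim) tensors, one per image
    # scaling each image's tokens changes how strongly that image
    # conditions the cross-attention layers
    scaled = [e * w for e, w in zip(per_image_embeds, weights)]
    return torch.cat(scaled, dim=1)

# e.g. favor the first image over the second
merged = weighted_merge([torch.randn(1, 16, 768), torch.randn(1, 16, 768)], [1.0, 0.5])
print(merged.shape)  # torch.Size([1, 32, 768])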

Regarding Diffusers, things are a bit more complicated: the current implementation offered by tencent-ailab is a bit too "rigid" and would require some refactoring... a more flexible approach would be for the library to only export the embeds and let the user integrate them into any pipeline (given the right combination of encoder / IP-Adapter model / main checkpoint). Alternatively, the official Diffusers API should be followed a bit more closely. I'll look into it in the coming days

cubiq commented 1 year ago

Okay good news, I was able to replicate all the comfyui features in diffusers.

On the left is the Diffusers image which, as you can see, is very close to the image on the right, generated with the same 2 images in ComfyUI (the difference in sharpness is caused by the different sampling algorithms used for the image encoder).

[image: Diffusers result (left) vs ComfyUI result (right)]

It's just a matter of merging the image embeds and increasing the number of tokens in the attention processor. I'll do some code cleanup and post the code somewhere!
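
Roughly, the merge step looks like the sketch below (dims are illustrative; the num_tokens hint refers to the IPAttnProcessor class in this repo):

import torch

num_images = 2
tokens_per_image = 16  # IP-Adapter Plus uses 16 tokens per image

# merge the per-image embeds along the token axis...
embeds = [torch.randn(1, tokens_per_image, 768) for _ in range(num_images)]  # 768 dim is illustrative
merged = torch.cat(embeds, dim=1)  # (1, 32, 768)

# ...and the attention processors then need to expect the larger image prompt,
# e.g. something like IPAttnProcessor(..., num_tokens=num_images * tokens_per_image)
total_tokens = num_images * tokens_per_image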

jadechip commented 1 year ago

That is huge 💪 Can't wait to test it out!

cubiq commented 1 year ago

More info about the new code here #99

You can find the code and examples in my repo https://github.com/cubiq/Diffusers_IPAdapter

It's still a bit experimental, but should be enough to get you started. Have fun!