tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
Apache License 2.0

Alternative implementation in Refiners #92

Open deltheil opened 9 months ago

deltheil commented 9 months ago

We are building Refiners, an open source, PyTorch-based framework made to easily train and run adapters on top of foundational models. Just wanted to let you know that IP-Adapter is now fully supported in Refiners! (congrats on the great work, by the way!!)

E.g. an equivalent to the "IP-Adapter with fine-grained features" demo would look like this:

  1. Follow these install steps
  2. Run the code snippet below, which gives:

[output image]

import torch
from PIL import Image

from refiners.foundationals.latent_diffusion import StableDiffusion_1, SD1IPAdapter
from refiners.foundationals.latent_diffusion.schedulers import DDIM
from refiners.fluxion.utils import load_from_safetensors, manual_seed

device = "cuda"
image = Image.open("statue.png")

ddim_scheduler = DDIM(num_inference_steps=50)
sd15 = StableDiffusion_1(scheduler=ddim_scheduler, device=device, dtype=torch.float16)
sd15.clip_text_encoder.load_from_safetensors("clip_text.safetensors")
sd15.lda.load_from_safetensors("lda.safetensors")
sd15.unet.load_from_safetensors("unet.safetensors")

with torch.no_grad():
    prompt = "best quality, high quality, wearing a hat on the beach"
    negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"

    ip_adapter = SD1IPAdapter(
        target=sd15.unet,
        weights=load_from_safetensors("ip-adapter-plus_sd15.safetensors"),
        fine_grained=True,
        scale=0.6,
    )
    ip_adapter.clip_image_encoder.load_from_safetensors("clip_image.safetensors")
    ip_adapter.inject()

    clip_text_embedding = sd15.compute_clip_text_embedding(text=prompt, negative_text=negative_prompt)
    clip_image_embedding = ip_adapter.compute_clip_image_embedding(ip_adapter.preprocess_image(image))

    # Classifier-free guidance: split each embedding into its negative and
    # conditional halves, then recombine text and image tokens per branch
    negative_text_embedding, conditional_text_embedding = clip_text_embedding.chunk(2)
    negative_image_embedding, conditional_image_embedding = clip_image_embedding.chunk(2)

    clip_text_embedding = torch.cat(
        (
            torch.cat([negative_text_embedding, negative_image_embedding], dim=1),
            torch.cat([conditional_text_embedding, conditional_image_embedding], dim=1),
        )
    )

    manual_seed(42)  # fixed seed for reproducibility
    x = torch.randn(1, 4, 64, 64, device=device, dtype=torch.float16)  # 64x64 latents -> 512x512 image

    # Denoising loop; condition_scale is the classifier-free guidance scale
    for step in sd15.steps:
        x = sd15(
            x,
            step=step,
            clip_text_embedding=clip_text_embedding,
            condition_scale=7.5,
        )
    predicted_image = sd15.lda.decode_latents(x)

predicted_image.save("output.png")
print("done: see output.png")
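The chunk/cat step in the middle of the snippet simply interleaves the text and image embeddings for classifier-free guidance: each embedding stacks [negative, conditional] along the batch axis, and text and image tokens are joined along the sequence axis. A minimal sketch with dummy tensors (the 77/16 token counts and 768-dim size are illustrative assumptions, not the adapter's exact values):

```python
import torch

# Dummy embeddings: batch axis = [negative, conditional]
clip_text_embedding = torch.randn(2, 77, 768)   # text tokens
clip_image_embedding = torch.randn(2, 16, 768)  # image tokens

# Split each into its negative and conditional halves
negative_text, conditional_text = clip_text_embedding.chunk(2)
negative_image, conditional_image = clip_image_embedding.chunk(2)

# Join text + image tokens along the sequence axis (dim=1), then
# restack the two guidance branches along the batch axis (dim=0)
combined = torch.cat(
    (
        torch.cat([negative_text, negative_image], dim=1),
        torch.cat([conditional_text, conditional_image], dim=1),
    )
)
print(tuple(combined.shape))  # (2, 93, 768)
```

The result keeps the negative branch in row 0 and the conditional branch in row 1, so the UNet sees text-then-image tokens for each branch.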

Note: other IP-Adapter variants are also supported (SDXL, with or without fine-grained features).

A few more things:

Feedback welcome!

deltheil commented 7 months ago

FYI, we have also written a blog post with additional details about this implementation -> https://blog.finegrain.ai/posts/supercharge-stable-diffusion-ip-adapter/