huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

IP-Adapter FaceID Plus: how-to questions #7766

Open Honey-666 opened 3 months ago

Honey-666 commented 3 months ago

https://github.com/huggingface/diffusers/blob/9ef43f38d43217f690e222a4ce0239c6a24af981/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L492

error msg:

pipe.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
AttributeError: 'list' object has no attribute 'to'

hi! I'm having some problems using IP-Adapter FaceID Plus. Can you help me answer these questions? Thank you very much.

  1. First question: what should I pass as the ip_adapter_image parameter of the prepare_ip_adapter_image_embeds function?
  2. Second question: the code below from the merge link does not match the example in the ip_adapter.md file. What problem does this cause? This is the merge link: https://github.com/huggingface/diffusers/pull/7186#issuecomment-1986961595 The differing code:
      ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
      neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
      id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda"))

    @yiyixuxu @fabiorigano

Environment:

diffusers==0.28.0.dev0

this is my code:

# @FileName:StableDiffusionIpAdapterFaceIDTest.py
# @Description:
# @Author:dyh
# @Time:2024/4/24 11:45
# @Website:www.xxx.com
# @Version:V1.0
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from insightface.app import FaceAnalysis
from transformers import CLIPVisionModelWithProjection

model_path = '../../../aidazuo/models/Stable-diffusion/stable-diffusion-v1-5'
clip_path = '../../../aidazuo/models/CLIP-ViT-H-14-laion2B-s32B-b79K'
ip_adapter_path = '../../../aidazuo/models/IP-Adapter-FaceID'
ip_img_path = '../../../aidazuo/jupyter-script/test-img/vermeer.png'

def extract_face_features(image_lst: list, input_size: tuple):
    # Extract Face features using insightface
    ref_images = []
    app = FaceAnalysis(name="buffalo_l",
                       root=ip_adapter_path,
                       providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

    app.prepare(ctx_id=0, det_size=input_size)
    for img in image_lst:
        image = cv2.cvtColor(np.asarray(img), cv2.COLOR_BGR2RGB)
        faces = app.get(image)
        image = torch.from_numpy(faces[0].normed_embedding)
        ref_images.append(image.unsqueeze(0))
    ref_images = torch.cat(ref_images, dim=0)

    return ref_images

ip_adapter_img = Image.open(ip_img_path)

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    clip_path,
    torch_dtype=torch.float16,
    use_safetensors=True
)

pipe = StableDiffusionPipeline.from_pretrained(
    model_path,
    variant="fp16",
    safety_checker=None,
    image_encoder=image_encoder,
    torch_dtype=torch.float16).to("cuda")

adapter_file_lst = ["ip-adapter-faceid-plus_sd15.bin"]
adapter_weight_lst = [0.5]

pipe.load_ip_adapter(ip_adapter_path, subfolder=None, weight_name=adapter_file_lst)
pipe.set_ip_adapter_scale(adapter_weight_lst)

face_id_embeds = extract_face_features([ip_adapter_img], ip_adapter_img.size)

clip_embeds = pipe.prepare_ip_adapter_image_embeds(ip_adapter_image=[ip_adapter_img],
                                                   ip_adapter_image_embeds=None,
                                                   device='cuda',
                                                   num_images_per_prompt=1,
                                                   do_classifier_free_guidance=True)

pipe.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipe.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False  # True if Plus v2

generator = torch.manual_seed(33)
images = pipe(
    prompt='a beautiful girl',
    ip_adapter_image_embeds=clip_embeds,
    negative_prompt="",
    num_inference_steps=30,
    num_images_per_prompt=1,
    generator=generator,
    width=512,
    height=512).images

print(images)
fabiorigano commented 3 months ago

hi,

  1. please refer to the documentation; here you have the link to the face models. can you try the following code?

    clip_embeds = pipeline.prepare_ip_adapter_image_embeds(
                [ip_adapter_images], None, torch.device("cuda"), num_images, True)[0]
  2. if you use CFG (classifier-free guidance), you must provide both neg_ref_images_embeds and ref_images_embeds. In the original implementation this is the default behaviour (see the sketch below)
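
A minimal sketch of that logic (assumptions: CFG is treated as active when guidance_scale > 1, as in the standard Stable Diffusion pipelines, and a random tensor stands in for the insightface face embedding):

import torch

guidance_scale = 7.5
do_classifier_free_guidance = guidance_scale > 1  # CFG is on for the usual default of 7.5

# placeholder for faces[0].normed_embedding from insightface (a 512-dim vector)
face_embedding = torch.randn(512)

ref_images_embeds = torch.stack([face_embedding.unsqueeze(0)], dim=0).unsqueeze(0)  # shape [1, 1, 1, 512]
if do_classifier_free_guidance:
    # the negative ("unconditional") embeddings are just zeros, concatenated in front
    neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
    id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16)
else:
    # without CFG only the positive reference embeddings are needed
    id_embeds = ref_images_embeds.to(dtype=torch.float16)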

Honey-666 commented 2 months ago

hi,

  1. please refer to documentation, here you have the link to the face models. can you try the following code?
clip_embeds = pipeline.prepare_ip_adapter_image_embeds(
                [ip_adapter_images], None, torch.device("cuda"), num_images, True)[0]
  1. if you use CFG (classifier-free guidance), you must provide both neg_ref_images_embeds and ref_images_embeds. in the original implementation this is the default behaviour

1. OK! I successfully passed the test demo, but the example seems to have an extra parenthesis in this line of code: id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda"))

And when I modified this test code for the Plus version, it reported the following error:

  File "C:\work\pythonProject\demo01\venv\lib\site-packages\torch\nn\modules\conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [512]

This is my revised code:

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, DDIMScheduler
from insightface.app import FaceAnalysis
from transformers import CLIPVisionModelWithProjection

model_path = '../../../aidazuo/models/Stable-diffusion/stable-diffusion-v1-5'
clip_path = '../../../aidazuo/models/CLIP-ViT-H-14-laion2B-s32B-b79K'
ip_adapter_path = '../../../aidazuo/models/IP-Adapter-FaceID'
ip_img_path = '../../../aidazuo/jupyter-script/test-img/ip_mask_girl1.png'

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    clip_path,
    torch_dtype=torch.float16,
    use_safetensors=True
)

pipeline = StableDiffusionPipeline.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    image_encoder=image_encoder
).to("cuda")
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter(ip_adapter_path, subfolder=None, weight_name="ip-adapter-faceid-plus_sd15.bin",
                         image_encoder_folder=None)
pipeline.set_ip_adapter_scale(0.6)

image = Image.open(ip_img_path)

ref_images_embeds = []
app = FaceAnalysis(name="buffalo_l", root=ip_adapter_path, providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB)
faces = app.get(image)
image = torch.from_numpy(faces[0].normed_embedding)
ref_images_embeds.append(image.unsqueeze(0))
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")

generator = torch.Generator(device="cpu").manual_seed(42)

clip_embeds = pipeline.prepare_ip_adapter_image_embeds([image], None, torch.device("cuda"), 1, True)[0]

pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False  # True if Plus v2

images = pipeline(
    prompt="A photo of a girl",
    ip_adapter_image_embeds=[id_embeds],
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20, num_images_per_prompt=1,
    generator=generator
).images

2. Does CFG refer to the "guidance_scale" parameter? It always seems to have a value; if its value is 0, do we still need to add those two lines of code?

fabiorigano commented 2 months ago

thank you for spotting the error, it seems there is another one; I will fix the documentation in a future PR

I forgot to upload the correct preprocessing for the Face ID Plus model (the conv2d error most likely comes from passing the 512-dim insightface embedding as ip_adapter_image, while the CLIP image encoder expects an aligned face crop):

from insightface.utils import face_align

ref_images_embeds = []
ip_adapter_images = []
app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB)
faces = app.get(image)
ip_adapter_images.append(face_align.norm_crop(image, landmark=faces[0].kps, image_size=224))
image = torch.from_numpy(faces[0].normed_embedding)
ref_images_embeds.append(image.unsqueeze(0))
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")

generator = torch.Generator(device="cpu").manual_seed(42)

clip_embeds = pipeline.prepare_ip_adapter_image_embeds([ip_adapter_images], None, torch.device("cuda"), 1, True)[0]

pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False  # True if Plus v2

  2. for the Face ID models we have to prepare the inputs before passing them to the pipeline, so you have to create them as written in the example code

Honey-666 commented 2 months ago

thank you for spotting the error, it seems there is another one, I will fix documentation in a future PR

I forgot to upload the correct preprocessing for Face ID plus model:

from insightface.utils import face_align

ref_images_embeds = []
ip_adapter_images = []
app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB)
faces = app.get(image)
ip_adapter_images.append(face_align.norm_crop(image, landmark=faces[0].kps, image_size=224))
image = torch.from_numpy(faces[0].normed_embedding)
ref_images_embeds.append(image.unsqueeze(0))
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")

generator = torch.Generator(device="cpu").manual_seed(42)

clip_embeds = pipeline.prepare_ip_adapter_image_embeds([ip_adapter_images], None, torch.device("cuda"), 1, True)[0]

pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False 
  1. for the Face ID models we have to prepare the inputs before passing them to the pipeline, so you have to create it as written in the example code

With the new preprocessing method described above I have been able to pass the Plus test. Thank you very much for your answer!

jfischoff commented 1 month ago

@fabiorigano does this code work with loading multiple different ip adapters without restriction?

For instance, if I want to load a FaceID Plus v1 and a v2 adapter, is that possible? I would assume not, because how can I set

pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False 

per adapter.

Additionally, it is unclear to me how to have a collection of FaceID and non-FaceID adapters. Is that supported?

fabiorigano commented 1 month ago

Hi @jfischoff, you should be able to load both Face ID Plus models. You should pass a list with their names to the load_ip_adapter method:

pipeline.load_ip_adapter("h94/IP-Adapter-FaceID", subfolder=None, weight_name=["ip-adapter-faceid-plus_sd15.bin", "ip-adapter-faceid-plusv2_sd15.bin"])

Then, set the shortcut just for the second element of the projection layer list: pipeline.unet.encoder_hid_proj.image_projection_layers[1].shortcut = True
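
A rough, untested sketch of how this could be wired up end to end (reusing pipeline, clip_embeds, id_embeds and generator from the earlier snippets; the scales and prompt are placeholders, not a verified recipe):

# load FaceID Plus v1 and v2 together; the indices below follow the order of weight_name
pipeline.load_ip_adapter(
    "h94/IP-Adapter-FaceID",
    subfolder=None,
    weight_name=["ip-adapter-faceid-plus_sd15.bin", "ip-adapter-faceid-plusv2_sd15.bin"],
    image_encoder_folder=None,  # the CLIP image encoder was passed to the pipeline constructor
)
pipeline.set_ip_adapter_scale([0.6, 0.6])  # one scale per adapter

# both are Plus models, so both projection layers need the CLIP embeddings;
# only the v2 layer uses the shortcut
pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False  # v1
pipeline.unet.encoder_hid_proj.image_projection_layers[1].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[1].shortcut = True   # v2

# one entry in ip_adapter_image_embeds per loaded adapter, in load order
images = pipeline(
    prompt="A photo of a girl",
    ip_adapter_image_embeds=[id_embeds, id_embeds],
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20,
    generator=generator,
).images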

jfischoff commented 1 month ago

Thanks for the response @fabiorigano.

So should I set

pipeline.unet.encoder_hid_proj.image_projection_layers[i].clip_embeds = faceid_clip_embeds[i]
pipeline.unet.encoder_hid_proj.image_projection_layers[i].shortcut = is_v2[i]

for each face ip adapter?

Is it a problem if I have loaded a mix of non-FaceID IP adapters and FaceID adapters? Does that affect the index I need to use in image_projection_layers, or is image_projection_layers only used by the FaceID IP adapters? Should I set the clip_embeds for non-FaceID Plus models as well?

What about how I pass images/embeds to the pipeline when I have a mix of FaceID and non-FaceID adapters? If I'm using a FaceID model, should I include the embeddings in the same array when calling the pipeline?

fabiorigano commented 1 month ago

yes, that's correct

Each ip adapter passed in the list to the load_ip_adapter method has its corresponding image_projection_layers module, so be sure to index the correct one :)

the clip_embeds attribute is only needed for Face ID Plus models, because these adapters (v1 and v2) were trained with both CLIP image embeddings and insightface embeddings.

You can combine different IP adapters; I have tested some combinations. As mentioned above, it is not necessary to set CLIP embeddings on the other image projection modules, and you would get an error if you did, because the clip_embeds attribute doesn't exist in the other image projection classes.
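
For reference, a rough, untested sketch of the mixed case discussed here, with a regular IP-Adapter loaded first and a FaceID Plus adapter second. The parallel-list form of load_ip_adapter, the chosen weight files, and the regular_embeds variable are assumptions for illustration; clip_embeds and id_embeds are prepared as in the earlier snippets:

# assumption: load_ip_adapter accepts parallel lists of repos/subfolders/weights
pipeline.load_ip_adapter(
    ["h94/IP-Adapter", "h94/IP-Adapter-FaceID"],
    subfolder=["models", None],
    weight_name=["ip-adapter_sd15.bin", "ip-adapter-faceid-plus_sd15.bin"],
    image_encoder_folder=None,  # the CLIP image encoder is already in the pipeline
)
pipeline.set_ip_adapter_scale([0.5, 0.6])

# index 0 -> regular adapter: no clip_embeds, no shortcut to set
# index 1 -> FaceID Plus adapter: needs the CLIP face-crop embeddings
pipeline.unet.encoder_hid_proj.image_projection_layers[1].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[1].shortcut = False  # True for Plus v2

# one entry per adapter, in load order: regular_embeds (hypothetical variable holding the
# embeddings prepared for the regular adapter) and id_embeds (insightface, as above)
images = pipeline(
    prompt="A photo of a girl",
    ip_adapter_image_embeds=[regular_embeds, id_embeds],
    num_inference_steps=20,
    generator=generator,
).images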