tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
Apache License 2.0

Color problem #187

Open gagbaghdas opened 9 months ago

gagbaghdas commented 9 months ago

Hey guys. Does anyone have an idea what I'm doing wrong? Something is wrong with the colors here (( I can't find the problem. Here are the initial, prompt, mask, and result images. As you can see, the RED hoodie becomes gray in the result :D

Screenshot 2023-12-20 at 21 37 20

Here is the relevant part of my code, method inPaintingUsingIPAdapter:

import torch
from diffusers import StableDiffusionInpaintPipelineLegacy, DDIMScheduler, AutoencoderKL
from diffusers.utils import load_image

from IPAdapter.ip_adapter.ip_adapter import IPAdapter

class IPAdapterProcessor:
    def __init__(self):
        self.noise_scheduler = DDIMScheduler(
            num_train_timesteps=1000,
            beta_start=0.00085,
            beta_end=0.012,
            beta_schedule="scaled_linear",
            clip_sample=False,
            set_alpha_to_one=False,
            steps_offset=1,
        )
        self.vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(dtype=torch.float32)
        self.base_model_path = "runwayml/stable-diffusion-v1-5"
        self.image_encoder_path = "IPAdapter/models/image_encoder/"
        self.ip_ckpt = "IPAdapter/models/ip-adapter_sd15.bin"
        self.device = "cpu"

    def inPaintingUsingIPAdapter(self, initial_image_url, prompt_image_url, mask_image_url, output_dir):

        pipe = StableDiffusionInpaintPipelineLegacy.from_pretrained(
            self.base_model_path,
            torch_dtype=torch.float32,
            scheduler=self.noise_scheduler,
            vae=self.vae,
            feature_extractor=None,
            safety_checker=None
        )

        ip_model = IPAdapter(pipe, self.image_encoder_path, self.ip_ckpt, self.device)
        initial_image = load_image(initial_image_url).resize((512, 768))
        initial_image.show()
        prompt_image = load_image(prompt_image_url).resize((512, 768))
        prompt_image.show()
        mask_image = load_image(mask_image_url).resize((512, 768))
        mask_image.show()

        images = ip_model.generate(
            pil_image=prompt_image,
            num_samples=1,
            num_inference_steps=50,
            seed=42,
            image=initial_image,
            mask_image=mask_image,
            strength=0.7,
        )
        for img in images:
            img.resize((512, 768)).show()

Any help would be appreciated.

Thanks in advance.

xiaohu2015 commented 9 months ago

you can try a higher strength

gagbaghdas commented 9 months ago

you can try a higher strength

Great @xiaohu2015, thank you. Setting strength to 1 solved the problem.
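
For reference, a minimal sketch of the adjusted call; only the strength argument changes from the snippet above:

images = ip_model.generate(
    pil_image=prompt_image,
    num_samples=1,
    num_inference_steps=50,
    seed=42,
    image=initial_image,
    mask_image=mask_image,
    strength=1.0,  # raised from 0.7 so the masked region is repainted from fully noised latents
)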

Screenshot 2023-12-21 at 08 47 19

Now the result is 95% close to the original image :D By the way, you can see the slight differences; is there anything I can do to get a 100% identical result? ))

xiaohu2015 commented 9 months ago

Currently it cannot achieve 100%; maybe you can train such an adapter for clothes.

gagbaghdas commented 9 months ago

Currently it cannot achieve 100%; maybe you can train such an adapter for clothes.

Thanks for the info. Do you know approximately what kind of resources I will need to train it?

whiterose199187 commented 9 months ago

hello,

Also, is there a training script for this? Can we re-use existing ones?

xiaohu2015 commented 9 months ago

Currently it cannot achieve 100%; maybe you can train such an adapter for clothes.

Thanks for the info. Do you know approximately what kind of resources I will need to train it?

It's hard to say without doing experiments.

gagbaghdas commented 9 months ago

Currently it cannot achieve 100%; maybe you can train such an adapter for clothes.

Thanks for the info. Do you know approximately what kind of resources I will need to train it?

It's hard to say without doing experiments.

Thanks. And regarding the dataset, what do you think about the DeepFashion2 dataset?

xiaohu2015 commented 9 months ago

Currently it cannot achieve 100%; maybe you can train such an adapter for clothes.

Thanks for the info. Do you know approximately what kind of resources I will need to train it?

It's hard to say without doing experiments.

Thanks. And regarding the dataset, what do you think about the DeepFashion2 dataset?

I think it is OK

gagbaghdas commented 9 months ago

@xiaohu2015 Great, thank you. Then I'm going to try to train it on clothes. I'll ping here in case of any questions or problems.

xiaohu2015 commented 9 months ago

OK

gagbaghdas commented 8 months ago

@xiaohu2015 I've started the training, but at some point it seems it was interrupted. Here is my last checkpoint:

Screenshot 2024-01-23 at 23 43 00

And from the logs I can see the last step:

 Epoch 8, step 4988

So I need to continue the training, right? If so, how can I make it continue from the checkpoint? Should I just set the pretrained model path to the last checkpoint, or is there anything else I should do?

xiaohu2015 commented 8 months ago

accelerator.print(f"Resuming from checkpoint {path}")
accelerator.load_state(os.path.join(args.output_dir, path))
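
A minimal sketch of how that resume logic fits together, assuming checkpoints were saved with accelerator.save_state() into folders named checkpoint-<step> inside args.output_dir (the folder naming and the args/accelerator objects come from the training script and are assumptions here):

import os

# Pick the most recent checkpoint folder (assumed "checkpoint-<step>" naming).
dirs = [d for d in os.listdir(args.output_dir) if d.startswith("checkpoint")]
dirs.sort(key=lambda d: int(d.split("-")[1]))
path = dirs[-1]

# Restore model, optimizer, and scheduler state saved by accelerator.save_state().
accelerator.print(f"Resuming from checkpoint {path}")
accelerator.load_state(os.path.join(args.output_dir, path))
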
gagbaghdas commented 8 months ago

@xiaohu2015 what about this approach?

xiaohu2015 commented 8 months ago

@xiaohu2015 what about this approach?

yes, it works

gagbaghdas commented 8 months ago

@xiaohu2015 By the way, the starting model path is --pretrained_model_name_or_path="stable-diffusion-v1-5/". Is it OK to fine-tune the IP-Adapter on clothes this way, or should I use ip-adapter_sd15.bin as the pretrained model? I mean, maybe I'm doing something wrong and training it from scratch?

xiaohu2015 commented 8 months ago

@xiaohu2015 By the way, the starting model path is --pretrained_model_name_or_path="stable-diffusion-v1-5/". Is it OK to fine-tune the IP-Adapter on clothes this way, or should I use ip-adapter_sd15.bin as the pretrained model? I mean, maybe I'm doing something wrong and training it from scratch?

Hi, for clothes I think you should use the ip-adapter-full model (using all token features of CLIP or DINO).

gagbaghdas commented 8 months ago

@xiaohu2015 so you mean the --pretrained_model_name_or_path should point to ip-adapter-full instead of stable-diffusion-v1-5/?

xiaohu2015 commented 8 months ago

@xiaohu2015 so you mean the --pretrained_model_name_or_path should point to ip-adapter-full instead of stable-diffusion-v1-5/?

I mean this ip-adapter: https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter/ip_adapter.py#L316

gagbaghdas commented 8 months ago

@xiaohu2015 so you mean the --pretrained_model_name_or_path should point to ip-adapter-full instead of stable-diffusion-v1-5/?

I mean this ip-adapter: https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter/ip_adapter.py#L316

Sorry for the "stupid" questions, I'm new to this (( So you mean I should use the full model in my training code, i.e. instead of this?

class IPAdapter(torch.nn.Module):
    """IP-Adapter"""
    def __init__(self, unet, image_proj_model, adapter_modules, ckpt_path=None):
        super().__init__()
        self.unet = unet
        self.image_proj_model = image_proj_model
        self.adapter_modules = adapter_modules

        if ckpt_path is not None:
            self.load_from_checkpoint(ckpt_path)

    def forward(self, noisy_latents, timesteps, encoder_hidden_states, image_embeds):
        ip_tokens = self.image_proj_model(image_embeds)

        encoder_hidden_states = torch.cat([encoder_hidden_states, ip_tokens], dim=1)
        # Predict the noise residual
        noise_pred = self.unet(noisy_latents, timesteps, encoder_hidden_states).sample
        return noise_pred

    def load_from_checkpoint(self, ckpt_path: str):
        # Calculate original checksums
        orig_ip_proj_sum = torch.sum(torch.stack([torch.sum(p) for p in self.image_proj_model.parameters()]))
        orig_adapter_sum = torch.sum(torch.stack([torch.sum(p) for p in self.adapter_modules.parameters()]))

        state_dict = torch.load(ckpt_path, map_location="cpu")

        # Load state dict for image_proj_model and adapter_modules
        self.image_proj_model.load_state_dict(state_dict["image_proj"], strict=True)
        self.adapter_modules.load_state_dict(state_dict["ip_adapter"], strict=True)

        # Calculate new checksums
        new_ip_proj_sum = torch.sum(torch.stack([torch.sum(p) for p in self.image_proj_model.parameters()]))
        new_adapter_sum = torch.sum(torch.stack([torch.sum(p) for p in self.adapter_modules.parameters()]))

        # Verify if the weights have changed
        assert orig_ip_proj_sum != new_ip_proj_sum, "Weights of image_proj_model did not change!"
        assert orig_adapter_sum != new_adapter_sum, "Weights of adapter_modules did not change!"

        print(f"Successfully loaded weights from checkpoint {ckpt_path}")

xiaohu2015 commented 8 months ago

@xiaohu2015 so you mean the --pretrained_model_name_or_path should point to ip-adapter-full instead of stable-diffusion-v1-5/?

I mean this ip-adapter: https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter/ip_adapter.py#L316

Sorry for the "stupid" questions, I'm new to this (( So you mean I should use the full model in my training code, i.e. instead of this?

[quoted code omitted]

no, here "full" means use all token features of clip

gagbaghdas commented 8 months ago

So you mean instead of this

# ip-adapter
image_proj_model = ImageProjModel(
    cross_attention_dim=unet.config.cross_attention_dim,
    clip_embeddings_dim=image_encoder.config.projection_dim,
    clip_extra_context_tokens=4,
)
ip_adapter = IPAdapter(unet, image_proj_model, adapter_modules, args.pretrained_ip_adapter_path)

I should use this one?

class IPAdapterFull(IPAdapterPlus):
    """IP-Adapter with full features"""

    def init_proj(self):
        image_proj_model = MLPProjModel(
            cross_attention_dim=self.pipe.unet.config.cross_attention_dim,
            clip_embeddings_dim=self.image_encoder.config.hidden_size,
        ).to(self.device, dtype=torch.float32)
        return image_proj_model

Sorry again for these kinds of questions; I want to be 100% sure I'm not doing anything wrong :D

xiaohu2015 commented 8 months ago

yes

gagbaghdas commented 8 months ago

yes

Great, thanks. I'll get back here with results (hopefully) or questions :D

gagbaghdas commented 8 months ago

@xiaohu2015 I'm a bit stuck here: using IPAdapterFull from the repo requires an SD pipeline, but my current training code doesn't use one. Can I share my current training code with you, so you can give me feedback on what can be changed or improved?

xiaohu2015 commented 8 months ago

@xiaohu2015 I'm a bit stuck here: using IPAdapterFull from the repo requires an SD pipeline, but my current training code doesn't use one. Can I share my current training code with you, so you can give me feedback on what can be changed or improved?

Hi, I mean you can use this projection net (https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter/ip_adapter.py#L316) on the features extracted from the CLIP model; the features are then used as the keys and values of the cross-attention layers of the IP-Adapter.
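
A minimal sketch of that change on the training side, reusing names from the training snippet earlier in this thread (unet, image_encoder, MLPProjModel); clip_images is a hypothetical batch of preprocessed prompt images, and taking the penultimate hidden layer is an assumption:

# Project all CLIP token features instead of the single pooled embedding.
image_proj_model = MLPProjModel(
    cross_attention_dim=unet.config.cross_attention_dim,
    clip_embeddings_dim=image_encoder.config.hidden_size,
)

# Per-token hidden states from the CLIP vision encoder, not the pooled projection
# (penultimate layer assumed here).
image_embeds = image_encoder(clip_images, output_hidden_states=True).hidden_states[-2]

# The projected tokens feed the IP cross-attention layers as keys and values.
ip_tokens = image_proj_model(image_embeds)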

kadirnar commented 6 months ago

@gagbaghdas ,

Hi, I want to train with the VITON-HD dataset. I trained the model, but it gives an error when testing it. Have you tested it?

Remosy commented 6 months ago

Thank you guys, I'm in, gonna try it too.

hongminpark commented 5 months ago

@gagbaghdas Hi! I was trying to train my own IP-Adapter for clothes. How was your result? Was it successful? I would appreciate it if you could share your experiment :) :)

masaisai111 commented 5 months ago

Do you use random noise during inference? And if I want to replace the random noise with a noised photo, what should I do?

gagbaghdas commented 5 months ago

@gagbaghdas Hi! I was trying to train my own IP-Adapter for clothes. How was your result? Was it successful? I would appreciate it if you could share your experiment :) :)

Hey, unfortunately not; there were some issues, especially with different body sizes, and I switched to something else.

wangzhen-ing commented 1 month ago

@xiaohu2015 By the way, the starting model path is --pretrained_model_name_or_path="stable-diffusion-v1-5/". Is it OK to fine-tune the IP-Adapter on clothes this way, or should I use ip-adapter_sd15.bin as the pretrained model? I mean, maybe I'm doing something wrong and training it from scratch?

Hi, for clothes I think you should use the ip-adapter-full model (using all token features of CLIP or DINO).

Hello, I have a question: why can IPAdapterFull extract more detailed clothing information? IPAdapterFull is simpler than the IPAdapterPlus (Resampler) model; it only has an MLPProjModel. Thanks!