lucidrains / perfusion-pytorch

Implementation of Key-Locked Rank One Editing, from Nvidia AI
MIT License

Any plan for training code? #5

Open shileims opened 1 year ago

shileims commented 1 year ago

Hi Author, this is a really amazing repo. I check the progress every day and hope to try it soon. Do you have an estimated timeline for finishing the training code? Thank you so much!

shileims commented 11 months ago

Sometimes I get image results like this: [image attached]

@lucidrains Just kindly asking, is there any chance of getting a code review from you?

lucidrains commented 11 months ago

@shileims haha! what concepts did you finetune?

yea, happy to code review this weekend, please do the same as @irowberry and share your code online

shileims commented 11 months ago

Hi @lucidrains, I really appreciate the code review, it will be very helpful. Here is the repo link: https://github.com/shileims/Finetuning/tree/main; main.py is the entry point, and some results are in the ckpts folder. The training concept is "teddy".

yoshibenjo commented 11 months ago

As a side note, I'm more than happy to donate compute to train this model once the review is done.

BradVidler commented 11 months ago

Any update on this?

BradVidler commented 11 months ago

@irowberry I have been testing your code. I think there's something weird happening in inference. I'm using the following code to compare before and after fine-tuning, with a fixed seed.

from perfusion_pytorch.embedding import OpenClipEmbedWrapper, EmbeddingWrapper
from perfusion_pytorch.save_load import load
from perfusion_pytorch import Rank1EditModule
import torch
from diffusers import StableDiffusionPipeline
# note: PerfusionModel used below comes from @irowberry's training code, not from perfusion_pytorch

device = "cuda:0"
generator = torch.Generator(device="cuda").manual_seed(12345)
prompts = ["photo of a person, clear face, high quality"]
superclass_string = 'person'

# baseline: stock SD 1.5 with a fixed seed
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", requires_safety_checker=False)
pipe.to(device)
image = pipe(prompts[0], num_inference_steps=30, guidance_scale=7.5, generator=generator).images[0]
image.save("test-before.png")

# load the fine-tuned perfusion weights
pipe.to("cpu")
perfusion_model = PerfusionModel(pipe.unet, pipe.text_encoder, pipe.tokenizer, superclass_string).to(device).requires_grad_(False)
load(perfusion_model, 'me_concept.pt')
perfusion_model.eval()

# wrap the text encoder so the prompt is encoded with the new concept token
pipe.text_encoder.to("cpu")
wrapped_clip_with_new_concept = OpenClipEmbedWrapper(
    pipe.text_encoder,
    tokenizer = pipe.tokenizer,
    superclass_string = superclass_string
)

text_enc, superclass_enc, mask, indices = wrapped_clip_with_new_concept(prompts)

# fresh pipeline with the fine-tuned UNet and text encoder swapped in
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", requires_safety_checker=False)
pipe.unet = perfusion_model.unet
pipe.text_encoder = perfusion_model.clip_model
pipe.to(device)

image = pipe(prompts[0], num_inference_steps=30, guidance_scale=7.5, generator=generator, cross_attention_kwargs={"concept_indices": indices, "text_enc_with_superclass": superclass_enc}).images[0]
image.save("test-after.png")

Unfortunately the results don't look like the subject (me), so I think I am missing something, or something is going wrong during training.

I tested @shileims' code briefly as well, and it also doesn't seem to be training the subject into the model, unless I am missing something there too.

One theory: it seems like we are using the same concept and superclass string, e.g. person, so when we train on "person" we don't end up seeing much of a change because that token is so common. If, for example, we could train on bradvidler and then link it to the superclass person, that should work better since bradvidler is a rare token. I think the original Dreambooth paper suggests using a rare token of 3-4 characters, e.g. ohwx. In the perfusion paper they use, e.g., Hugsy -> Teddy.
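
For what it's worth, here is a rough sketch of the Dreambooth-style rare-token idea using standard transformers tokenizer APIs; how (or whether) perfusion-pytorch would then bind that new token to the superclass is an assumption on my part, not something I've verified against the library:

import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder")

# register a rare placeholder token for the concept, Dreambooth-style
tokenizer.add_tokens("bradvidler")
text_encoder.resize_token_embeddings(len(tokenizer))

# start the new embedding at the superclass embedding, so training begins near "person"
with torch.no_grad():
    embeds = text_encoder.get_input_embeddings().weight
    new_id = tokenizer.convert_tokens_to_ids("bradvidler")
    superclass_id = tokenizer("person", add_special_tokens=False).input_ids[0]
    embeds[new_id] = embeds[superclass_id].clone()

# training prompts would then use "a photo of a bradvidler person" while the
# key-locked cross-attention edit stays keyed to the superclass "person" --
# wiring that into perfusion_pytorch's wrappers is the open question here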

If @lucidrains is able to comment on this, that would be much appreciated. I am specifically looking for clarity on this earlier comment: https://github.com/lucidrains/perfusion-pytorch/issues/5#issuecomment-1694393325. Thanks in advance!

irowberry commented 11 months ago

@BradVidler are you getting somewhat coherent images out? Mine are always abstract shapes. I followed the Hugging Face inference code directly, which can be found here. I'll keep working on it.

Okay, I am not sure what I'm doing wrong, but these are the sorts of images I'm getting: [images attached]

BradVidler commented 11 months ago

The images from the code I posted look like normal SD 1.5 photos; they just don't resemble the subject I am training. I've tried training up to 5000 steps and still don't see any difference. Another thing I've noticed is that the saved weights are 2.5 MB, as opposed to the expected ~100 KB. In @shileims' repo the weights are saved at 200 KB, which seems a lot closer.
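
For reference, here is a quick way to sanity-check what actually ended up in the saved file, assuming save_load.save writes an ordinary torch checkpoint whose values are tensors or nested dicts of tensors:

import torch

ckpt = torch.load('me_concept.pt', map_location='cpu')

def walk(obj, prefix=''):
    # recursively print tensor shapes and sum their bytes, so nested dicts are counted too
    total = 0
    if torch.is_tensor(obj):
        print(prefix, tuple(obj.shape))
        total += obj.numel() * obj.element_size()
    elif isinstance(obj, dict):
        for key, value in obj.items():
            total += walk(value, f"{prefix}{key}.")
    return total

print(f"~{walk(ckpt) / 1024:.1f} KB of tensor data")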

I will try digging in again when I have some time.

irowberry commented 11 months ago

@BradVidler Yeah, that's interesting. My goal is to have it work with Hugging Face models, but I am getting similar results to yours with your inference code; it seems it isn't picking up on the concept. I'll have to see what is wrong with my inference code; I'm not sure why it isn't working.

BradVidler commented 11 months ago

Yes, exactly. Making it work with Hugging Face is definitely ideal. My inference code uses Hugging Face's diffusers library directly, whereas your code reimplements the under-the-hood logic rather than using the diffusers library, which takes care of all that automatically.

Have you tested your inference code on default SD1.5 without any finetuning? Wondering if you get normal results that way.

EDIT: I believe I had a mistake in my inference code: I was passing the non-finetuned CLIP model into the OpenClipEmbedWrapper. After 2000 steps of training it does seem like it's getting closer. Here is what I think is the correct code:

wrapped_clip_with_new_concept = OpenClipEmbedWrapper(
    perfusion_model.clip_model,
    tokenizer = pipe.tokenizer,
    superclass_string = superclass_string
)

Also wondering if there is a way to specify batch size in the training. If it's defaulting to a batch size of 1 we might need to train a lot longer.

EDIT: Found the batch size, but I cannot increase it without running out of memory, probably due to having a diffusers pipeline in memory.
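
One generic workaround, not specific to this repo, would be gradient accumulation; this is just a sketch assuming the training loop exposes the loss and optimizer directly (dataloader, optimizer and perfusion_model here are placeholders for whatever the trainer actually uses):

accumulation_steps = 4  # effective batch size = 4 x the per-step batch size
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = perfusion_model(**batch)            # placeholder: however the trainer computes its loss
    (loss / accumulation_steps).backward()     # scale so the accumulated gradient matches one larger batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()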

I modified the inference function in your training code like so:

def inference(model):
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", requires_safety_checker=False)
    generator = torch.Generator(device="cuda").manual_seed(12345)
    prompts = ["photo of a person"]
    superclass_string = 'person'

    # round-trip through save/load so inference runs on exactly what gets checkpointed
    save_load.save(model, 'me_concept.pt')
    inference_model = PerfusionModel(pipe.unet, pipe.text_encoder, pipe.tokenizer, superclass_string).to(device).requires_grad_(False)
    save_load.load(inference_model, 'me_concept.pt')
    inference_model.eval()
    inference_model.to("cpu").requires_grad_(False)

    # wrap the fine-tuned text encoder so the new concept token is injected into the prompt
    wrapped_clip_with_new_concept = OpenClipEmbedWrapper(
        inference_model.clip_model,
        tokenizer = pipe.tokenizer,
        superclass_string = superclass_string
    )

    text_enc, superclass_enc, mask, indices = wrapped_clip_with_new_concept(prompts)

    # swap the fine-tuned UNet and text encoder into the diffusers pipeline
    pipe.unet = inference_model.unet
    pipe.text_encoder = inference_model.clip_model
    pipe.to(device)

    image = pipe(prompts[0], num_inference_steps=30, guidance_scale=7.5, generator=generator, cross_attention_kwargs={"concept_indices": indices, "text_enc_with_superclass": superclass_enc}).images[0]

    return image

and I am outputting a sample every 250 steps. I ran training for 8000 steps, using the same seed, and the output changes but it never ends up looking like the subject.
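
(The cadence itself is just a check in the training loop, roughly like the following, with step and model standing in for whatever the trainer actually names them:)

if step % 250 == 0:
    # save a periodic sample using the inference() helper above
    inference(model).save(f"sample_{step:05d}.png")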

BradVidler commented 11 months ago

Another theory: your code uses EmbeddingWrapper, which doesn't create a new nn.Identity for the new concept, whereas OpenClipEmbedWrapper does. Might be something to look into.
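
One way to check that, assuming the wrappers are ordinary nn.Modules, is to list which parameters the wrapper actually adds on top of the frozen text encoder and see whether a concept embedding shows up for EmbeddingWrapper the way it does for OpenClipEmbedWrapper:

# print whatever trainable parameters the wrapper introduces
for name, param in wrapped_clip_with_new_concept.named_parameters():
    if param.requires_grad:
        print(name, tuple(param.shape))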

irowberry commented 11 months ago

I've discovered something interesting. Using the pipeline, there are two keyword arguments, prompt and prompt_embeds. When using prompt (text) I get an image that is a dog (the superclass), but not the one I trained it on. When using prompt_embeds I get a garbage image out (it often triggers the NSFW detection filter for some reason). Here's my inference code. I'm not sure what is causing the difference, but I'll keep exploring.

Update: If you pass in text to the pipe, it never calls any of the EmbeddingWrapper layer code.

def inference(prompts, pipe, model):
    # encode the prompt through the EmbeddingWrapper so the concept token is injected
    embeds_with_new_concept, embeds_with_superclass, embed_mask, concept_indices = model.wrapped_embeds(prompts)
    enc_with_new_concept = model.clip_model.text_model.encoder(embeds_with_new_concept).last_hidden_state

    # the superclass embeddings are optional; only encode them if present
    if embeds_with_superclass is not None:
        enc_with_superclass = model.clip_model.text_model.encoder(embeds_with_superclass).last_hidden_state
    else:
        enc_with_superclass = embeds_with_superclass

    # swap the fine-tuned modules into the pipeline and pass the embeddings directly
    pipe.unet = model.unet
    pipe.text_encoder = model.clip_model
    pipe.to(device)
    images = pipe(prompt_embeds=enc_with_new_concept,
                  num_inference_steps=30,
                  guidance_scale=7.5,
                  cross_attention_kwargs={"concept_indices": concept_indices, "text_enc_with_superclass": enc_with_superclass}).images
    return images

BradVidler commented 11 months ago

Not having any luck here. I have also been testing @shileims' repo, to no avail. In that one I noticed it was using an outdated version of perfusion-pytorch, so I updated it and then had to make some slight modifications to get it working again. I trained for an hour on my 4090 and the results don't look like the subject. I have also tried increasing the learning rates 10x and still got the same result.

Not too sure where to go from here.

irowberry commented 11 months ago

Yeah, me neither. I think the cross-attention modules are training well, because the model still generates coherent pictures of whatever the superclass is when I pass text through the regular encoder. It's when I pass the encodings/embeddings from the EmbeddingWrapper module into the pipe that I get images of essentially just noise. @lucidrains any thoughts?
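
One control experiment that might narrow it down, using only standard diffusers calls (nothing perfusion-specific assumed): build prompt_embeds by hand from the pipe's own tokenizer and text encoder, bypassing the EmbeddingWrapper entirely, and feed those to the same edited pipe. If that also comes out as noise, the problem is in how prompt_embeds interacts with the patched cross-attention rather than in the EmbeddingWrapper itself.

tokens = pipe.tokenizer(
    "a photo of a dog",
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    plain_embeds = pipe.text_encoder(tokens.input_ids.to(device))[0]

# same pipe, same cross-attention edits, but embeddings computed the ordinary way
image = pipe(prompt_embeds=plain_embeds, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("prompt_embeds_control.png")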

BradVidler commented 11 months ago

I spoke to Kohaku and they said that perfusion is incredibly complex and might require patching a lot of stuff to get inference to work properly. They also suggested looking into pivotal tuning / rank-1 LoRA, which is implemented in hcpdiff, so I'm changing focus to that.

If that doesn't work out I may commission someone to write the perfusion trainer.

Feel free to reach out to me on Discord if you'd like to collab more. My username is Viddleshtix.

lucidrains commented 11 months ago

> Yeah, me neither. I think the cross-attention modules are training well, because the model still generates coherent pictures of whatever the superclass is when I pass text through the regular encoder. It's when I pass the encodings/embeddings from the EmbeddingWrapper module into the pipe that I get images of essentially just noise. @lucidrains any thoughts?

sorry, been busy with other work

you should reach out to the author and see if he's willing to do a code review. i'm sure in the interest of his paper, he would want to see it reproduced in the wild

IndolentKheper commented 10 months ago

So nobody's been able to get this working, huh? A shame, a lot of us were hoping this would be the best new training method.

PaulToast commented 10 months ago

> So nobody's been able to get this working, huh? A shame, a lot of us were hoping this would be the best new training method.

Heyo, fairly inexperienced research intern here; I've been working on this for the past 1-2 weeks. My main starting point was this repo, to get things going as quickly as possible (I'm not really capable enough to set up my own training code from scratch yet).

I managed to get the training going, but with "mixed results", to say the least, so I'd love to have this whole discussion active again too. ^^ I ran into quite a few issues and I don't have the best hardware available either, but see if the repo I linked is of any help.

irowberry commented 10 months ago

@PaulToast you can check out my repo and see if you can get it working. I might start working on it again. I think there is something wrong with the TextEmbedding wrapper, as the fine-tuned model generates normal images when using a non-trained text encoder.

IndolentKheper commented 10 months ago

> > So nobody's been able to get this working, huh? A shame, a lot of us were hoping this would be the best new training method.
>
> Heyo, fairly inexperienced research intern here; I've been working on this for the past 1-2 weeks. My main starting point was this repo, to get things going as quickly as possible (I'm not really capable enough to set up my own training code from scratch yet).
>
> I managed to get the training going, but with "mixed results", to say the least, so I'd love to have this whole discussion active again too. ^^ I ran into quite a few issues and I don't have the best hardware available either, but see if the repo I linked is of any help.

Yeah, I did check out that repo a while back; some of the same folks, like shileims, were trying it, but nobody seemed to get anything useful working, and then the updates stopped, much like here. I'm kind of skeptical about both key-locking papers regarding how much this is actually teaching a unique concept, versus image/token referencing and iterating, a la IP-Adapter, from which I've gotten results very similar to the input images. Is key-locking even that much different from IP-Adapter at this point?