Open shileims opened 1 year ago
hey yea, i'll circle back to it
you should give it a try though. hard part is done and all the pieces are there, for both single and multi-concept
Would you mind giving a little advice on how to run it for us non-developers? Like I understand how to install and all that, but actually running the training isn't going to be easy for me and a lot of others to figure out on our own.
+1 as I am also very interested if you have the time. I'm mostly wondering the broad order of what has to be done, and how we can go about saving the "100kb file" and apply it to a diffusers pipeline.
i would recommend starting with xiao's dreambooth / SD repository first, since that is what the authors used. Yoad also offered to further debug once we had something wired together
at this point, you'll still need to find a developer / researcher / ML engineer. i'll try to get it to the point where it is just a command line when i find time
broad order, for any entry level ML engineer who is looking to up their skills
~1. run a subset of laion dataset texts through the function for generating the input covariance~
~2. determine the super concept id (must be 1 token id) from embedding table~
3. wrap the CLIP instance with `OpenClipEmbedWrapper`, passing it also your `superclass_string`
~4. modify the BPE string to id mapping for your new *concept*~
5. wire up the `Rank1EditModule` into the key / value projections of the cross attention layers
6. pass the embeddings from the `OpenClipEmbedWrapper` through the cross attention, and into the `Rank1EditModule` from step 5
7. use `get_finetune_optimizer` on your wired up stable diffusion instance. it should automatically extract only the small number of finetuneable parameters
8. use the `save` and `load` functions (just pass it in the wired stable diffusion instance)

i realize there's still a lot of steps, i'll try to do a bit more damage next week. just writing that up, i realize for number 6, i could also take care of the whole business of fetching the embedding with the concept substituted with the super class concept in the embedding wrapper.
actually, i'll take care of 6 tomorrow morning. that's probably the remaining tool needed before operating on SD
the input covariance also shouldn't change, so perhaps someone can run this for the clip being used in SD, and we can just check in the tensor, use it as a default. that would save us from passing around this extra matrix
Here's what I got for the input covariance by taking the first 100k captions from the LAION 400m dataset and running it through the function (perhaps it can be verified in the future):
```python
tensor([[ 9.3166e+00, -4.6043e-02, -1.1625e-01,  ...,  6.4781e-02, -1.4543e+00,  8.6031e-01],
        [-4.6043e-02,  1.0780e+01, -2.8464e-01,  ...,  7.0342e-01, -2.0617e-01, -3.6634e-01],
        [-1.1625e-01, -2.8464e-01,  1.0130e+01,  ...,  3.4655e-01, -2.7999e-01,  2.1443e-03],
        ...,
        [ 6.4781e-02,  7.0342e-01,  3.4655e-01,  ...,  1.0315e+01, -6.7360e-01,  1.0505e+00],
        [-1.4543e+00, -2.0617e-01, -2.7999e-01,  ..., -6.7360e-01,  1.2800e+01, -2.2284e+00],
        [ 8.6031e-01, -3.6634e-01,  2.1443e-03,  ...,  1.0505e+00, -2.2284e+00,  1.4173e+01]])
```
Please note there is a slight bug in the calculation function. Its first line needs to be removed, since it tries to process all the data at once instead of batching it, which maxes out memory and crashes.
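The batched fix can be sketched as a running accumulation of the sum of outer products. This is a minimal sketch, not the repo's actual code; `embed_fn` and the function name here are hypothetical stand-ins for the real text-encoding call:

```python
import torch

def input_covariance(embed_fn, texts, batch_size=256):
    """Accumulate E[x xᵀ] over batches so the full dataset never has to
    fit in memory at once (the fix described above). `embed_fn` is assumed
    to map a batch of captions to a (batch, dim) tensor of text encodings."""
    cov = None
    count = 0
    for i in range(0, len(texts), batch_size):
        x = embed_fn(texts[i:i + batch_size])  # (batch, dim)
        outer = x.t() @ x                      # running sum of outer products
        cov = outer if cov is None else cov + outer
        count += x.shape[0]
    return cov / count
```

Since only a `(dim, dim)` accumulator is kept around, memory use is constant in the number of captions.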
I calculated the covariance on the ViT-L-14 CLIP model since that's what SD 1.5 uses. My `OpenClipAdapter` looks like this:

```python
OpenClipAdapter(name="ViT-L-14", pretrained="laion400m_e32", tokenizer_name="ViT-L-14")
```
I saved the output with torch.save and have attached it here. Hope this helps.
@BradVidler amazing Brad! thank you so much 🙏
i'll get this checked in
@BradVidler checked it in for version 0.1.6! we can now scratch the first step off the list haha
also addressed returning the prompt with superclass embed, if in training mode
also added a `save` and `load` function this morning. if you read the file, it should be self explanatory. since the keys are locked to the super class concept output, there is still room to whittle down the size of the saved package even more
will get around to it next week
Would love to contribute where I can, it looks like you're knocking out that list quickly though.
@irowberry sure, you can submit a PR at any time, or try wiring it up to SD 1.5 and give it a go
I believe step 2 can be achieved with the diffusers library like so:

```python
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.to("cuda")

super_concept = "person"
super_concept_token = pipeline.tokenizer.encode(super_concept)[1:-1]
assert len(super_concept_token) == 1, "Super concept token should be 1 token."
print(super_concept_token)
```
I'm using a slice because the first and last tokens are automatically start/end tokens, like this: [49406, 2533, 49407]
If it makes sense, open clip can be used as well with the tokenize function:

```python
from open_clip import tokenizer
tokenizer.tokenize("person")
```
I assumed that going straight to the tokenizer that is loaded with the diffusers model is the better option to ensure we are working with the correct tokenizer version, but both ways end up with the same tokens.
@BradVidler hey Brad! yes, that is correct, and I can take care of that, since a number of text-to-image models all use the same open clip tokenizer. we can just use the one from open_clip
it occurred to me (thanks to your comment) that we can strike off both 2 and 4, by simply having all the prompts include only the superclass concept string. so, if i were trying to perfuse my dog into the weights, i would simply prepare a bunch of prompts with the superclass string `dog`, and at `forward`, substitute `dog` with the new concept id. therefore, no modification of the BPE dictionary is needed

the `EmbeddingWrapper` would simply receive the `superclass_concept_string: str` as well as the `tokenizer_encode: callable`, and when fine tuning, you would just use `dog`
edit: probably a bit confusing, but let me just get the code down and it will be self explanatory
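A rough sketch of the substitution idea described above, with made-up token ids (in practice the new concept id would be managed by the embedding wrapper, not hardcoded like this):

```python
import torch

def substitute_concept(token_ids, superclass_id, concept_id):
    """Replace every occurrence of the superclass token id (e.g. the id
    for 'dog') with the new concept id, so the prompt can be written
    entirely in terms of the superclass and no BPE modification is needed."""
    out = token_ids.clone()
    out[out == superclass_id] = concept_id
    return out

# toy example: 1929 stands in for 'dog'; 49408 is a hypothetical new
# concept id just past the original vocabulary
prompt_ids = torch.tensor([49406, 320, 1125, 539, 320, 1929, 49407])
new_ids = substitute_concept(prompt_ids, superclass_id=1929, concept_id=49408)
```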
https://github.com/lucidrains/perfusion-pytorch/commit/03059d94b0ed3be1fe1d4ce5262ca99b287acdba this should knock off both 2 and 4 above
welcome any code reviews
i'm also going to cross off 8, as you just use the `save` and `load` functions
ok, will get back to this next week. have a happy weekend!
Hi @lucidrains , do you have a plan to complete the training based on SD 1.5? If possible, would you add some instructions here for finishing the code? I have a few days off, so I will try to help here. Thanks
@shileims I've almost finished some VERY rough training code. The last step I'm working on is finding a way to take the output embeddings from the `EmbeddingWrapper` class and pass them through the rest of the text encoder to get the final text encodings that are then passed into the `Rank1EditModule`s.
Hi @irowberry , You are really awesome!
Just completed a successful forward and backward pass. I still need to find a way to pass the modified embeddings into the rest of the text encoder, but hopefully I can train a full test model soon. I counted the number of trainable params and compared it to the total: it's about 14% of all weights (142243008 vs 1001776452). Not sure if this is correct?
hi @irowberry , the paper claims the extra weights come out to only 100kb. Does that hold here?
@irowberry nice! yea, i think weaving the text token ids through the clip text encoder is tricky, as you'll need to do some model surgery to omit the wrapped embedding. maybe i should offer some function where you pass in the path to the text encoder embedding, and it returns the wrapped version, and also substitutes the original one with a nn.Identity()
so 14% does not sound right. the number of parameters being optimized should be minuscule. are you using this function to fetch the parameters for the optimizer?
@lucidrains I am using that function but I did forget to freeze the text encoder layers (whoops), that 14% sounded too big to me too. Now it's just 1.9% of the total parameters (19182528 / 1001776452).
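A small sketch of the freeze-and-count check discussed above, on a toy model (module names like `concept_embed` are hypothetical stand-ins, not the repo's actual modules):

```python
import torch.nn as nn

def freeze_except(model, trainable_keywords):
    """Freeze all parameters except those whose name contains one of the
    given keywords, then report trainable vs total parameter counts."""
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in trainable_keywords)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# toy model standing in for the wired-up SD + text encoder
model = nn.ModuleDict({
    'text_encoder': nn.Linear(64, 64),   # should be frozen
    'concept_embed': nn.Embedding(1, 64) # the tiny finetuneable part
})
trainable, total = freeze_except(model, ['concept_embed'])
```

Printing `trainable / total` for the real model is a quick sanity check that only the small perfusion subset is being optimized.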
@irowberry 🚀 bring it home
@irowberry added yet another wrapper that should help
@irowberry @lucidrains , by using the `save` function in the save_load.py file, I get 50304 trainable parameters out of 1066261035 total. The .pth file is 210KB without C_inv and 2514KB with it. Any suggestions here? The paper says 100KB...........
@shileims close enough! I believe it is because I am redundantly saving the concept outputs for the keys across all cross attention layers. since it is locked, it only needs to be stored once
I can work on whittling down the size once / if you see some initial results
Hi @lucidrains , thanks for your reply. I will try to hookup the pipeline. Thanks
Thoughts on loss functions? I'm using MSE to compare the noise to the predicted noise:

```python
loss = F.mse_loss(pred, target).mean()
```
However, I've seen a lot of models use the mask as well, I'm not sure which one to use for this.
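For reference, a hedged sketch of both options: the masked variant below simply reweights the per-pixel MSE by a segmentation mask, which may differ from the exact weighting a given paper uses:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(pred, target, image_mask=None):
    """Plain noise-prediction MSE; optionally modulate the per-pixel loss
    by a segmentation mask so only the masked region drives the gradient.
    The masked form is one common choice, not necessarily the paper's."""
    if image_mask is None:
        return F.mse_loss(pred, target)
    per_pixel = F.mse_loss(pred, target, reduction='none')
    return (per_pixel * image_mask).sum() / image_mask.sum().clamp(min=1)
```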
Follow-up question: how should `text_mask`, `attention_mask` and `image_mask` be used in the pipeline? I am a little bit confused.
@irowberry just stick with the normal loss for now until you see some signal
we can see how good the technique is on its own
@shileims yea, masking is the most confusing part of learning attention
there is only one mask you need to worry about. the `text_mask` becomes the `key_padding_mask` in the cross attention layer (as the text tokens themselves become the keys / values). in other words, `text_mask` and the `attention_mask` are the same

the `image_mask` is something different. it is a segmentation mask outputted by another pretrained model, meant to help better guide the model during fine tuning by modulating the loss. basically what @irowberry asked about in the comment above. i say just ignore that for now
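A minimal sketch of what "the text_mask becomes the key padding mask" means, with toy shapes (this is illustrative, not the repo's attention code):

```python
import torch

def cross_attend(q, k, v, text_mask):
    """Minimal cross attention with a key padding mask.
    q: (b, n, d) image queries; k, v: (b, m, d) text keys / values;
    text_mask: (b, m) bool, False at padded text positions. Padded
    positions are masked out before the softmax, so they never
    contribute to the output."""
    sim = (q @ k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    sim = sim.masked_fill(~text_mask[:, None, :], float('-inf'))
    return sim.softmax(dim=-1) @ v
```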
hi @lucidrains , but in the `MemoryEfficientCrossAttention` class from the official SD repo, the mask is expected to be None.
@shileims oh that's interesting
how about the regular cross attention class?
Hi @lucidrains yes, the regular cross attention class copes with the mask.
Should we switch to regular cross attention?
@shileims yes, i think so
i'm actually surprised the memory efficient cross attention would work without masking, unless it is intended for inference with one batch sample at a time (no key padding)
hi @lucidrains , does `ema_concept_text_encs` require gradients? I think it is i*.

```python
self.register_buffer('ema_concept_text_encs', torch.zeros(num_concepts, dim_input, requires_grad=True))
```
@shileims no it does not, the authors update that using exponential moving average afaict
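For a concrete picture, here is a minimal sketch of tracking a concept text encoding as a gradient-free buffer updated by exponential moving average; the class name and decay value are illustrative, not the repo's actual implementation:

```python
import torch
import torch.nn as nn

class ConceptEMA(nn.Module):
    """Buffers carry no gradients; the tensor is updated in-place by EMA,
    matching the answer above that `ema_concept_text_encs` does not
    require grad."""
    def __init__(self, num_concepts, dim, decay=0.99):
        super().__init__()
        self.decay = decay
        self.register_buffer('ema_concept_text_encs', torch.zeros(num_concepts, dim))

    @torch.no_grad()
    def update(self, concept_id, text_enc):
        # ema <- decay * ema + (1 - decay) * new value
        ema = self.ema_concept_text_encs[concept_id]
        ema.mul_(self.decay).add_(text_enc, alpha=1 - self.decay)
```

Note that `requires_grad=True` inside the `torch.zeros(...)` call has no effect for a registered buffer, which is presumably why the original line looked suspicious.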
hi @lucidrains , @irowberry , Do you have a repo recommendation for integrating current rank-one model editing functions for the dream booth?
After implementing it myself, I found that the same prompt gives exactly the same results across different runs. Any suggestions here? Thanks
Okay, I am not sure what I'm doing wrong, but these are the sorts of images I'm getting.
@irowberry nice! want to share your code? you and @shileims should also get a discord room and review each other's work
i'm in the middle of a project and can't context switch atm, but will try to get back to this this weekend
Here's the code. I'm trying to get it to work with Hugging Face's models, and it sort of does. However, there's an important edit that needs to be made within the diffusers package: in `diffusers/models/attention_processor.py`, in the `Attention` class's `prepare_attention_mask` method, there is a TODO that says something about stable-diffusion-pipelines (line 408). You just need to change `attention_mask` to be `F.pad(attention_mask, (0, target_length - current_length), value=0.0)` in that if branch.
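For anyone making the same edit, this is the padding expression from the comment above applied to a toy mask; the variable names mirror the diffusers method and are assumed, not copied from it:

```python
import torch
import torch.nn.functional as F

# right-pad a (batch, current_length) attention mask out to target_length
# with 0.0, i.e. the extra key positions are treated as masked
attention_mask = torch.ones(2, 5)
target_length = 8
current_length = attention_mask.shape[-1]
padded = F.pad(attention_mask, (0, target_length - current_length), value=0.0)
```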
train_perfusion.txt
@irowberry cool, want to check in that text file into either a gist or repo?
Yeah I just made a repo. Here's the link https://github.com/irowberry/perfusion-training/blob/main/train_perfusion.py
Hi Author, this is a really amazing repo. I check the progress every day and hope to try it soon. Do you have an estimated timeline for finishing the training code? Thank you so much!