Open shileims opened 1 year ago
hey yea, i'll circle back to it
you should give it a try though. hard part is done and all the pieces are there, for both single and multi-concept
Would you mind giving a little advice on how to run it for us non-developers? Like I understand how to install and all that, but actually running the training isn't going to be easy for me and a lot of others to figure out on our own.
+1 as I am also very interested if you have the time. I'm mostly wondering the broad order of what has to be done, and how we can go about saving the "100kb file" and apply it to a diffusers pipeline.
i would recommend starting with xiao's dreambooth / SD repository first, since that is what the authors used. Yoad also offered to further debug once we had something wired together
at this point, you'll still need to find a developer / researcher / ML engineer. i'll try to get it to the point where it is just a command line when i find time
broad order, for any entry level ML engineer who is looking to up their skills
~1. run a subset of laion dataset texts through the function for generating the input covariance~
~2. determine the super concept id (must be 1 token id) from embedding table~
3. wrap the CLIP instance with `OpenClipEmbedWrapper`, passing it also your `superclass_string`
~4. modify the BPE string to id mapping for your new *concept*~
5. wire up the `Rank1EditModule` into the key / value projections of the cross attention layers
6. pass the embeddings from the `OpenClipEmbedWrapper` through the cross attention, and into the `Rank1EditModule` from step 5
7. use `get_finetune_optimizer` on your wired up stable diffusion instance. it should automatically extract only the small number of finetuneable parameters
8. use the `save` and `load` functions (just pass it in the wired stable diffusion instance)

i realize there's still a lot of steps, i'll try to do a bit more damage next week. just writing that up, i realize for number 6, i could also take care of the whole business of fetching the embedding with the concept substituted with the super class concept in the embedding wrapper.
actually, i'll take care of 6 tomorrow morning. that's probably the remaining tool needed before operating on SD
the input covariance also shouldn't change, so perhaps someone can run this for the clip being used in SD, and we can just check in the tensor, use it as a default. that would save us from passing around this extra matrix
Here's what I got for the input covariance by taking the first 100k captions from the LAION 400m dataset and running it through the function (perhaps it can be verified in the future):
```python
tensor([[ 9.3166e+00, -4.6043e-02, -1.1625e-01,  ...,  6.4781e-02, -1.4543e+00,  8.6031e-01],
        [-4.6043e-02,  1.0780e+01, -2.8464e-01,  ...,  7.0342e-01, -2.0617e-01, -3.6634e-01],
        [-1.1625e-01, -2.8464e-01,  1.0130e+01,  ...,  3.4655e-01, -2.7999e-01,  2.1443e-03],
        ...,
        [ 6.4781e-02,  7.0342e-01,  3.4655e-01,  ...,  1.0315e+01, -6.7360e-01,  1.0505e+00],
        [-1.4543e+00, -2.0617e-01, -2.7999e-01,  ..., -6.7360e-01,  1.2800e+01, -2.2284e+00],
        [ 8.6031e-01, -3.6634e-01,  2.1443e-03,  ...,  1.0505e+00, -2.2284e+00,  1.4173e+01]])
```
Please note there is a slight bug in the calculation function. Its first line needs to be removed, since it tries to process all the data at once instead of batching it, which maxes out memory and crashes.
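The batched fix can be sketched as a running accumulation of the sum of outer products. This is a minimal sketch, not the repo's actual code; `embed_fn` and the function name here are hypothetical stand-ins for the real text-encoding call:

```python
import torch

def input_covariance(embed_fn, texts, batch_size=256):
    """Accumulate E[x xᵀ] over batches so the full dataset never has to
    fit in memory at once (the fix described above). `embed_fn` is assumed
    to map a batch of captions to a (batch, dim) tensor of text encodings."""
    cov = None
    count = 0
    for i in range(0, len(texts), batch_size):
        x = embed_fn(texts[i:i + batch_size])  # (batch, dim)
        outer = x.t() @ x                      # running sum of outer products
        cov = outer if cov is None else cov + outer
        count += x.shape[0]
    return cov / count
```

Since only a `(dim, dim)` accumulator is kept around, memory use is constant in the number of captions.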
I calculated the covariance on the ViT-L-14 CLIP model since that's what SD 1.5 uses. My `OpenClipAdapter` looks like this:

```python
OpenClipAdapter(name="ViT-L-14", pretrained="laion400m_e32", tokenizer_name="ViT-L-14")
```
I saved the output with torch.save and have attached it here. Hope this helps.
@BradVidler amazing Brad! thank you so much 🙏
i'll get this checked in
@BradVidler checked it in for version 0.1.6! we can now scratch the first step off the list haha
also addressed returning the prompt with superclass embed, if in training mode
also added a `save` and `load` function this morning. if you read the file, it should be self explanatory. since the keys are locked to the super class concept output, there is still room to whittle down the size of the saved package even more
will get around to it next week
Would love to contribute where I can, it looks like you're knocking out that list quickly though.
@irowberry sure, you can submit a PR at any time, or try wiring it up to SD 1.5 and give it a go
I believe step 2 can be achieved with the diffusers library like so:

```python
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.to("cuda")

super_concept = "person"
super_concept_token = pipeline.tokenizer.encode(super_concept)[1:-1]
assert len(super_concept_token) == 1, "Super concept token should be 1 token."
print(super_concept_token)
```
I'm using a slice because the first and last tokens are automatically start/end tokens, like this: [49406, 2533, 49407]
If it makes sense, open clip can be used as well with the tokenize function:

```python
from open_clip import tokenizer
tokenizer.tokenize("person")
```
I assumed that going straight to the tokenizer that is loaded with the diffusers model is the better option to ensure we are working with the correct tokenizer version, but both ways end up with the same tokens.
@BradVidler hey Brad! yes, that is correct, and I can take care of that, since a number of text-to-image models all use the same open clip tokenizer. we can just use the one from open_clip
it occurred to me (thanks to your comment) that we can strike off both 2 and 4, by simply having all the prompts include only the superclass concept string. so, if i were trying to perfuse my dog into the weights, i would simply prepare a bunch of prompts with the superclass string `dog`, and at `forward`, substitute `dog` with the new concept id. therefore, no modification of the BPE dictionary is needed

the `EmbeddingWrapper` would simply receive the `superclass_concept_string: str` as well as the `tokenizer_encode: callable`, and when fine tuning, you would just use `dog`
edit: probably a bit confusing, but let me just get the code down and it will be self explanatory
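A rough sketch of the substitution idea described above, with made-up token ids (in practice the new concept id would be managed by the embedding wrapper, not hardcoded like this):

```python
import torch

def substitute_concept(token_ids, superclass_id, concept_id):
    """Replace every occurrence of the superclass token id (e.g. the id
    for 'dog') with the new concept id, so the prompt can be written
    entirely in terms of the superclass and no BPE modification is needed."""
    out = token_ids.clone()
    out[out == superclass_id] = concept_id
    return out

# toy example: 1929 stands in for 'dog'; 49408 is a hypothetical new
# concept id just past the original vocabulary
prompt_ids = torch.tensor([49406, 320, 1125, 539, 320, 1929, 49407])
new_ids = substitute_concept(prompt_ids, superclass_id=1929, concept_id=49408)
```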
https://github.com/lucidrains/perfusion-pytorch/commit/03059d94b0ed3be1fe1d4ce5262ca99b287acdba this should knock off both 2 and 4 above
welcome any code reviews
i'm also going to cross off 8, as you just use the `save` and `load` functions
ok, will get back to this next week. have a happy weekend!
Hi @lucidrains , do you have a plan to complete the training based on SD 1.5? If possible, would you add some instructions here for finishing the code? I have a few days off, so I will try to help here. Thanks
@shileims I've almost finished some VERY rough training code. The last step I'm working on is finding a way to take the output embeddings from the `EmbeddingWrapper` class and pass them through the rest of the text encoder to get the final text encodings that are then passed into the `Rank1EditModule`s.
Hi @irowberry , You are really awesome!
Just completed a successful forward and backward pass. I still need to find a way to pass the modified embeddings into the rest of the text encoder, but hopefully I can train a full test model soon. I counted the number of trainable params and compared it to the total: it's about 14% of all weights (142243008 vs 1001776452). Not sure if this is correct?
hi @irowberry , the paper claims the extra weights come out to only 100kb. Does that hold here?
@irowberry nice! yea, i think weaving the text token ids through the clip text encoder is tricky, as you'll need to do some model surgery to omit the wrapped embedding. maybe i should offer some function where you pass in the path to the text encoder embedding, and it returns the wrapped version, and also substitutes the original one with a nn.Identity()
so 14% does not sound right. the number of parameters being optimized should be minuscule. are you using this function to fetch the parameters for the optimizer?
@lucidrains I am using that function but I did forget to freeze the text encoder layers (whoops), that 14% sounded too big to me too. Now it's just 1.9% of the total parameters (19182528 / 1001776452).
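A small sketch of the freeze-and-count check discussed above, on a toy model (module names like `concept_embed` are hypothetical stand-ins, not the repo's actual modules):

```python
import torch.nn as nn

def freeze_except(model, trainable_keywords):
    """Freeze all parameters except those whose name contains one of the
    given keywords, then report trainable vs total parameter counts."""
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in trainable_keywords)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# toy model standing in for the wired-up SD + text encoder
model = nn.ModuleDict({
    'text_encoder': nn.Linear(64, 64),   # should be frozen
    'concept_embed': nn.Embedding(1, 64) # the tiny finetuneable part
})
trainable, total = freeze_except(model, ['concept_embed'])
```

Printing `trainable / total` for the real model is a quick sanity check that only the small perfusion subset is being optimized.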
@irowberry 🚀 bring it home
@irowberry added yet another wrapper that should help
@irowberry @lucidrains , by using the `save` function in the save_load.py file, I get 50304 trainable parameters out of 1066261035 total. The .pth file is 210KB without C_inv and 2514KB with it. Any suggestions here? The paper says 100KB...........
@shileims close enough! I believe it is because I am redundantly saving the concept outputs for the keys across all cross attention layers. since it is locked, it only needs to be stored once
I can work on whittling down the size once / if you see some initial results
Hi @lucidrains , thanks for your reply. I will try to hookup the pipeline. Thanks
Thoughts on loss functions? I'm using MSE to compare the noise to the predicted noise:

```python
loss = F.mse_loss(pred, target).mean()
```
However, I've seen a lot of models use the mask as well, I'm not sure which one to use for this.
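For reference, a hedged sketch of both options: the masked variant below simply reweights the per-pixel MSE by a segmentation mask, which may differ from the exact weighting a given paper uses:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(pred, target, image_mask=None):
    """Plain noise-prediction MSE; optionally modulate the per-pixel loss
    by a segmentation mask so only the masked region drives the gradient.
    The masked form is one common choice, not necessarily the paper's."""
    if image_mask is None:
        return F.mse_loss(pred, target)
    per_pixel = F.mse_loss(pred, target, reduction='none')
    return (per_pixel * image_mask).sum() / image_mask.sum().clamp(min=1)
```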
Follow-up question: how should `text_mask`, `attention_mask` and `image_mask` be used in the pipeline? I am a little bit confused.
@irowberry just stick with the normal loss for now until you see some signal
we can see how good the technique is on its own
@shileims yea, masking is the most confusing part of learning attention
there is only one mask you need to worry about. the `text_mask` becomes the `key_padding_mask` in the cross attention layer (as the text tokens themselves become the keys / values). in other words, `text_mask` and the `attention_mask` are the same

the `image_mask` is something different. it is a segmentation mask outputted by another pretrained model, meant to help better guide the model during fine tuning by modulating the loss. basically what @irowberry asked about in the comment above. i say just ignore that for now
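A minimal sketch of what "the text_mask becomes the key padding mask" means, with toy shapes (this is illustrative, not the repo's attention code):

```python
import torch

def cross_attend(q, k, v, text_mask):
    """Minimal cross attention with a key padding mask.
    q: (b, n, d) image queries; k, v: (b, m, d) text keys / values;
    text_mask: (b, m) bool, False at padded text positions. Padded
    positions are masked out before the softmax, so they never
    contribute to the output."""
    sim = (q @ k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    sim = sim.masked_fill(~text_mask[:, None, :], float('-inf'))
    return sim.softmax(dim=-1) @ v
```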
hi @lucidrains , but in the `MemoryEfficientCrossAttention` class from the official SD repo, the mask is expected to be None.
@shileims oh that's interesting
how about the regular cross attention class?
Hi @lucidrains yes, the regular cross attention class copes with the mask.
Should we switch to regular cross attention?
@shileims yes, i think so
i'm actually surprised the memory efficient cross attention would work without masking, unless it is intended for inference with one batch sample at a time (no key padding)
hi @lucidrains , does `ema_concept_text_encs` require gradients? I think it is i*.

```python
self.register_buffer('ema_concept_text_encs', torch.zeros(num_concepts, dim_input, requires_grad=True))
```
@shileims no it does not, the authors update that using exponential moving average afaict
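For a concrete picture, here is a minimal sketch of tracking a concept text encoding as a gradient-free buffer updated by exponential moving average; the class name and decay value are illustrative, not the repo's actual implementation:

```python
import torch
import torch.nn as nn

class ConceptEMA(nn.Module):
    """Buffers carry no gradients; the tensor is updated in-place by EMA,
    matching the answer above that `ema_concept_text_encs` does not
    require grad."""
    def __init__(self, num_concepts, dim, decay=0.99):
        super().__init__()
        self.decay = decay
        self.register_buffer('ema_concept_text_encs', torch.zeros(num_concepts, dim))

    @torch.no_grad()
    def update(self, concept_id, text_enc):
        # ema <- decay * ema + (1 - decay) * new value
        ema = self.ema_concept_text_encs[concept_id]
        ema.mul_(self.decay).add_(text_enc, alpha=1 - self.decay)
```

Note that `requires_grad=True` inside the `torch.zeros(...)` call has no effect for a registered buffer, which is presumably why the original line looked suspicious.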
hi @lucidrains , @irowberry , Do you have a repo recommendation for integrating current rank-one model editing functions for the dream booth?
After implementing it myself, I found that the same prompt gives exactly the same results across different runs. Any suggestions here? Thanks
Okay, I am not sure what I'm doing wrong, but these are the sorts of images I'm getting.
@irowberry nice! want to share your code? you and @shileims should also get a discord room and review each other's work
i'm in the middle of a project and can't context switch atm, but will try to get back to this this weekend
Here's the code. I'm trying to get it to work with Hugging Face's models, and it sort of does. However, there's an important edit that needs to be made within the diffusers package: in `diffusers/models/attention_processor.py`, in the `Attention` class's `prepare_attention_mask` method, there is a TODO that says something about stable-diffusion-pipelines (line 408). You just need to change `attention_mask` to be `F.pad(attention_mask, (0, target_length - current_length), value=0.0)` in that if branch.
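For anyone making the same edit, this is the padding expression from the comment above applied to a toy mask; the variable names mirror the diffusers method and are assumed, not copied from it:

```python
import torch
import torch.nn.functional as F

# right-pad a (batch, current_length) attention mask out to target_length
# with 0.0, i.e. the extra key positions are treated as masked
attention_mask = torch.ones(2, 5)
target_length = 8
current_length = attention_mask.shape[-1]
padded = F.pad(attention_mask, (0, target_length - current_length), value=0.0)
```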
train_perfusion.txt
@irowberry cool, want to check in that text file into either a gist or repo?
Yeah I just made a repo. Here's the link https://github.com/irowberry/perfusion-training/blob/main/train_perfusion.py
Hi Author, this is a really amazing repo. I check the progress every day and hope to try it soon. Do you have an estimated timeline for finishing the training code? Thank you so much!