
Perfusion - Pytorch

Implementation of Key-Locked Rank One Editing, from Nvidia AI. Project page

The selling point of this paper is its extremely low number of extra parameters per added concept, down to ~100KB.
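As a rough sanity check on that number, here is a hedged back-of-the-envelope count. The projection dims match the usage example below, but the per-projection storage (one input direction plus one output direction) and the count of wrapped projections are assumptions for illustration, not figures from the paper.

# back-of-the-envelope estimate, not taken from the paper
dim_text, dim_inner = 768, 320                 # CLIP text dim -> hypothetical cross attention inner dim

# a rank-1 update stores one input direction and one output direction
params_per_projection = dim_text + dim_inner   # 1088

num_projections = 32                           # hypothetical count of wrapped key/value projections in the unet

total_kb = params_per_projection * num_projections * 4 / 1024   # float32 bytes -> KB
print(total_kb)                                # ~136 KB, same order of magnitude as the claimed ~100KB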

It seems they successfully applied the rank-1 editing technique from a memory editing paper for LLMs, with a few improvements. They also identified that the keys determine the "where" of the new concept, while the values determine the "what", and propose local / global key-locking to a superclass concept (while learning the values).
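For intuition, here is a minimal sketch of a generic rank-1 weight edit, the basic idea those papers build on; this is not the exact gated update rule from the Perfusion paper.

import torch

W = torch.randn(320, 768)       # frozen projection weight
k = torch.randn(768)            # input (key) direction for the concept
v_target = torch.randn(320)     # desired output (value) for that input

# rank-1 update so the edited weight maps k to v_target,
# while directions orthogonal to k are left untouched
W_edited = W + torch.outer(v_target - W @ k, k) / k.dot(k)

assert torch.allclose(W_edited @ k, v_target, atol = 1e-4)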

For researchers out there, if this paper checks out, the tools in this repository should work for any other text-to-<insert modality> network that uses cross attention conditioning. Just a thought.

Appreciation

Install

$ pip install perfusion-pytorch

Usage

import torch
from torch import nn

from perfusion_pytorch import Rank1EditModule

to_keys = nn.Linear(768, 320, bias = False)
to_values = nn.Linear(768, 320, bias = False)

wrapped_to_keys = Rank1EditModule(
    to_keys,
    is_key_proj = True
)

wrapped_to_values = Rank1EditModule(
    to_values
)

text_enc = torch.randn(4, 77, 768)                  # regular input
text_enc_with_superclass = torch.randn(4, 77, 768)  # init_input in algorithm 1, for key-locking
concept_indices = torch.randint(0, 77, (4,))        # index where the concept or superclass concept token is in the sequence
key_pad_mask = torch.ones(4, 77).bool()             # not used below; would be passed to cross attention as the key padding mask

keys = wrapped_to_keys(
    text_enc,
    concept_indices = concept_indices,
    text_enc_with_superclass = text_enc_with_superclass,
)

values = wrapped_to_values(
    text_enc,
    concept_indices = concept_indices,
    text_enc_with_superclass = text_enc_with_superclass,
)

# after much training ...

wrapped_to_keys.eval()
wrapped_to_values.eval()

keys = wrapped_to_keys(text_enc)

values = wrapped_to_values(text_enc)
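For orientation, here is a hedged sketch of what the elided training could look like. diffusion_loss is a hypothetical stand-in for the denoising objective of the surrounding text-to-image model; only the standard torch.optim API is assumed.

# optimize only the parameters the wrappers leave trainable
optimizer = torch.optim.Adam(
    [p for m in (wrapped_to_keys, wrapped_to_values) for p in m.parameters() if p.requires_grad],
    lr = 1e-4
)

for _ in range(1000):
    keys = wrapped_to_keys(text_enc, concept_indices = concept_indices, text_enc_with_superclass = text_enc_with_superclass)
    values = wrapped_to_values(text_enc, concept_indices = concept_indices, text_enc_with_superclass = text_enc_with_superclass)

    loss = diffusion_loss(keys, values)   # stand-in for the real objective
    loss.backward()

    optimizer.step()
    optimizer.zero_grad()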

The repository also contains an EmbeddingWrapper that makes it easy to train on a new concept (and eventually run inference with multiple concepts).

import torch
from torch import nn

from perfusion_pytorch import EmbeddingWrapper

embed = nn.Embedding(49408, 512) # open clip embedding, somewhere in the module tree of stable diffusion

# wrap it, and it will automatically create a new concept for learning, based on the superclass embed string

wrapped_embed = EmbeddingWrapper(
    embed,
    superclass_string = 'dog'
)

# now just pass in your prompts with the superclass id

embeds_with_new_concept, embeds_with_superclass, embed_mask, concept_indices = wrapped_embed([
    'a portrait of dog',
    'dog running through a green field',
    'a man walking his dog'
]) # (3, 77, 512), (3, 77, 512), (3, 77), (3,)

# now pass both embeds through clip text transformer
# the embed_mask needs to be passed to the cross attention as key padding mask
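To make that last comment concrete, here is a minimal sketch using torch's built-in nn.MultiheadAttention as a stand-in for the actual cross attention layers; it skips the text transformer step, so the shapes are illustrative only.

cross_attn = nn.MultiheadAttention(512, 8, batch_first = True)   # stand-in cross attention

image_tokens = torch.randn(3, 1024, 512)                         # hypothetical unet feature tokens

attended, _ = cross_attn(
    image_tokens,                   # queries come from the image side
    embeds_with_new_concept,        # keys from the text side
    embeds_with_new_concept,        # values from the text side
    key_padding_mask = ~embed_mask  # assuming embed_mask is True on valid tokens; MultiheadAttention masks where True
)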

If you can identify the CLIP instance within the stable diffusion instance, you can also pass it directly to OpenClipEmbedWrapper, whose forward pass returns everything the cross attention layers need.

ex.

from perfusion_pytorch import OpenClipEmbedWrapper

texts = [
    'a portrait of dog',
    'dog running through a green field',
    'a man walking his dog'
]

wrapped_clip_with_new_concept = OpenClipEmbedWrapper(
    stable_diffusion.path.to.clip,
    superclass_string = 'dog'
)

text_enc, superclass_enc, mask, indices = wrapped_clip_with_new_concept(texts)

# (3, 77, 512), (3, 77, 512), (3, 77), (3,)

Todo

Citations

@article{Tewel2023KeyLockedRO,
    title   = {Key-Locked Rank One Editing for Text-to-Image Personalization},
    author  = {Yoad Tewel and Rinon Gal and Gal Chechik and Yuval Atzmon},
    journal = {ACM SIGGRAPH 2023 Conference Proceedings},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:258436985}
}
@inproceedings{Meng2022LocatingAE,
    title   = {Locating and Editing Factual Associations in GPT},
    author  = {Kevin Meng and David Bau and Alex Andonian and Yonatan Belinkov},
    booktitle = {Neural Information Processing Systems},
    year    = {2022},
    url     = {https://api.semanticscholar.org/CorpusID:255825985}
}