rinongal / textual_inversion


Would it be possible to train a multi-vector embedding using masking for features which individual vectors should be perfected for? #77

Closed CodeExplode closed 2 years ago

CodeExplode commented 2 years ago

This may be a bit of rambling craziness from somebody who perhaps doesn't quite understand the machine learning process well enough.

While investigating the possibility of building embeddings for characters which SD can't reproduce well, I've found that it will often perfect one feature before the others, only for those working features to get worse as it overtrains, or as the weights are forced to change to try to improve something else.

A good example is fine details like eyes and mouths, which are unfortunately often blurry in the sample inputs due to the downscaling to 64x64 which is done in SD before it eventually upscales back up to 512x512, and so the embedding correctly learns to draw blurry eyes. Those features can start out quite good, especially if building from an existing initializer word, so it seems that if they could be 'locked' in some way it would make complex embeddings viable with some extra manual specification on the training data. If not by having entire vectors dedicated to them, then the training process could perhaps be told that those features getting worse in masked regions is considered a major failure, if such a constraint on the underlying learning process is even possible (maybe directed to only be tested on images where the feature comes out quite well even after the downscale process, such as closer face images).
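As a rough illustration of that last idea (purely a sketch using the diffusers VAE for convenience; this repo's first-stage model has a different interface, and the names/threshold are my own), you could check whether a feature even survives the 64x64 roundtrip before letting it count towards training:

```python
# Hypothetical filter: only train a feature's vectors on images where that feature
# (e.g. the eyes) still looks reasonable after the encode/decode roundtrip.
import torch
from diffusers import AutoencoderKL  # illustration only, not this repo's first-stage model

@torch.no_grad()
def masked_roundtrip_error(vae: AutoencoderKL, image: torch.Tensor, mask: torch.Tensor) -> float:
    """image: (1, 3, 512, 512) in [-1, 1]; mask: (1, 1, 512, 512), 1 over the feature region."""
    latents = vae.encode(image).latent_dist.mean   # down to the 64x64 latent where detail is lost
    recon = vae.decode(latents).sample              # back up to 512x512
    err = ((recon - image) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
    return err.item()

# e.g. usable = [s for s in samples if masked_roundtrip_error(vae, s.image, s.eye_mask) < threshold]
```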

A lot of Stable Diffusion branches have added masking for inpainting only a region of the input image (with an optional amount of noise to add first), which seems like it could be applied to the embedding learning process quite well, ensuring that an embedding, or a sub-range of vectors within it, is only tested on reproducing, say, a new skin colour on the masked skin. I've managed to add custom prompts per input sample, and ways of randomizing them a bit during testing, so I think it would be very viable to train the embedding alongside phrases like "a photo of @ seen from the side" and "a drawing of @ seen from slightly above" to help find an embedding which works in the context of having other prompts balancing the process in certain ways (I've had some success with that, but am not 100% sure if it's just slowed down the learning process and editability is just dropping more slowly).
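Something like the following is roughly what I picture for the masked part of the loss (just a sketch; the shapes and names are my assumptions, not this repo's training code), i.e. downscale the image mask to latent resolution and only grade the reconstruction inside it:

```python
# Weight the usual diffusion MSE loss by a per-image feature mask, so the embedding
# is only judged on the masked region (e.g. the skin or the eyes).
import torch
import torch.nn.functional as F

def masked_diffusion_loss(noise_pred: torch.Tensor,
                          noise: torch.Tensor,
                          mask_512: torch.Tensor) -> torch.Tensor:
    """noise_pred/noise: (B, 4, 64, 64) latent-space tensors; mask_512: (B, 1, 512, 512), 1 = feature."""
    mask_64 = F.interpolate(mask_512, size=noise_pred.shape[-2:], mode="nearest")
    per_pixel = (noise_pred - noise) ** 2 * mask_64
    return per_pixel.sum() / mask_64.sum().clamp(min=1.0)
```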

Another example is trying to build an embedding for a character with orange skin and a complex blue headpiece: often the skin will be solved early on, but the headpiece takes much longer to be resolved in front/back/side views, and in the process the skin becomes very corrupted, turning into a black and blue face mask.

If inputs were labelled in some format like "0001, a photo of @ from the front on a sunny day, wearing a yellow t-shirt.png" and "0001, mask-skin.png", then it would be somewhat straightforward to match them and pick out a sub-selection of the embedding vectors which are grouped under the label 'skin' in some mapping table. Whether they could be tested exclusively on resolving the area in the mask correctly is unclear to me. While you could build a unique embedding for each masked feature and chain them together in a prompt in the same order that the vectors of a single multi-vector embedding would be used, it seems it would be good to also have some tests of all the relevant vectors in a scene together, where each is unlocked for the possibility of change, before they go back to ensuring their own results are good. (For elements like the eyes, maybe you could even revert back to an older vector, though it seems unlikely that vectors not trained together would work in unison; then again, you can sometimes change a person's eye colour with a prompt, which is kind of the same thing.)
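As a sketch of the matching side (the filename convention and the vector-group table here are purely illustrative, not anything in the repo):

```python
# Pair caption images with their feature masks by shared index prefix, and map each
# mask label to the embedding-vector indices it should constrain.
import os
import re
from collections import defaultdict

# e.g. vectors 0-1 describe skin, 2-3 the headpiece, 4 the eyes (made-up grouping)
VECTOR_GROUPS = {"skin": [0, 1], "headpiece": [2, 3], "eyes": [4]}

def pair_images_and_masks(folder: str):
    samples = defaultdict(lambda: {"caption": None, "image": None, "masks": {}})
    for name in os.listdir(folder):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".png":
            continue
        index, _, rest = stem.partition(", ")       # "0001, ..." -> "0001" and the rest
        mask_match = re.match(r"mask-(\w+)$", rest)
        if mask_match:
            samples[index]["masks"][mask_match.group(1)] = os.path.join(folder, name)
        else:
            samples[index]["image"] = os.path.join(folder, name)
            samples[index]["caption"] = rest        # "a photo of @ from the front ..."
    return samples

# each sample then knows, via VECTOR_GROUPS, which vector indices have a mask to train against
```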

edit: I see now that requires_grad could perhaps be enabled/disabled for vector groups in a given example image, depending on whether they're represented. Now it's just a matter of figuring out if masking can be applied per trained feature, and what effect that might have.
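Since requires_grad is per-tensor rather than per-row, one way I can see of 'locking' individual vectors in practice would be zeroing their gradient rows with a hook (again just a sketch with assumed shapes, not anything from this repo):

```python
# "Lock" all but a chosen subset of vectors in a multi-vector embedding by zeroing
# the gradient rows of the others before the optimizer step.
import torch

embedding = torch.nn.Parameter(torch.randn(5, 768))  # 5 vectors, CLIP-sized (assumed shape)
trainable_rows = {2, 3}  # e.g. only the 'headpiece' vectors stay live for this image

def freeze_locked_rows(grad: torch.Tensor) -> torch.Tensor:
    masked = grad.clone()
    for row in range(grad.shape[0]):
        if row not in trainable_rows:
            masked[row].zero_()
    return masked

hook = embedding.register_hook(freeze_locked_rows)
# ... run the normal training step; only rows 2 and 3 receive updates ...
# hook.remove()  # or swap `trainable_rows` per batch depending on which masks are present
```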

CodeExplode commented 2 years ago

On a side note, I have tried building single-token, multi-vector embeddings which take a whole sentence of initializer words (e.g. 'a man with red skin and a blue headpiece'), in the hope that each vector would provide a framework for features to fall into. Training on a character with orange skin curiously started producing them with actual orange fruit skin and apple-skin lips, as if it had managed to switch the word red for orange in English terms and was then interpreting that as 'orange fruit skin', since the vectors were fed in the same way a chain of natural-language words would be. Locking those vectors at a point where they work might have been enough.
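For reference, the seeding itself is simple enough; this is roughly what I mean by taking a whole sentence of initializer words, sketched with the transformers CLIP classes rather than this repo's own code:

```python
# Seed a multi-vector embedding from a full initializer sentence, one token per vector.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

init_sentence = "a man with red skin and a blue headpiece"
token_ids = tokenizer(init_sentence, add_special_tokens=False).input_ids

with torch.no_grad():
    token_embeds = text_encoder.get_input_embeddings().weight[token_ids]  # (n_tokens, 768)

# the placeholder token then expands to this stack of vectors, each one starting
# from the corresponding natural-language word
embedding = torch.nn.Parameter(token_embeds.clone())
```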

CodeExplode commented 2 years ago

I've tried this in effect by changing the code to always surround the embedding with consistent descriptions of the eyes etc. which it will need to learn to function in conjunction with, but it didn't seem to help; instead it seemed to train the embedding to compensate for those words and still achieve the same result.
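For what it's worth, the change amounted to something like this in the training prompt templates (illustrative names only, not the repo's actual template mechanism):

```python
# Always wrap the placeholder token with a fixed context phrase the embedding must coexist with.
import random

CONTEXT = "with detailed green eyes and a small mouth"  # made-up example phrase
TEMPLATES = [
    "a photo of {placeholder} {context}, seen from the side",
    "a drawing of {placeholder} {context}, seen from slightly above",
]

def training_prompt(placeholder: str = "*") -> str:
    return random.choice(TEMPLATES).format(placeholder=placeholder, context=CONTEXT)
```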