rinongal / textual_inversion

Got weird results, not sure if I missed a step? #35

Open altryne opened 1 year ago

altryne commented 1 year ago

Hey @rinongal thank you so much for this amazing repo.

I trained for over 10K steps, I believe, with around 7 images (trained on my face), using this colab.

I then used those .pt files when running the SD version right in the colab, and a weird thing happens: when I mention * in my prompts, I get results that look identical to the photos in style, but it does try to ... draw the objects.

For example: [image: CleanShot 2022-08-29 at 14 04 48@2x]

Prompt was "portrait of joe biden with long hair and glasses eating a burger, detailed painting by da vinci" and "portrait of * with long hair and glasses eating a burger, detailed painting by da vinci"

So SD added the glasses and the eating pose, but completely disregarded the detailed painting, da vinci, and the style.

What could be causing this? Any idea? 🙏

rinongal commented 1 year ago

Hey!

The most likely candidate is just our SD version, which isn't officially released yet precisely because it's not behaving well under new prompts :) It's placing too much weight on the new embedding and too little on the other words. We're still trying to work that out; it wasn't as simple a port from LDM as we hoped. If this is the issue, you can try to work around it by repeating the parts of the prompt that it ignores, for example by adding "in the style of da vinci" again at the end of the prompt.

With that said, if you want to send me your images, I'll try training a model and seeing if I can get it to behave better.

altryne commented 1 year ago

Thank you! I'll try putting more weight on the later keywords! I don't think my images are anything special or important to test with; I just took a few snapshots of myself and cropped them to 512x512.

ExponentialML commented 1 year ago

One thing I found that helps and/or fixes this scenario is using periods in your prompts, not commas as in the original SD repo. This may or may not be a bug.

So this: portrait of * with long hair and glasses eating a burger, detailed painting by da vinci

Should become this: portrait of * with long hair and glasses eating a burger. detailed painting by da vinci.

If you trained on one token, you could possibly add weight by doing something like portrait of * * ...rest as well, but you'll drift further away from the rest of your prompt.

ThereforeGames commented 1 year ago

If you're using the web UI (i.e. this repo: https://github.com/hlky/stable-diffusion-webui ), you can assign weight to certain tokens like so:

A photo of *:100 smiling.

I frequently have to do this with the finetuned object, sometimes using astronomical values like 1000+. This can greatly improve likeness. You may also need to adjust classifier guidance and denoise strength. All of these parameters do impact each other, and changing one often means needing to re-calibrate the rest.

Anyhow, you can try applying strength to the part of the prompt that SD is ignoring. Something like this:

portrait of * with long hair and glasses eating a burger, detailed painting:10 by da vinci:10

altryne commented 1 year ago

If you're using the web UI

I'm one of the maintainers in charge of the frontend part but TBH I haven't yet added my own checkpoints to the webui! Will do that tomorrow

I will def try this! Thank you

oppie85 commented 1 year ago

I've found limited success in "diluting" the new token by making the prompt more vague - for example, "a painting of *" results in pretty much the same image as just "*" on its own, but "a painting of a man who looks exactly like *" does (sometimes) work in successfully applying a different style. Adding weights to the tokens as others have described also works, although it requires constant tweaking.

I don't know if it would be technically possible to test for style transfer during the training/validation phase; for example, on top of the 'preset' prompts that are used on the photos in the dataset, you would have a separate list of prompts like "A painting of *" that would be used to verify that an image generated with that prompt also scores high on the 'painting' token. In the DreamBooth paper, they describe combating overfitting (which I guess is causing these issues) by also training 'negatively' - something I've tried to crudely replicate by including prompts without the "*" in the list of predefined ones, but I don't think this actually does anything, since the mechanisms behind DreamBooth and Textual Inversion are very different.

1blackbar commented 1 year ago

If you guys have a single instance of successfully finetuning a photo likeness of a human being into SD with this code, please share. I've yet to see that, and I'm almost sure this code is not meant to "inject" your own face into the SD model as people might think.

ThereforeGames commented 1 year ago

If you guys have a single instance of successfully finetuning a photo likeness of a human being into SD with this code, please share

I won't be sharing my model at this time, but I can tell you that this method is indeed capable of pulling off a convincing headswap under the right conditions:

Hope that helps.

1blackbar commented 1 year ago

Well... I already heard that; it's not saying much without a comparison of the actual photo and the SD output. Even the paper doesn't have results with human subjects. Some people claimed to do it, but then I looked at the pics and the SD output was not the person that's in the training images. Vaguely, yes, it was the same skin colour and a similar haircut, but the proportions of the face against the nose and lips were all mixed up from result to result. So I stand by what I wrote: this method so far is not capable of finetuning a human likeness and synthesizing it in SD, until proven otherwise. I don't mind training for a long time; I just want to know if I'll be wasting my time and blocking a GPU for nothing if I'll never be able to get at least 90% likeness. Almost all, if not all, results I've seen look like derivatives/mutations of the subjects and not like the actual subject. Identity loss is one of the biggest issues in face synthesis and restoration; few have managed to solve it. I trained 3-4 subjects with about 30k iterations each, and the results were not successful (well, it did "learn" them, but they looked like mutations of the subjects), apart from one bigger success training a style. So for now I'd wait until I see someone pushing finetuning and proving it can be done and that you can synthesize a finetuned face that looks like the one in the original images.

oppie85 commented 1 year ago

Here's what you can try to verify that textual inversion can create a convincing likeness. First of all, train at 256x256 pixels with larger batch sizes; depending on your GPU you can easily train 4x as fast, so you'll see results sooner. The downside is that only the ddim sampler really works with the final result, but I feel that's an acceptable tradeoff if your main goal is just to check whether it's even possible. Also bump up num_vectors_per_token a bit; if you're not worried about overfitting you can even push it to ridiculous levels like 256 (edit: I've now learned that putting this higher than 77 is useless because SD has a limit of 77 tokens per input). The result is that you'll get a convincing likeness much faster, but it'll never deviate much from the original photos and style transfer may be impossible.
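
For reference, here is a rough sketch of what those two changes (lower training resolution, more vectors per token) could look like in v1-finetune.yaml. Only the relevant keys are shown; the surrounding structure is taken from the config excerpts quoted later in this thread, so double-check it against your own file:

    model:
      params:
        personalization_config:
          target: ldm.modules.embedding_manager.EmbeddingManager
          params:
            placeholder_strings: ["*"]
            initializer_words: ["face"]
            num_vectors_per_token: 16   # raised from the stock value; captures likeness faster

    data:
      target: main.DataModuleFromConfig
      params:
        batch_size: 4                   # larger batches become feasible once the resolution drops
        train:
          target: ldm.data.personalized.PersonalizedBase
          params:
            size: 256                   # train on 256x256 crops instead of 512x512
            set: train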

I've fiddled a lot with all kinds of parameters and have gotten results that are all over the place; with the 256x256 method I can iterate pretty quickly but the end result is always overfitting. For example, most of the photos I used were in an outdoors setting and textual inversion thus inferred that being outdoors was such a key feature that it'd try to replicate the same outdoor settings for every generation. I thought that maybe adding A * man outdoors (and variations) would help in separating the location from the token, but I feel that it only reinforces it because now generated images that are in an outdoors setting score even higher on matching the prompt.

I think that's largely where the problem lies; apart from the initial embedding from the 'initializer word', there's no way to 'steer' training towards a particular subject. When using a conditioning prompt like A * man outdoors with a red shirt the conditioning algorithm doesn't know that it can disregard the "red shirt" part and that it should focus on the magic * that makes the difference between the encoding of a regular man and myself. I don't know if it would be possible to basically train on two captions for each image; for example, we apply a * man outdoors in a red shirt and a man outdoors in a red shirt (without the *) and then take only the difference in the encoding instead of the entire thing.
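
To make that last idea concrete, here is a very rough, untested sketch of how the "difference only" notion could be probed, using the model.get_learned_conditioning call that appears in the traceback further down this thread. It is an illustration of the concept, not something this repo implements:

    import torch

    # `model` is assumed to be an already-loaded LatentDiffusion instance, as used
    # by the repo's inference scripts. The two prompts are identical except for the
    # placeholder token.
    with torch.no_grad():
        c_with    = model.get_learned_conditioning(["a * man outdoors in a red shirt"])
        c_without = model.get_learned_conditioning(["a man outdoors in a red shirt"])

    # Both tensors are roughly [1, 77, 768]. The extra placeholder token shifts the
    # alignment of later tokens, so a per-token subtraction is only a crude probe;
    # mean-pooling over the sequence sidesteps that.
    delta = c_with.mean(dim=1) - c_without.mean(dim=1)   # crude "what did * add" direction
    print(delta.norm())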

ExponentialML commented 1 year ago

The two things that had the most success for me are:

  1. Replace the template string with a single {} (see the sketch right after this list)
  2. Make sure you're using the sd-v1-4-full-ema.ckpt
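
Regarding point 1, the edit would look roughly like this in ldm/data/personalized.py; the exact name of the template list in your checkout may differ, so treat it as a sketch rather than a drop-in patch:

    # ldm/data/personalized.py (sketch)
    # The stock file defines many caption templates such as "a photo of a {}",
    # "a rendering of a {}", etc. Replacing them all with a bare "{}" means the
    # only conditioning text seen during training is the placeholder itself.
    imagenet_templates_small = [
        '{}',
    ]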

I'm almost positive that the reason for overfitting in SD is that the conditioning scheme is far too aggressive. Simply letting the model condition itself on the single init word alone is sufficient in my opinion, and it has always led to better results for me.

What's funny is that you end up staying close to Stable Diffusion's ethos of heavy prompting, because conditioning this way means you have to come up with the correct prompt at inference time, rather than letting the conditioned templates do the work.

Even if you have low confidence in this method, I say it's most certainly worth looking into. I'm also certain that PTI integration will mitigate a lot of these issues (it's a very cool method for inversion if you haven't looked into it).

1blackbar commented 1 year ago

Well, I just fed it 2 pics of Stallone, and I'm closer than I ever was with any face after 1500 iters, but it's 256 size and 50 vectors, with two init words: face, photo. So I have a plan: once it reaches the likeness of the reconstruction images, I will feed it 512 images. Can I swap sizes like that when continuing finetuning, from 256 to 512? [image: samples_scaled_gs-002000_e-000010_b-000000_00000021]

But I must say the reconstruction at 256 res is not looking too good; it lost the likeness a bit. This one looks better at 512 res. The image at the bottom is a reconstruction, not an actual sample; it's how the model interpreted the original image, and it trains from this:

[image: reconstruction_gs-001000_e-000005_b-000000_00000008]

oppie85 commented 1 year ago

I'd say 2 photos is actually not enough for training a likeness; I use around 10-20 pictures for my experiments. For the 256x256 method it works best to mix in a few extreme closeups of the face so that the AI can learn the finer details. I don't actually know if starting at 256x256 and then resuming at 512x512 is possible - I think it should be, though, because that's how SD was trained in the first place. For init words, I don't think "photo" is very good - I'm using "man" and "face" for that purpose - because those are the things that I want the AI to learn. Nevertheless, 1500 iterations isn't very much. I usually get the best results at around 3000.

1blackbar commented 1 year ago

Yes, I'll try that. It's also strange that I can't fit a batch size of 2 with 11GB of VRAM at 256 res. Does batch size affect the training? I think if it sees more images at once it learns better? If that's the case I'd try Colab Pro. I also tried "man"/"face", but I wanted it to know that it's a photo version, a photo style, so that it might be easier to edit with styles. I have a tight close up of the face (jaw to chin) so I can show it the likeness better at that res now. I noticed that in SD you lose likeness in a medium shot, but in a macro close up you get the best likeness of a person.

Well... I'm quite impressed now; it's barely started and that's the result at epoch 4 and 1500 iters. How many epochs do you recommend? Sorry to hijack like this, but I'm sure more people will come, so I think this could be useful for them to read. [image: samples_scaled_gs-001500_e-000003_b-000300_00000016]

OK, so far from what I see... you should have mostly macro face close ups to get the best identity: no ears visible besides one image like that Stallone pic above; the rest should be very tight close ups of the face, probably even tighter than this one below.

[image: s3]

I'll try to resume and give it an even tighter one, or start over with only tight macro shots of the face, because I'm training mostly the face and 256 is a bit low.

Wow, this is pretty good, way above my expectations. [image: samples_scaled_gs-002000_e-000005_b-000000_00000021] Oh crap, this side shot looks too good; I wonder how editability will work. [image: samples_scaled_gs-004000_e-000010_b-000000_00000041] OK... I think that proves it: you can actually train a human face and retain identity. This result is beyond what I expected, and it has barely started finetuning. [image: samples_scaled_gs-002500_e-000001_b-001000_00000026]

OK, so if anyone wants to get good results - drop the resolution to 256 at the bottom of the yaml file:
    train:
      target: ldm.data.personalized.PersonalizedBase
      params:
        size: 256

Also use the init word "face" and then actually give it face shots, not head shots. I got most images like this one, and maybe two of the whole head, but the majority are framed from eyebrows to lower lip. [image: q8]

OK, final result, 11k iterations. I almost fell off my chair when I saw this result. Most of my images were from hairline to jawline, 2 images of the full head, 10 images overall. [image: samples_scaled_gs-006500_e-000004_b-000500_00000066]

ExponentialML commented 1 year ago

~~Well, this certainly is an interesting discovery.~~

~~So this could theoretically prove that you need to fine tune on the base resolution Stable Diffusion was trained on, and not the upscaled res (512). Either way this shouldn't have caused the issues people have been having at the higher resolution, so I wonder why this is? I'll have to read through the paper again to figure it out.~~

Edit: Tested this and figured I'm wrong here. It simply allows for better inversion, which the model is fully capable of. The real issue is adding prompts to the embeddings, which is still WIP.

altryne commented 1 year ago

drop resolution to 256 on bottom of yaml file

With the training images also resized to 256?

1blackbar commented 1 year ago

@altryne Is it because of the 50 vectors that I used, or because of the 256 res drop? Which one is more responsible for this? I restarted tuning with it at 1 vector; compared to 50 vectors, I'd say the vector count makes the most difference. But what's the downside of using so many vectors? What's the most sane amount I can use and still get reasonable editability? You can pretty much tell from the first 3 samples that you will get likeness; now I'm trying 20 vectors.

AUTOMATIC1111 commented 1 year ago

So does anyone here know how to properly work with this? This is a [50, 768] tensor. All embeddings I've seen before are [1, 768]. Are you supposed to insert all 50 into the prompt, taking up 50 of the available 75 tokens? All the code that I've seen fails to actually use this embedding, including this repository, failing with this error:

Traceback (most recent call last):
  File "stable_txt2img.py", line 287, in <module>
    main()
  File "stable_txt2img.py", line 241, in main
    uc = model.get_learned_conditioning(batch_size * [""])
  File "B:\src\stable_diffusion\textual_inversion\ldm\models\diffusion\ddpm.py", line 594, in get_learned_conditioning
    c = self.cond_stage_model.encode(c, embedding_manager=self.embedding_manager)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 324, in encode
    return self(text, **kwargs)
  File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 319, in forward
    z = self.transformer(input_ids=tokens, **kwargs)
  File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 297, in transformer_forward
    return self.text_model(
  File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 258, in text_encoder_forward
    hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids, embedding_manager=embedding_manager)
  File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\encoders\modules.py", line 183, in embedding_forward
    inputs_embeds = embedding_manager(input_ids, inputs_embeds)
  File "B:\soft\Python38\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "B:\src\stable_diffusion\textual_inversion\ldm\modules\embedding_manager.py", line 101, in forward
    embedded_text[placeholder_idx] = placeholder_embedding
RuntimeError: shape mismatch: value tensor of shape [50, 768] cannot be broadcast to indexing result of shape [0, 768]

I manually inserted those 50 embeddings into the prompt in order, and I am getting pictures of Stallone, but they all seem very same-y, which to me looks similar to overfitting, but I don't know if it's that or me incorrectly working with those embeddings.

Here are 9 pics, all with different seeds: [image: grid-0000-3166629621]

oppie85 commented 1 year ago

You also have to update num_vectors_per_token in v1-inference.yaml to the same value you trained with.
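
Concretely, that means editing the personalization_config block in v1-inference.yaml so it matches training. A sketch, with the structure borrowed from the training config quoted later in this thread (verify against your own file):

    personalization_config:
      target: ldm.modules.embedding_manager.EmbeddingManager
      params:
        placeholder_strings: ["*"]
        initializer_words: ["face"]
        per_image_tokens: false
        num_vectors_per_token: 50   # must match the value used during training
        progressive_words: False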

With 50 vectors per token, extreme overfitting is to be expected; I'm currently trying to find the right balance between the very accurate likeness with many vectors and the more varied results of fewer vectors. The codebase also contains an idea of 'progressive words' where new vectors get added as training progresses which might be interesting to explore.

Oh, also: a .pt file trained on 256x256 images only really works well with the ddim sampler; given enough vectors it'll look "acceptable" with k_lms, but if you want the same quality you got with the training samples, use ddim.

Another thing I've been experimenting with is different initializer words per vector - for example, I set num_vectors_per_token to 4 and then pass "face", "eyes", "nose", "mouth" as the initializer words, in the hope that each vector will focus on that one specific part of the likeness. So far I'm not sure I'd call it a success, but at this point I'm just throwing every random idea I get at it.

AUTOMATIC1111 commented 1 year ago

Ah. That did the trick, thank you. If anyone cares, here's 9 images produced by this repo's code on the stallone embedding:

[image: res]

DDIM. Previous pic I posted was using euler ancestral from k-diffusion.

I used just * as prompt in both cases.

1blackbar commented 1 year ago

You also have to update num_vectors_per_token in v1-inference.yaml to the same value you trained with.

Another thing I've been experimenting with is different initializer words per vector - for example, I set num_vectors_per_token to 4 and then pass "face", "eyes", "nose", "mouth" as the initializer words, in the hope that each vector will focus on that one specific part of the likeness.

I'm currently quick-testing whether I can still edit a style when using 5 vectors. Are the cloned heads the result of the 256 training? Can I resume training and change it to 512, or will it start over from 0 after I change to 512? Also, did spreading 4 vectors across 4 init words help? Maybe I made a mistake by using "face, photo" as init words and it pushed him deep into the photo realm; I will try a vague "male".

OK, with 5 vectors I managed to rip a person out of photo style into anime style, but it's very hard; it needs more repetitions of "anime style" than usual. So I'd say 5 is already too much, yet with 5 the likeness is poor... so that's that. I think with all 77 vectors you will get great likeness right away, but there won't be any room left for editability. I'll try training for a short time with the highest vector count, then I'll try to spread the init words while using high vectors. I will also try another method: using more precise init words like lips, cheeks, nose, nostrils, eyes, eyelids, chin, jawline, whatever I can find, with high vectors; maybe it will spread into the details more and leave the style up to editing.

1blackbar commented 1 year ago

Overwhelming the overfitting with the prompt: from what I see, if you use 50 vectors, you've just spent 50 words of the prompt on your subject being a photograph of a man, so you have something like 27 left to skew it into a painting or a drawing? So you have to overwhelm it hard to change the style. Or it might be that you have to use over 50 words to overwhelm it; there's definitely a ratio, because I can overwhelm low-vector results faster. This is 50 vectors:

[image: 00796-3972612447-ilya_repin!!__oil_painting_,_centered_macro_head_shot_of__character__as_soldier_character_by_ilya_repin_,_painting,_hires,detail]

[image: 00793-2276069365-ilya_repin!!__oil_painting_,_macro_head_shot_of__character__as_soldier_character_by_ilya_repin_,_painting,_hires,detailed__ilya_]

altryne commented 1 year ago

Try playing with prompt weights in the webui?

1blackbar commented 1 year ago

Started over: 2 vectors, 256 res. It's at epoch 36 and 48k iters. Will it be more editable than 50 vectors? We will see. I don't like the mirroring thing; how do I turn it off? His face is not identical when flipped. [image: samples_scaled_gs-047000_e-000036_b-000200_00000471]

OK, after testing for editability, the 50-vector one is the better way: it takes about the same amount of overwhelming to edit the style of the 50-vector one as it does the 2-vector one, but it takes about an hour to train 50 vectors and about 8 hours to train 2 vectors to a satisfying identity of the subject on a 1080 Ti. Training at 512 on an 11GB 1080 Ti is a waste of time; go with 256 res. Maybe it's a VRAM and batch-size thing, but you won't get likeness at 512, not in one day anyway. I guess overfitting is just a thing we have to live with for now; identity preservation is way more important IMO. [image: 00926-3513507886-classic_oil_painting_by_ilya_repin!!!__detailed_oil_painting__of_character_sly__as_rambo_in_the_stye_of__ilya_repin_,_intricate_]
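
On the mirroring question above: the flipped samples come from the horizontal-flip augmentation in the training data loader. A hedged sketch of how it could be disabled, assuming the flip_p argument of PersonalizedBase in ldm/data/personalized.py is passed through from the data config (check your checkout before relying on this):

    data:
      target: main.DataModuleFromConfig
      params:
        batch_size: 1
        train:
          target: ldm.data.personalized.PersonalizedBase
          params:
            size: 256
            set: train
            flip_p: 0.0   # assumed parameter; 0.0 disables RandomHorizontalFlip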

bmaltais commented 1 year ago

This is really interesting... I would like to ask: how do you resume training? I have been looking around for how to do that and can't find the answer. An example would be appreciated.

EDIT: Found my answer here: https://github.com/rinongal/textual_inversion/issues/38

1blackbar commented 1 year ago

Got around overfitting; that's not an issue anymore. Go with as many vectors as you like to speed up training. Got a new subject to train on. Style change is not an issue at all; it adapts even to cartoon styles. Res 448; will do 512 later on. You can also control the emotions of the face to make it smile. [image: 00103-1816228968-image_of_blee]

[images: 00229-2336580884-image_of__blee, 00345-1400431675-image_of__blee, 00462-3494633885-johnny_blee, 00253-3852576016-image_of__blee, 00300-3887535150-image_of__blee, 00167-189900867-image_of_smiling_happy_blee]

dboshardy commented 1 year ago

@1blackbar how did you resolve the overfitting?

oppie85 commented 1 year ago

@1blackbar - looks great! Can you share what method you used to achieve this?

hopibel commented 1 year ago

Looks like they're doing some sort of face swapping/inpainting rather than generating the whole image from scratch

CodeExplode commented 1 year ago

When generating an embedding with more than 1 vector, is it possible to delete vectors and see what the difference is? Maybe training with a high vector count would be good if we could then remove the ones which seem to be associated with features we don't want.

ExponentialML commented 1 year ago

When generating an embedding with more than 1 vector, is it possible to delete vectors and see what the difference is? Maybe training with a high vector count would be good if we could then remove the ones which seem to be associated with features we don't want.

You can do this by training on a high amount of vectors, then do inference on a low amount. For example, you can train at 64, then infer at 32 or 16.
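
A rough, untested sketch of what "infer at a lower count" can look like if you truncate the embedding file directly; the checkpoint keys assumed here ("string_to_param", etc.) match the layout this repo's EmbeddingManager appears to save, but verify them with torch.load(path).keys() on your own .pt first:

    import torch

    SRC  = "embeddings.pt"        # hypothetical path to the trained 64-vector embedding
    DST  = "embeddings_16vec.pt"  # hypothetical output path
    KEEP = 16                     # number of leading vectors to keep

    ckpt = torch.load(SRC, map_location="cpu")
    params = ckpt["string_to_param"]  # assumed key: {placeholder: tensor of shape [N, 768]}

    for placeholder, emb in params.items():
        print(placeholder, tuple(emb.shape))
        params[placeholder] = torch.nn.Parameter(emb[:KEEP].clone())  # drop trailing vectors

    torch.save(ckpt, DST)

Remember to also set num_vectors_per_token in v1-inference.yaml to the truncated count, as discussed above.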

CodeExplode commented 1 year ago

Would that combine multiple vectors into one though, or truncate all the vectors after the lower count?

I've just started training a 24 vector model which resumed from a 4 vector training checkpoint, so it seems going up works at least.

CodeExplode commented 1 year ago

If you guys have a single instance of successfully finetuning a photo likeness of a human being into SD with this code, please share

Somebody just posted excellent evidence that this is possible in the Stable Diffusion discord's #community-research room, which I don't want to post a screenshot of in case it violates their privacy.

https://discord.com/channels/1002292111942635562/1003207327203209236/1017148584036147201

SeverianVoid commented 1 year ago

I changed num_vectors_per_token to 10 and trained a model, and it trained just fine, but when I try to use the embeddings.pt to generate an image with the prompt "image of *" I get back the error "shape mismatch: value tensor of shape [10, 768] cannot be broadcast to indexing result of shape [0, 768]". How are you using the trained embedding after the fact with a num_vectors_per_token value greater than 1? The image samples produced during training look quite good; I just can't seem to get the embedding file to work.

hopibel commented 1 year ago

when I try to use the embeddings.pt to generate an image with the prompt "image of *" I get back the error "shape mismatch: value tensor of shape [10, 768] cannot be broadcast to indexing result of shape [0, 768]"

You have to set num_vectors_per_token in the matching v1-inference.yaml config file to the same value (10).

rinongal commented 1 year ago

Would that combine multiple vectors into one though, or truncate all the vectors after the lower count?

I've just started training a 24 vector model which resumed from a 4 vector training checkpoint, so it seems going up works at least.

It would truncate the later ones

CodeExplode commented 1 year ago

Would that combine multiple vectors into one though, or truncate all the vectors after the lower count? I've just started training a 24 vector model which resumed from a 4 vector training checkpoint, so it seems going up works at least.

It would truncate the later ones

I'm not sure if anybody understands this, but do the different vector values tend to map to unique concepts? e.g. If I trained a t-shirt design on a biased set of caucasian wearers, in theory is there a part of the vector which is probably more associated with pale skin than t-shirts, which could be found and decreased?

One consideration I've had is whether defining "a white/european man/woman wearing a {}" in the personalized.py templates might help or hinder that issue. E.g. it would more often succeed with pale-skinned wearers, but it would also only try to generate them, so it wouldn't get worse scores on other skin tones and learn to exclude them, maybe.

rinongal commented 1 year ago

I would not expect the multiple vectors to divide themselves cleanly into different concepts. You might be able to force this by doing something like our per-image-token approach (where the image-specific tokens tend to encode background information) or by using the version which introduces new tokens progressively.

We haven't checked this, either way.

On using training prompts like "a white/european man/woman wearing a {}": I've seen some posts that report better results when doing this. It makes intuitive sense (the model doesn't need to encode this information if it's already in the text), but if you use a large number of vectors for your token, it might just learn to capture this anyhow (because it might want to reproduce a specific shade of skin, for example).

CodeExplode commented 1 year ago

Does it seem possible that even in a single vector, there are specific weights which mostly influence one aspect of the image? (From what I understand, each vector is a set number of float weights which are fed to the model.)

rinongal commented 1 year ago

Yes, that's absolutely possible, and it's probably actually the case. Or rather, not specific entries of the vector, but you can probably find directions in this vector space which will modify specific attributes.

You can see similar things in most GANs, where the latent codes develop such directions - and there's actually been a paper that shows this can happen with diffusion model conditioning codes as well. I don't think anyone tested it on word embeddings yet, but I wouldn't be surprised if they also have such directions.

1blackbar commented 1 year ago

This is full txt2img, don't get discouraged. This is 60-vector finetuning; once you break through the overfitting you can get some decent results, and yes, you can also inpaint just the face - it works very well. These are results in a row, not cherry-picked. So what's the "sEcReT"? I don't think there is one; you just put the overfitted embedding into the prompt quite late, so be mindful of that. This is the prompt: dawn of the dead comics undead decomposed bloody zombie , painting by greg rutkowski, by wlop by artstation zombie portrait of zombie slyf as a zombie. Yeah, "slyf" is the embedding. [images]


dboshardy commented 1 year ago

Really fantastic stuff. Are you using a different codebase to do the inpainting?

1blackbar commented 1 year ago

This is not inpainting, it's txt2img. And to answer: it's the nicolai repo, a fork of this one, for the inversion; the inpainting is done in AUTOMATIC1111's webui.

dboshardy commented 1 year ago

How many images did you train over? Did you end up going further than the paper's recommended 5?

I also wonder how much training is necessary to get certain levels of detail. I've been stopping at 10k global steps usually. You get some differences between different checkpoints, but I haven't gone higher than 10k. Did you find better results going higher?

1blackbar commented 1 year ago

That many (see image); I think it was 15k or 20k iterations, vector count 60, "face" as the init word, but I'll try "person" too. [image]

dboshardy commented 1 year ago

Really cool work! Are those thumbnails cropped down from originals that have the whole head, or is it able to generalize the face to whole human heads?

1blackbar commented 1 year ago

No, nothing is cropped; I cropped just the eyes to lips, but IMO it's not really needed unless you do inpainting on hi-res images and you need that detail. Also, more proof this is not inpainted: a grid result. Yeah, that Bruce ranger one was inpainting, though, obviously. It gave me more comic-like results one by one; group grids are always weaker, I've noticed. [images]

[image] So, I really keep 2 models: one for style change, with mediocre likeness but great style change, and a second one, which is the one you see now, with great likeness but which needs the embed called pretty late in the prompt. I merge both when needed.

dboshardy commented 1 year ago

I merge both when needed

As in you use img2img with the output of one into the other?

And are you referencing the dreambooth repo you linked here from?

There are so many projects in this space so fast, it's hard to keep up!

1blackbar commented 1 year ago

DreamBooth is presumably better, but it trains the entire 4 GB model, so you end up with a new 4 GB model with the one new subject you trained on added. I prefer this repo's method instead, even if I have to fix the face sometimes. I also tried the huggingface one but only had luck with females, which is weird. So I'd say use the nicolai25 repo for inversion, give it 40 vectors or so, and you will get great identity preservation very soon; once you see it you can stop finetuning.

I'm still learning like we all are; if you guys have methods I'm happy to learn from you. I love SD - as an artist this is a big revolution in the making IMO. Prompt: undead decomposed bloody zombie , art by greg rutkowski, portrait of blee as a zombie [images]


1blackbar commented 1 year ago

OK, I'll document some stuff here. Here's the finetuning process on colab: it already showed the first great likeness after about 2500 iterations; with Sly I had the second sample showing great likeness (yes, after 1000 iterations).

The Bruce Lee above is 5 vectors, base_learning_rate: 5.0e-03, init word "face", and 15k iterations.

Settings - base_learning_rate: 5.0e-03

    personalization_config:
      target: ldm.modules.embedding_manager.EmbeddingManager
      params:
        placeholder_strings: ["*"]
        initializer_words: ["face"]
        per_image_tokens: false
        num_vectors_per_token: 40
        progressive_words: False


Training face set. I will see another 2 scaled samples, and if they're good, training is done. Or... the good thing is that it saves embeddings every 500 iters, so I can test which ones already have too heavy overfitting and just pick the one that I think is most versatile (identity leak vs. stylisation). [image]

OK, the likeness embedding is done; even after 3000 iterations it looks good. To change a style you need to finetune longer; the stylized one you just do normally, and I do it on huggingface currently. It's one vector.

You can use that embedding to fix faces because the likeness is 100%. [image]

But I have another embedding of this subject that has decent likeness and great stylisation (no fix needed in most cases), finetuned on the huggingface diffusers inversion colab, and here are results from it without any fixing: [images]

OK, now the stylized one with decent likeness. Let's do 2 vectors, "person" init word, and let's use v1-finetune.yaml on defaults but batch size 1, num workers 16, colab on a Tesla P100. Same number of images and all. After about 4000 iterations the likeness gets better but is not there yet. Editability is pretty great. The epoch count doesn't really matter, because it depends on the number of photographs; with one photo you get a lot more epochs than with 5 at the same iteration count. As you can see, the scaled results are the only ones that matter, and the most recent one looks pretty good already, but it's half of the face in frame, so... we will see how it goes. [image]

OK, it gets better fast... I will post more scaled sample results once I get a satisfying likeness. I'm leaving all this info here as a guide, also for myself in the future in case I forget what I tested and the outcomes... OK, I noticed that macro head shots have likeness at about 90%; further away from the camera it's worse. The 5000 one is great, but the most recent one next to it is still wrong - further away from camera. [image]