rinongal / textual_inversion


Got weird results, not sure if I missed a step? #35

Open altryne opened 2 years ago

altryne commented 2 years ago

Hey @rinongal thank you so much for this amazing repo.

I trained for over 10K steps, I believe, on around 7 images (trained on my face), using this colab.

I then used those .pt files when running the SD version right in the colab, and a weird thing happens: when I mention * in my prompts, I get results that look identical to the photos in style, but it does try to... draw the objects.

For example: (see attached screenshot)

The prompts were "portrait of joe biden with long hair and glasses eating a burger, detailed painting by da vinci" and "portrait of * with long hair and glasses eating a burger, detailed painting by da vinci".

So SD added the glasses and the eating pose, but completely disregarded the "detailed painting", the da Vinci, and the style.

What could be causing this? Any idea? 🙏

1blackbar commented 1 year ago

OK, continuing in a new post. This is 6000 iterations; I tested a style: (image)

... Already got decent likeness with the 7500-iteration embedding and a style, but it's still mixed. I also noticed something, from now on you have to... blah blah blah, I'm an idiot!!! I went up to 30k iterations kinda unhappy, but guess what: I kept checking my .pt files in the webui, but I kept using the same name, thinking it updates automatically, but NO! So I went to 30k thinking, what the heck, why do the samples look OK but the SD results are crap... and then it hit me: I should have changed the name to a different one each time so it would LOAD the embedding every time I bring it in. Soooo, the 5K result, ladies and gents: (images)

12500 iters: (images)

Jim Lee style: (images). Da Vinci style: (image)

OK, frozen seeds now. Picasso: (image). Rembrandt: (image)

Bouguereau: (image)

Caravaggio: (image). Delacroix: (image). Frazetta: (images). Beksinski: (image)

You're welcome! I went from being totally sceptical to being blown away into orbit today!

rinongal commented 1 year ago

@1blackbar Those are great! Really really happy to see you've got it working so well!

Just a heads up - if you use 60 vectors and a very long prompt, you're essentially doing what @CodeExplode suggested and truncating the vectors. This is because the model will only use the first 77 tokens. So if you have 30 for the prompt + 60 for the placeholder at the end, you'll only use the first 47 of those 60.

You can see it in this line (and the one above it), where we truncate the vector with [:n]: https://github.com/rinongal/textual_inversion/blob/5862ea4e3500a1042595bc199cfe5335703a458e/ldm/modules/embedding_manager.py#L124
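For anyone skimming, here is roughly what that truncation amounts to. This is a minimal sketch, not the repo's actual code; `prompt_embeds`, `placeholder_vectors`, and `splice_placeholder` are illustrative names:

```python
import torch

MAX_TOKENS = 77  # CLIP text encoder context length

def splice_placeholder(prompt_embeds: torch.Tensor,
                       placeholder_vectors: torch.Tensor,
                       position: int) -> torch.Tensor:
    """Replace the single placeholder token at `position` with all of its
    learned vectors, then cut the sequence back to MAX_TOKENS."""
    spliced = torch.cat([
        prompt_embeds[:position],       # tokens before the placeholder
        placeholder_vectors,            # e.g. 60 learned vectors
        prompt_embeds[position + 1:],   # tokens after the placeholder
    ], dim=0)
    # Anything past 77 rows is silently dropped, so with ~30 prompt tokens
    # and 60 placeholder vectors only the first ~47 vectors are ever used.
    return spliced[:MAX_TOKENS]

# 30 prompt tokens + 60 placeholder vectors -> 89 rows, truncated to 77
prompt_embeds = torch.randn(30, 768)
placeholder_vectors = torch.randn(60, 768)
print(splice_placeholder(prompt_embeds, placeholder_vectors, position=29).shape)
# torch.Size([77, 768])
```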

1blackbar commented 1 year ago

You know what, I might not need 60-vector embeddings; the one I just did is magnificent. Love the code, guys!!!

CodeExplode commented 1 year ago

Yeah, I've suspected part of the reason you're able to overwhelm the style of your embedding when putting it at the end is that it's losing some of the embedding vectors. Interestingly, however, it seems that all the important parts were stored in the leading vectors.

I suppose it's possible that if your personalized.py prompts were long enough, it was only training on the front half of your embedding vectors. That could be a handy way to discard vectors which catch other elements of the training set.

1blackbar commented 1 year ago

Ah, the prompt for these was "painting by rembrandt of centered head close up of teenage ewel5". I just changed rembrandt into another artist and magic happened. (images)

CodeExplode commented 1 year ago

You know what, I might not need 60-vector embeddings; the one I just did is magnificent. Love the code, guys!!!

I just accidentally ran a 50 vector embedding with only 20 enabled and noticed it didn't do much. After eliminating the first 20 and only using the last 30, the results were still pretty good, maybe even better. Even just using a few of the vectors from near the end seems to give a lot of the correct result. Some vectors in isolation, or in pairs with the vectors beside them, seemed to define noticeable features of the training object, such as material, colour, texture, shapes, etc.

My current plan is to try to see if I can update training to occasionally test each vector and find the one which seems to have the least impact on the results, and remove it, lowering the vector count each time to the minimum number required to get good results. It might be better if it's a manual review process for difficult to embed cases, where say 50 images per excluded vector are generated, and a manual decision can be made to remove one or more vectors, then keep training.
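If it helps, here is a rough sketch of that pruning loop as I understand the idea; `generate_samples` and `score_likeness` are hypothetical stand-ins for whatever sampler and similarity metric you already have, not functions from this repo:

```python
import torch

def least_important_vector(embedding: torch.Tensor,
                           generate_samples,
                           score_likeness,
                           n_samples: int = 8) -> int:
    """Return the index of the vector whose removal hurts likeness the least.
    `score_likeness` is assumed to return a float."""
    baseline = score_likeness(generate_samples(embedding, n_samples))
    impact = []
    for i in range(embedding.shape[0]):
        masked = embedding.clone()
        masked[i] = 0.0                       # drop vector i
        score = score_likeness(generate_samples(masked, n_samples))
        impact.append(baseline - score)       # small drop => low importance
    return int(torch.tensor(impact).argmin())

# One could call this between training rounds, physically remove the returned
# row, and keep shrinking the embedding until quality starts to suffer.
```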

1blackbar commented 1 year ago

I noticed that from 10 vectors up, the likeness gets resolved very quickly during finetuning, on the first sample previews. Not perfectly, but much faster than with fewer vectors. Your idea sounds good. I'd also like to know: what kind of learning rate will get me even more likeness, even if it costs more time? The default one is 5.0e-03; huggingface has base_learning_rate: 5.0e-04, and they got some of my subjects great with just one vector at that rate. How do I push this rate into extreme mode so it works even better (costing more training time)? I'm gonna do a test with a plain white background so it won't waste time learning it.

hopibel commented 1 year ago

@1blackbar Note that the learning rate is multiplied by the number of gradient accumulation steps, and the example script on huggingface has it set to 4, while this repo uses 1 by default.
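For reference, a quick back-of-the-envelope of that scaling, assuming the usual latent-diffusion `--scale_lr` convention (effective LR = accumulation steps × GPUs × batch size × base LR), which I believe this repo inherits:

```python
def effective_lr(base_lr, accumulate_grad_batches=1, n_gpus=1, batch_size=1):
    # effective LR under --scale_lr: accumulation steps * GPUs * batch size * base LR
    return accumulate_grad_batches * n_gpus * batch_size * base_lr

print(effective_lr(5.0e-03))                              # 0.005 (base 5e-3, no accumulation)
print(effective_lr(5.0e-04, accumulate_grad_batches=4))   # 0.002 (base 5e-4, 4 accumulation steps)
```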

1blackbar commented 1 year ago

So how can I get the same settings with this code, so it's like on huggingface? Is that basically the same code?

Also guys, I had another success using identical settings: "person" init word, 2 vectors. At about 10k the identity kicked in, it's so great! With just 2 vectors!!!

rinongal commented 1 year ago

@hopibel Probably hit the nail on the head. Huggingface uses more gradient accumulation steps, which means you're working with a larger effective batch size and are less likely to fall into minima like overfitting the background of a specific image with your tokens.

1blackbar commented 1 year ago

Where can I change the steps in this repository?

hopibel commented 1 year ago

Set accumulate_grad_batches at the end of the yaml config, right next to max_steps.
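To clarify what that key does: as far as I can tell, the `lightning: trainer:` section of the yaml is passed through to PyTorch Lightning's Trainer, so the keys map to Trainer kwargs roughly like the sketch below (values are illustrative, not the repo's defaults):

```python
from pytorch_lightning import Trainer

trainer = Trainer(
    benchmark=True,              # cudnn autotuner, as in the shipped configs
    accumulate_grad_batches=4,   # average gradients over 4 micro-batches
    max_steps=100000,            # illustrative value from this discussion
)
```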

CodeExplode commented 1 year ago

Has anybody had any luck setting accumulate_grad_batches higher? With a value of, say, 4, I ran into issues like testing being delayed until 4 times as many iterations had passed, and then 4 rounds of testing being done in a row.

1blackbar commented 1 year ago

I did. It helps with identity a lot; why it was removed from the yamls, I don't know. I'm still testing: 850 iters is showing very good identity, though it takes about 4 times longer. My settings: benchmark: True, accumulate_grad_batches: 4, max_steps: 100000

1blackbar commented 1 year ago

OK, heads up: identity might be OK, but stylisation is crap. I think this is overfitting waaay faster, but at low iterations. Not sure if there's a way to get editability with batches of 4, maybe combined with a much slower learning rate, I don't know. The thing is, I type in "frazetta painting of subject" and it changes to a painting, but it's in no way a Frazetta.

hopibel commented 1 year ago

@1blackbar If you didn't change any other settings, you basically quadrupled the learn rate due to --scale_lr

ThereforeGames commented 1 year ago

OK, heads up: identity might be OK, but stylisation is crap. I think this is overfitting waaay faster, but at low iterations. Not sure if there's a way to get editability with batches of 4, maybe combined with a much slower learning rate, I don't know. The thing is, I type in "frazetta painting of subject" and it changes to a painting, but it's in no way a Frazetta.

Is this not an issue with huggingface? And I assume you changed the learning rate here to 5.0e-04 as well, yeah?

Sounds like there's still something different about their config.

1blackbar commented 1 year ago

From what I see, their learning rate changes with time, and the accumulation changes too. Also, for some reason, 3 females finetuned fine on huggingface, but males not to the same level, go figure. Also, the embeddings we finetune now work on the other 4GB ckpt files that people are training for themselves, so that's good.

rinongal commented 1 year ago

@1blackbar Where are you seeing that their learning rate changes with time? They appear to be setting the LR scheduler to constant mode by default.

Accumulation steps is basically saying: "I can't fit the full batch size in the GPU, so instead of doing a batch of 4 images, I'll do 4 batches of 1 image and accumulate the results", hence why it takes more "iterations". I wouldn't expect this to cause you more overfitting, other than any adjustments it makes to your LR.
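In training-loop terms, accumulation looks roughly like the sketch below (all names are placeholders): the gradients of several 1-image batches are summed before a single optimizer step, emulating the larger batch without the extra GPU memory.

```python
def train_with_accumulation(model, loss_fn, optimizer, batches, accum_steps=4):
    # Emulate a batch of `accum_steps` images while only ever holding one
    # image's activations in memory at a time.
    optimizer.zero_grad()
    for i, batch in enumerate(batches):
        loss = loss_fn(model, batch) / accum_steps   # scale so the sum averages
        loss.backward()                              # gradients accumulate in .grad
        if (i + 1) % accum_steps == 0:
            optimizer.step()                         # one update per accum_steps batches
            optimizer.zero_grad()
```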

ThereforeGames commented 1 year ago

Hi all,

I wrote a new script that effectively circumvents overfitting from Textual Inversion:

https://github.com/ThereforeGames/txt2img2img

Combine it with prompt weighting for the best results.

Would love to know your thoughts. Thanks.

hopibel commented 1 year ago

@ThereforeGames Would be interesting to see how this compares to prompt2prompt, which replaces concepts mid-generation

ThereforeGames commented 1 year ago

@hopibel Agreed! I haven't had a chance to play with prompt2prompt yet, but I have a feeling there's probably a way to integrate it with txt2img2img for even better results.

prompt2prompt seems amazing for general subject replacement, but I'm wondering how it fares with "ridiculously overtrained" TI checkpoints.

1blackbar commented 1 year ago

Currently I have the best results by inpainting a heavily overfit face onto a stylised result (img2img inpaint in the webui). Automating that would be interesting, but I feel that having more manual control over the result is just better, unless it can force styles into an overfit embedding so they don't all look like an inpainted photo-likeness on a cartoon version.

CodeExplode commented 1 year ago

I can confirm that the above txt2img2img approach works better than anything I've tried such as inpainting, having played around with it for a while on discord. It can do style, pose, and background changes on an embedding which otherwise would always overwhelm those prompts, and makes it very easy.

It has an autoconfigure setting which I turned off and which didn't work well in one attempt, but the author has mentioned it being very powerful and so the script might be even better than what I've seen so far, which is already great.

nerdyrodent commented 1 year ago

Nice. About to test it with my highly over-fitted embeddings which take ~10 mins to produce ;)

CodeExplode commented 1 year ago

p.s. We've been talking non-stop about textual inversion in the #community-research channel on the stable diffusion discord for days now, if anybody wants to join in. The script author gave me some tips getting it working which might be worth checking out if you have trouble.

https://discord.gg/stablediffusion

ThereforeGames commented 1 year ago

@hopibel I wrapped my head around Automatic's implementation of prompt2prompt - you can now use it with txt2img2img in the form of a custom prompt template.

So far I haven't figured out a way to use prompt2prompt that yields better results than my default prompt template. It often does a better job with background detail, and perhaps with editability, but likeness seems to suffer a bit.

Feel free to play around with it and let me know if you find a formula that works! The prompt templates are like a primitive scripting language so you can do a lot with them - check docs for more info

ExponentialML commented 1 year ago

@ThereforeGames Would be interesting to see how this compares to prompt2prompt, which replaces concepts mid-generation

I was going to make a separate issue about this, but Cross Attention Control and prompt2prompt are the solutions for the overfitting / editability of prompts. In my testing, I've had extremely good results (I primarily use the Dreambooth implementation with my custom script, but textual inversion works too).

What happens is that the newly trained word often gets prioritized over everything else, so you start from the init token and replace it with the trained token at x% of the steps during inference.

So if you have a Mustang 2024 trained, for instance, you could do something like "a photo realistic art piece of a [car:*:0.2] driving down the road, high quality", where car is the init, * is the trained token, and the trained word replaces the init at 20% of the inference process. You usually have to scale the percentage up or down with the number of steps you choose.

It's the same idea as txt2img2img, but without the img2img process.
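For anyone who wants to see the mechanics, here is a rough sketch of that swap as I understand it; `encode_prompt` and `denoise_step` are hypothetical stand-ins for the text encoder and sampler step of whatever pipeline you use, not the webui's actual implementation:

```python
def sample_with_prompt_swap(latent, encode_prompt, denoise_step,
                            steps=50, swap_fraction=0.2):
    # First 20% of the steps use the generic init word, the rest use the
    # trained token -- the same idea as [car:*:0.2] in the webui prompt syntax.
    cond_init = encode_prompt(
        "a photo realistic art piece of a car driving down the road, high quality")
    cond_trained = encode_prompt(
        "a photo realistic art piece of a * driving down the road, high quality")
    swap_at = int(steps * swap_fraction)
    for t in range(steps):
        cond = cond_init if t < swap_at else cond_trained
        latent = denoise_step(latent, cond, t)
    return latent
```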

ThereforeGames commented 1 year ago

It's the same idea as txt2img2img, but without the img2img process.

Sick, that actually works quite well. I need to perform more tests but so far it's more or less on par with my script - sometimes even better.

ThereforeGames commented 1 year ago

Okay, now that I've had more time to play around with prompt2prompt, I can say that it generally yields "higher quality" pictures, but the likeness isn't always as good as txt2img2img. Here's an example where I could not get a Sheik-looking Sheik out of prompt2prompt:

(image)

Versus txt2img2img:

(image)

In the first one, the facial features and expression aren't right. I played around with the ratio from 0.1 to 0.3, but couldn't get it looking much better. Tried CFG scales from 7 to 15. Seems that likeness goes down as prompt complexity goes up.

It might help if we could autotune the CFG and prompt ratios over the course of inference, but I'm not sure how to go about doing that. txt2img2img has the advantage of being able to look at the result of txt2img before processing img2img.

Would love to figure out a way to combine the high level of detail and speed from prompt2prompt with the consistency of txt2img2img!

ThereforeGames commented 1 year ago

Here's another example - 1st is prompt2prompt and 2nd is txt2img2img:

(image 1)

(image 2)

If anything, the clothes might be better in prompt2prompt... but the face is way off!

1blackbar commented 1 year ago

prompt2prompt does work in the webui from AUTOMATIC1111. I think it works better with embeddings that were below 60 vectors.

a painting by greg rutkowski of close portrait shot of [sylvester stalone :slyf:0.5] on neon city background , by greg rutkowski

(images)