rinongal / textual_inversion


Code explanation? #19

Open 1blackbar opened 1 year ago

1blackbar commented 1 year ago

Can you explain why the defaults in the yaml file are the following?

```yaml
target: ldm.modules.embedding_manager.EmbeddingManager
params:
  placeholder_strings: ["*"]
  initializer_words: ["sculpture"]
  per_image_tokens: false
  num_vectors_per_token: 1
  progressive_words: False
```

Does that mean I am training a sculpture even if my --init_word is "mydog sketch"? I don't quite get it. Should I remove "sculpture" from there? Where are the instructions about all this? Was it skipped because it's not important? I'd prefer to know before I spend days on training and then realise I trained a sculpture instead of a sketch drawing.

Or are these just the defaults for when you don't use --init_word, and they're ignored once you do pass it?

Next question: I want to train the Stable Diffusion model on images of He-Man from the Filmation cartoon (because right now He-Man doesn't look much like him in Stable Diffusion). So I have --init_word "he-man filmation" and an image folder with frames from the cartoon. Will this improve other versions of He-Man in any way (when I prompt "he-man filmation photorealistic, film still"), or will only this Filmation cartoon version be improved, and only when I hit both words from --init_word rather than just one? What if I prompt only "he-man" without "filmation"? Will the embedding checkpoint affect my results then too? I eventually want to train on more realistic images of He-Man; should I do that separately, or should I just put all He-Man images, realistic and cartoony, together and train on them at once? I can't find info on this anywhere. Basically, I want to force out the current bad-looking He-Man images and bring in good-looking ones via the embedding, so that cartoon and realistic He-Man images don't come from the original Stable Diffusion model but rather from the finetuned checkpoints where he looks like he should. So when I train with --init_word "he-man filmation", should I adjust something in the yaml file, like the initializer words, or is the command line with --init_word enough?

Next question: I can see some blank white images in the preview during training, with the text "a photo of my *" on them. Why are they white? I don't get this part at all; am I training the wrong way? Should they show He-Man images? I should add that I can also see He-Man images in this folder (the reconstruction gs and inputs gs images), but I'm concerned about the white ones named "conditioning". What are they? What are the images called "reconstruction gs"? What are the images called "inputs gs"? What are the images called "samples"? Are those actual results from the model with the finetuned data being used during generation? I know that samples are made with makesamples.py, but it doesn't contain any info on what the samples are or how they might guide you. Sorry for all the questions, but I want to get this right, and the paper doesn't cover any of this. Where can I read what each image means for finetuning? Also, how can I resume training? Just by running it again with the same command?

ExponentialML commented 1 year ago

The basis for a lot of your questions is addressed in this issue.

1blackbar commented 1 year ago

Thanks, so it's what I suspected: --init_word just overwrites what's in the yaml, so it's good that I'm not training a sculpture. I'm still wondering about the other questions though. I'll be training just the cartoon version for now and see if it affects other, non-cartoon styles.

rinongal commented 1 year ago

Config file:

Most of these extra parameters are outlined in the actual paper. They are just flags for enabling our baselines.

Here's what they do:

- placeholder_strings: the placeholder word(s) that stand in for your new concept in prompts (by default "*").
- initializer_words: the word(s) whose embedding is used to initialize the new token. The --init_word flag overrides this.
- per_image_tokens: one of the paper's baselines; it assigns a different placeholder to each training image.
- num_vectors_per_token: how many embedding vectors each placeholder maps to.
- progressive_words: another baseline, which progressively extends the number of learned tokens during training.

Training on He-man images:

We didn't train on mixed modalities. You're free to give it a try and see what works best for you. Don't use 'he-man filmation' as your init word because it's not a single token word. Try something like 'cartoon'. You don't need to adjust the initializer words in the config if you use the --init_word flag. It's there exactly for this reason - to let you avoid editing the config every time.
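If you want to sanity-check whether a candidate init word is a single token, you can run it through a tokenizer. A minimal sketch, assuming the Hugging Face CLIPTokenizer (the tokenizer your particular checkpoint uses may differ, but the check is the same idea):

```python
from transformers import CLIPTokenizer

# Tokenizer of Stable Diffusion's text encoder (assumption: your target model
# uses CLIP; the LAION-400M LDM config uses a different tokenizer).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for word in ["cartoon", "sculpture", "he-man filmation"]:
    tokens = tokenizer.tokenize(word)
    print(f"{word!r} -> {tokens} ({len(tokens)} token(s))")

# Common words like 'cartoon' map to a single token; 'he-man filmation'
# splits into several pieces, which is why it won't work as an init word.
```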


Output folder images:

gs stands for global step; it's the training step at which the images were produced.

The output you want to track is samples_scaled. Everything else is mostly for debugging purposes.
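For example, if you just want to pull up the most recent samples_scaled images from a run, something like this works (a rough sketch; the log directory path is hypothetical and depends on your run name and the logging layout):

```python
import re
from pathlib import Path

# Hypothetical path: adjust to your own run's image log folder.
log_dir = Path("logs/my_run/images/train")

def global_step(path: Path) -> int:
    # File names embed the global step, e.g. "samples_scaled_gs-005000_...png".
    match = re.search(r"gs-(\d+)", path.stem)
    return int(match.group(1)) if match else -1

samples = sorted(log_dir.glob("samples_scaled*"), key=global_step)
for path in samples[-5:]:  # the five most recent logging steps
    print(global_step(path), path.name)
```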

1blackbar commented 1 year ago

Thanks a lot, those are very helpful. So if I don't have to specify via --init_word that it's He-Man in the images at all, how does the model know that it's him? This part is confusing. Also, can I somehow force out trained data from the original 4 GB ckpt weights? I think a lot of images in the dataset were tagged as he-man that weren't actually him, and it corrupts the results too much. I tried to resume training of the model but it just starts over; how do you resume, and what's the command line to do it properly?

Also, is there a specific way to name your images so it knows which ones represent the back view, side view, or portrait view? Do I have to name the images at all, or can the names be gibberish and it will work the same? Any rules for making the best image dataset to help with training? How do I "tell" it that I'm training a specific person like "chuck norris"? It can't just "know" that's him if I put some random bearded-man images in the set, can it? How does that work?

rinongal commented 1 year ago

We don't currently have resume support. There are a couple of requests for this, so we'll have it done over the weekend.

We're learning a new word that represents he-man, using the images you're providing. So it shouldn't matter if the pre-trained model saw unrelated images tied to this text. What you'll have at the end of the day is a new word, '*', which represents your images and which you can use in prompts like "a photo of *" or "an oil painting of * hanging on the wall".

If you provide images of some random bearded man, it will learn the random bearded man, not Chuck Norris.

I don't think it will work very well if you have too many different viewpoint images. I'd try to focus on frontal stuff.
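If it helps to make the "new word" part concrete, you can open the saved embeddings checkpoint and look at the learned placeholder and its vector(s). A rough sketch, assuming the checkpoint stores the EmbeddingManager's string_to_token / string_to_param dicts (check embedding_manager.py if your version saves different keys; the path is hypothetical):

```python
import torch

# Hypothetical path: use an embeddings_gs-*.pt file from your own log folder.
ckpt = torch.load("logs/my_run/checkpoints/embeddings_gs-5000.pt", map_location="cpu")

# Assumed layout: placeholder string -> learned embedding tensor.
for placeholder, embedding in ckpt["string_to_param"].items():
    # e.g. '*' -> a (1, 768) tensor when num_vectors_per_token is 1
    print(placeholder, tuple(embedding.shape))
```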

1blackbar commented 1 year ago

So the best way to train He-Man in my case is to just use --init_word "cartoon" and not name the images in any specific way at all? Because the original LAION dataset that Stable Diffusion was trained on had images plus text describing what's in each image: https://huggingface.co/datasets/laion/laion2B-en Isn't that supported, or is it not needed? Also, I did 25k iterations; it picks things up quite slowly, but it does slowly start to look like him. Is it possible that training can last a few days? At the beginning it doesn't look like him at all, then it starts to resemble him, but the colors are still wrong and the armor doesn't look like his. That's why I'm wondering whether I'm doing something wrong. I read in a few places that you don't need to train for long, but how can that be if it still doesn't look like him after 25k iterations? Can I train as long as I want until he looks perfect? Is it even possible for him to look identical to the original input images? And is having 20 images really worse than 5? It looks that way from the paper: 25 is way off while 5 gives the best editability and resemblance. Will this improve in the future?

So you say I should use "a photo of *". What if I trained another person from the He-Man cartoon, and he's using * as well? If I want to merge two embeddings and they were both trained with identical settings, how do I even tell it that I want an image of the first person and not the second, if they're both *? Should I use * for the first one and # for the other? Where's a list of acceptable symbols? Can $, %, & be used? I come from a gamedev background and do a lot of debugging, hence my logic about all this, as one mistake can collapse everything.

Why shouldn't I use back-view images? Does it require extra code to differentiate front from back, and does that confuse training? Will it eventually come in the future? I presumed that naming my images portrait, front, back, low angle, etc. would help training. I also have creatures that look nothing like humans (a pile of mud with eyes and a mouth); even the front angle looks like nothing out there, but it still looks quite specific to that subject. What about those, is that a lost cause? I assumed you have to be very, very specific in explaining to the AI what you want to train and what camera angles the images represent, since that's how we have to prompt: low angle, high angle, portrait, macro close-up, and so on.

Ultimately my goal is to build a library of pt files, one pt file per subject, and use them all at once by merging embeddings, so I will run out of symbols pretty quickly. I got a suggestion that I should skip --init_word and just edit the yaml file to use initializer_words: ["he-man", "cartoon", "filmation"] in my case. Is that the best way to go about it? Kind of like Google searching? I want to be specific because there were other cartoons where he looks totally different, so I'm using "filmation" for this; later I want to merge this pt file with other He-Man versions and use all the versions at the same time while still having each one work. My goal is the best resemblance; I don't mind training for a long time or editing text files.

Can I train He-Man's face separately on 5 images, then train on his armor only (just chest armor, no face, no legs) separately on 5 images, and merge the two embeddings to get the best results? Was this tested? Then his boots, his back only, and so on, a kind of modular approach that builds an embedding out of several embeddings trained separately. Has anyone tested stacking embeddings of the same subject? So I train on 5 images of the face and use *, then train on another 5 images of the face (with more expressions) and use #, then merge the embeddings (or should I use * for both training sessions and then merge? What is merging actually? A blend, if you use * as the placeholder in all the pt files I want to merge?). I have to test whether this improves anything (particularly identity/likeness) compared to just giving it everything at once in one training session. If it does, would you consider supporting this kind of splitting in your code, i.e. more detailed training on particular parts of a subject?

rinongal commented 1 year ago

Sorry, completely missed this followup.

You don't need to name your images. Using just "cartoon" for the init word is good, yes. What we're doing is learning a 'word' that describes the object. You don't need to give us this description.

25k iterations seems a bit much. I wouldn't expect it to keep improving that far. Most of the time, ~5k iterations is enough. Identical results to the inputs: it's possible if you do things like increasing the number of vectors per token in the config, but this will also cause a sort of overfitting where it's harder to change the subject with new text.

You can certainly put your init_words in the config instead of the --init_word argument, yes. I don't think the tokenizer knows filmation and he-man so those will likely be multi-token words and will cause problems. Using "cartoon" is absolutely fine. We didn't try training on different body parts and merging them, but I'm not sure it will work. Using multiple subjects in one prompt tends to merge them semantically, not place them side-by-side.
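On the earlier "what is merging actually" question: conceptually, merging just collects the learned placeholder-to-vector entries from several embedding files into one, which is why each subject needs its own placeholder string. A rough sketch of the idea only (this is not the repo's merge_embeddings.py; it assumes the string_to_token / string_to_param checkpoint layout and hypothetical file names):

```python
import torch

# Hypothetical inputs: two embeddings trained with different placeholders.
first = torch.load("heman_face.pt", map_location="cpu")
second = torch.load("heman_armor.pt", map_location="cpu")

merged = {
    "string_to_token": dict(first["string_to_token"]),
    "string_to_param": dict(first["string_to_param"]),
}

for key in second["string_to_param"]:
    if key in merged["string_to_param"]:
        # Both runs used the same placeholder (e.g. '*'); the vectors would
        # clash, so separate subjects need distinct placeholder strings.
        raise ValueError(f"placeholder {key!r} appears in both embeddings")
    merged["string_to_token"][key] = second["string_to_token"][key]
    merged["string_to_param"][key] = second["string_to_param"][key]

torch.save(merged, "merged_embeddings.pt")
```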

1blackbar commented 1 year ago

Wouldn't using "identity" or "likeness" or "male" be better, even though it's a cartoon version? I'm training a character, not a cartoon style. Has anyone figured out a best universal way to train a likeness while staying editable into other styles? I did that Stallone test and the results were great, but that was 50 vectors, which is way too much; with 20 vectors the likeness got worse. What other way is there to improve likeness? https://github.com/rinongal/textual_inversion/issues/35 Are there any ways to separate the subject from the style when finetuning, so it's not so heavily embedded into photorealism? Is there a way to tell how many vectors the likeness of, say, Trump or Zuckerberg takes up in the SD weights? I tried with 2 vectors, but it's getting there very slowly.

rinongal commented 1 year ago

You can combat the overfitting by using the weighted prompts approach from the webui repo. I'm not sure how well it would work when you're using 50 vectors, but you can try giving the rest of the prompt a very high score compared to the part with the placeholder ("*"). Regarding the initial words: I don't have a better tip than just to try. In practice your results are probably more sensitive to your initial seed than they are to this choice of an initial word.
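Just to illustrate the idea (not the webui repo's actual implementation): prompt weighting amounts to blending the conditioning of the placeholder part and the style part of the prompt, with more weight on the style. A conceptual sketch, assuming an LDM-style model that exposes get_learned_conditioning():

```python
def weighted_conditioning(model, prompts_and_weights):
    """Blend the conditionings of several prompt parts, normalized by total weight.

    Conceptual sketch only: the webui repo applies weights per token inside a
    single prompt, but the effect is similar in spirit.
    """
    total = sum(weight for _, weight in prompts_and_weights)
    cond = None
    for prompt, weight in prompts_and_weights:
        c = model.get_learned_conditioning([prompt]) * (weight / total)
        cond = c if cond is None else cond + c
    return cond

# Usage (hypothetical weights): push the style hard, keep the placeholder light.
# cond = weighted_conditioning(model, [("an anime drawing of a man", 3.0),
#                                      ("a photo of *", 1.0)])
```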

1blackbar commented 1 year ago

I compared 2 vectors (36k iterations) vs 5 vectors (5k iterations): it takes the same amount of overwhelming it with words representing the style, there's practically no difference, it just takes longer to train the 2-vector one to reach a likeness similar to the 50-vector one. So I think in the end it doesn't matter much how many vectors you use if you want to preserve a style: you will pay with overfitting, but you will get there faster with 50 vectors. I did manage to change from photo to oil painting, but when changing to anime, Stallone got mixed with a Japanese man. I think it randomly takes away from the identity tokens when you force the style, so you lose identity but the style transfer becomes more possible. That's not always the case though; I could get better identity with an Ilya Repin oil painting. All in all, is this the final version of this approach, or are new ones coming to battle the overfitting and identity-preservation issues?