invoke-ai / InvokeAI

Documenting Textual Inversion Training and Results #699

Closed: Any-Winter-4079 closed this issue 2 years ago

Any-Winter-4079 commented 2 years ago

This is a request to add more documentation / create a guide with useful tips for the Textual Inversion process, written by people who have successfully gotten it to work. For context, this is the current guide for Textual Inversion.

We have finally been able to train on M1, and so far we have mixed (initial) results.

Among other things, we seem to be losing context. For example, after training on 3 burger images, "a close-up photo of * in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324 gives:

[Screenshot 2022-09-19 at 12:58:17]

but if we request something different, like a cat: "a cat in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324

[Screenshot 2022-09-19 at 12:59:12]

and if we try to mix both: "a close-up photo of a * being eaten by a cat in the style of Van Gogh" -s15 -W512 -H512 -C7.5 -Ak_lms -S1380320324

[Screenshot 2022-09-19 at 13:14:25]

Some observations:

  1. It's not the same burger it was trained on (e.g. if I trained with my face, this would be the equivalent of showing another person's face).
  2. It loses context (e.g. if I use 'in New York', 'being eaten by a cat'... it seems to be ignored).

I have tried some pretrained embeddings, e.g. Ugly-Sonic, and they do seem to keep context better. At low steps, it seems to struggle with bodies/faces: "<ugly-sonic> in front of the Statue of Liberty in New York, artstation, 8k, high quality" -s10 -W512 -H512 -C7.5 -Ak_lms -S667265387

[Screenshot 2022-09-19 at 13:25:34]

But using the same seed with 10x more steps: "<ugly-sonic> in front of the Statue of Liberty in New York, artstation, 8k, high quality" -s100 -W512 -H512 -C7.5 -Ak_lms -S667265387

[Screenshot 2022-09-19 at 13:25:13]

Also, if I request something unrelated to what it was trained on, it does produce those images: "a cat in the style of Van Gogh" -s100 -W512 -H512 -C7.5 -Ak_lms -S48976874

[Screenshot 2022-09-19 at 13:28:37]

The original repo says:

> In the paper, we use 5k training iterations. However, some concepts (particularly styles) can converge much faster.

which is not a lot of information.

I think it would be great if we could see metrics from successful attempts, including val/loss_simple_ema values (e.g. what values it reaches, how they progress per epoch, etc.). Sharing the images folder may also be useful, as it includes train and val images (some of which, for me, were reconstructions of my training images, others were new burgers, while some were unrelated images and others were black).
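For anyone sharing numbers, here is a minimal sketch of how the per-epoch val/loss_simple_ema progression could be summarized from a training log. It assumes a PyTorch-Lightning-style metrics.csv with epoch and val/loss_simple_ema columns; the path and column names are assumptions for illustration, not something taken from this repo.

    # Minimal sketch: summarize val/loss_simple_ema per epoch from a training log.
    # ASSUMPTION: a PyTorch-Lightning-style metrics.csv with 'epoch' and
    # 'val/loss_simple_ema' columns; adjust the path and column names to your run.
    import pandas as pd

    def loss_per_epoch(csv_path: str, col: str = "val/loss_simple_ema") -> pd.Series:
        df = pd.read_csv(csv_path)
        df = df.dropna(subset=[col])            # keep only rows where the metric was logged
        return df.groupby("epoch")[col].mean()  # one value per epoch

    if __name__ == "__main__":
        print(loss_per_epoch("logs/my_textual_inversion_run/metrics.csv"))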

Some more information could also be added, e.g. could we change

    'a photo of a {}',
    'a rendering of a {}',
    'a cropped photo of the {}',
    'the photo of a {}',
    'a photo of a clean {}',
    ...

to just 'a photo of a {}', for example? And if so, would it be quicker to learn? Would it learn at all? Could we reduce the DDIM sampling from 200 steps to, say, 100 steps? Would it still learn? Could we change the sampler? Could we swap val/loss_simple_ema for another metric?
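To make the template question concrete, here is a minimal sketch of what trimming the list down to a single template could look like. It assumes the captions are built by formatting the placeholder token into a randomly chosen entry of a template list (in the textual-inversion code this is something like imagenet_templates_small in ldm/data/personalized.py; the file and variable names here are assumptions, not code from this repo).

    # Minimal sketch, assuming captions come from a template list such as
    # imagenet_templates_small in ldm/data/personalized.py (names are assumptions).
    # Trimming the list to one entry is the change being asked about above.
    import random

    personalization_templates = [
        "a photo of a {}",  # a single template instead of the full list
    ]

    def build_caption(placeholder_token: str = "*") -> str:
        """Format the placeholder token into a randomly chosen template."""
        return random.choice(personalization_templates).format(placeholder_token)

    print(build_caption())  # -> "a photo of a *"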

As a summary of some of the questions:

  1. What val/loss_simple_ema values do successful runs reach, and how do they progress per epoch?
  2. Can the training prompt templates be reduced (e.g. to 'a photo of a {}' only), and would it still learn, perhaps faster?
  3. Can the DDIM sampling be reduced from 200 steps to, say, 100? Can the sampler be changed?
  4. Can val/loss_simple_ema be swapped for another metric?

Thanks a lot!

krummrey commented 2 years ago

I'd love to help. I was able to train a model on Hugging Face with a .bin file as a result, but I never got it to load into dream.py. I also trained a few models using this repository, but those results didn't load into dream.py either. I'd love to beta-test any code you throw at me. :)

Any-Winter-4079 commented 2 years ago

You can load it like this: python3 ./scripts/dream.py --embedding_path model.bin, with model.bin saved inside the stable_diffusion folder. For example, here is the Ugly Sonic .bin file: https://huggingface.co/sd-concepts-library/ugly-sonic/tree/main

Not sure why it's not working for you. Does it work with the bin file from Ugly Sonic?
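If a .bin refuses to load, a quick look inside it can help narrow things down. Below is a rough sketch (not code from this repo) that assumes two common formats: original-repo .pt checkpoints, which are dicts containing string_to_token / string_to_param, and sd-concepts-library .bin files, which are plain {token: tensor} dicts.

    # Rough sketch: peek inside an embedding file to guess its format.
    # ASSUMPTION: original-repo .pt checkpoints hold a dict with 'string_to_param',
    # while sd-concepts-library .bin files hold a plain {token: tensor} dict.
    import sys
    import torch

    def describe_embedding(path: str) -> None:
        data = torch.load(path, map_location="cpu")
        if isinstance(data, dict) and "string_to_param" in data:
            tokens = list(data["string_to_param"].keys())
            print(f"{path}: .pt-style checkpoint, placeholder tokens: {tokens}")
        elif isinstance(data, dict):
            for token, tensor in data.items():
                print(f"{path}: concepts-library .bin, token {token!r}, shape {tuple(tensor.shape)}")
        else:
            print(f"{path}: unrecognized format ({type(data).__name__})")

    if __name__ == "__main__":
        describe_embedding(sys.argv[1])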

I'd be interested to see what results you get if you run prompts like "a photo of * in New York", "a photo of * on the beach", etc.

Also, if it is capable of reproducing what it learnt (*) plus context (e.g. a specific background), it'd be interesting to see the number of epochs you trained for, the val/loss_simple_ema values you got as you trained (if you keep them), etc.

Any-Winter-4079 commented 2 years ago

About code: these are our changes vs. the development branch: https://github.com/lstein/stable-diffusion/compare/development...tmm1:dev-train-m1

With that code we are able to run on M1 and load the resulting file (.pt) into dream.py. Maybe you could use that code/those changes to re-train and see if it loads now?

krummrey commented 2 years ago

I'm using the original Hugging Face Colab to do the number crunching. Import into dream.py works for the .bin files created. I've focused on new styles, trying out some that I've trained and some trained by others in the Hugging Face library. Here's what I've found so far:

The new styles are very strong and work well on the subjects they were trained on. If I train on portraits, the generated new portraits have a fairly good style transfer. When I try to generate other subjects I get a mixed bag of results. Styles trained on portraits will try to include a person whenever possible: if I ask for an image of a hamburger, I get a person eating a hamburger. Negative prompts do not seem to have an effect. And if the training images are even slightly NSFW, pretty much all of the generated images will fail the NSFW filter if it is enabled, even with a totally safe prompt.

So for styles, it seems that they work best on the subjects they were trained on, more so than the general model, which has fewer problems generating a hamburger in any style.