rinongal / textual_inversion


Best learning rate to immediately overfit asap ? #96

Open 1blackbar opened 2 years ago

1blackbar commented 2 years ago

What should I set it to? I need overfitted embeddings to get 100% likeness with around 50 vectors, but sometimes I have to wait too long until it gets there. I put `base_learning_rate: 10.0e-01`.

rinongal commented 2 years ago

Can you tell me a bit more about what your setup is?

Do you care at all about the size of your files? You could use our `unfrozen_config` and just tune the embedding along with the model. This will probably overfit much faster.

If you do care about file size: in LDM I started with an LR of 1.0e-1 and only had to use 2k iterations (instead of 5k) with a single vector. It's not instant, but it seems to be a good spot for cutting down training time.

An extremely high LR might actually make it harder to converge, since the model will 'overshoot' the good spot.

Alternatively, if you have a big set of similar things and you need embeddings for all of them, you might be able to save some time by starting the optimization of one from the results of the other.

1blackbar commented 2 years ago

I want to do human face likeness, but it takes time, and my goal is to overfit with around 40 vectors as soon as possible. I start with the default settings in the yaml, but batch size is 1 and there is no accumulation of grad batches. I do train Dreambooth with the entire weights and prune to 2GB, but I prefer 40-60 vector embeddings to get that perfect likeness that I sometimes don't get with Dreambooth, despite its great stylisation. File size is important for me, which is why I think inversion is very useful, even though Dreambooth gets great results on free Colab in 20 minutes with 15 images. The file size of inversion (below 300kb for a lot of vectors) is a winner as well. For inversion I use about 6 images of the face with the same expression: 4 head shots and 2 face shots. I started one run with 1.0e-1 and 40 vectors; we will see how fast it gets the likeness. According to this chart, the best would be 0.1?

[image: learning rate comparison chart]

From https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/
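In config terms, the setup described above is roughly the following. This is a hedged sketch: the key names follow the layout of the repo's finetune yaml and may differ slightly between the LDM and stable-diffusion configs.

```yaml
model:
  base_learning_rate: 1.0e-01          # aggressive LR for fast overfitting
  params:
    personalization_config:
      params:
        num_vectors_per_token: 40      # many vectors for maximum likeness

data:
  params:
    batch_size: 1                      # batch size 1, as described above

lightning:
  trainer:
    accumulate_grad_batches: 1         # no gradient accumulation (a Lightning
                                       # trainer flag; it may need to be passed
                                       # on the command line instead)
```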

rinongal commented 2 years ago

Yep, it seems like 0.1 is indeed the sweet spot. You could try playing around in that region (e.g. maybe 0.2 would be better), but here you've got the quickest convergence while not being so noisy that it fails.

You might be able to go a bit higher with learning rate if you incorporate things like EMA (exponential moving average) for the learned embeddings, but it might be a bit complex to implement and I'm not sure it will make a big impact on your training times.
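For what it's worth, a minimal sketch of what EMA over the learned vectors could look like (a hypothetical helper, not part of the repo):

```python
import torch

class EmbeddingEMA:
    """Minimal sketch: keep an exponential moving average of the learned
    embedding vectors and save/use the averaged copy instead of the raw one."""

    def __init__(self, embedding: torch.Tensor, decay: float = 0.995):
        self.decay = decay
        self.shadow = embedding.detach().clone()

    @torch.no_grad()
    def update(self, embedding: torch.Tensor) -> None:
        # shadow <- decay * shadow + (1 - decay) * current
        self.shadow.mul_(self.decay).add_(embedding.detach(), alpha=1.0 - self.decay)


# Usage, assuming `emb` is the trainable (num_vectors, 768) embedding parameter:
emb = torch.nn.Parameter(torch.randn(40, 768))
ema = EmbeddingEMA(emb.data)

# inside the training loop, after each optimizer.step():
ema.update(emb.data)

# when saving the embedding, write out ema.shadow instead of emb.data
```

The shadow copy lags the raw vectors, which is what would let you push the per-step LR a bit higher without the saved embedding inheriting all the step-to-step noise.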

rinongal commented 2 years ago

Another thing that might help: change this line https://github.com/rinongal/textual_inversion/blob/0a950b482d2e8f215122805d4c5901bdb4a6947f/ldm/modules/embedding_manager.py#L59 to:

`get_embedding_for_tkn = partial(get_embedding_for_clip_token, embedder.transformer.text_model.embeddings.token_embedding)`

(i.e. add `.token_embedding` at the end of the second argument)
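In excerpt form, the change looks like this (the "before" line is reconstructed from the description above):

```python
# before: the whole CLIPTextEmbeddings module is passed, so looking up the
# initializer word also adds its positional embedding
get_embedding_for_tkn = partial(get_embedding_for_clip_token, embedder.transformer.text_model.embeddings)

# after: only the token_embedding table is passed, so the lookup returns the
# raw word embedding
get_embedding_for_tkn = partial(get_embedding_for_clip_token, embedder.transformer.text_model.embeddings.token_embedding)
```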

CodeExplode commented 2 years ago

Did you end up deciding on an answer for this, blackbar? The original CLIP embedding weights only vary by about 0.12 from the lowest to the highest value per weight, so it seems like a 0.1 learning rate would massively overshoot the distribution of where everything the model was trained on sits.
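One way to sanity-check that spread, as a hedged sketch assuming the openai/clip-vit-large-patch14 text encoder that the stable-diffusion version of the code uses (the exact figure will depend on the checkpoint):

```python
from transformers import CLIPTextModel

# Inspect how widely the pretrained CLIP token-embedding weights spread,
# per dimension across the vocabulary.
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").text_model
weights = text_model.embeddings.token_embedding.weight.detach()  # (vocab_size, hidden_dim)

per_dim_range = weights.max(dim=0).values - weights.min(dim=0).values
print(f"mean per-dimension range: {per_dim_range.mean().item():.3f}")
print(f"max  per-dimension range: {per_dim_range.max().item():.3f}")
```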

rinongal commented 2 years ago

They are trying to overfit, so overshooting the distribution is not a concern. They want to get to a point where the likeness is best as soon as possible, even at the cost of editability.

Wangt-CN commented 1 year ago

> human face likeness

Hi @1blackbar, may I ask how you computed the "likeness"? Is it the CLIP similarity described in the authors' paper?

Wangt-CN commented 1 year ago

> Another thing that might help: change this line
> https://github.com/rinongal/textual_inversion/blob/0a950b482d2e8f215122805d4c5901bdb4a6947f/ldm/modules/embedding_manager.py#L59
> to: `get_embedding_for_tkn = partial(get_embedding_for_clip_token, embedder.transformer.text_model.embeddings.token_embedding)` (i.e. add `.token_embedding` at the end of the second argument)

@rinongal Hi, author. Does this change remove the position embeddings? Why might it help with overfitting?

rinongal commented 1 year ago

That line is actually a 'bug fix' for the stable diffusion version of our code. If you don't remove the positional encoding, your initial embedding isn't actually the word embedding that matches the coarse class, but something shifted away by the positional embedding.

In general the starting position has minimal impact when you do a long training run, but if you're trying to overfit quickly and save iterations, you want to start from as close as possible to your target, and this fix helps with that.
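To make that concrete, here is a small hedged sketch (assuming the openai/clip-vit-large-patch14 text encoder used by the stable-diffusion version of the code) showing that the full embeddings module returns something offset from the plain word embedding:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").text_model

# token id(s) for a coarse-class initializer word, e.g. "face"
ids = tokenizer("face", add_special_tokens=False, return_tensors="pt").input_ids

with torch.no_grad():
    word_only = text_model.embeddings.token_embedding(ids)  # raw word embedding
    word_plus_pos = text_model.embeddings(ids)               # word + positional embedding

offset = (word_plus_pos - word_only).norm()
print(f"offset introduced by the positional embedding: {offset.item():.4f}")  # non-zero
```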