rinongal / textual_inversion


Model training effect is not good #71

Closed Lufffya closed 1 year ago

Lufffya commented 2 years ago

My training data: IMG_0855.HEIC, IMG_0856.HEIC, IMG_0857.HEIC, IMG_0858.HEIC, IMG_0859.HEIC
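(Side note for anyone reproducing this: PIL does not read .heic files out of the box, so if the data loader rejects them, a conversion pass along the lines of the sketch below may help. `training_data/` is a hypothetical folder name.)

```python
# Sketch: convert iPhone HEIC photos to PNG so the training data loader can
# read them. Requires pillow and pillow-heif (pip install pillow-heif).
from pathlib import Path
from PIL import Image
from pillow_heif import register_heif_opener

register_heif_opener()  # registers a HEIC decoder with PIL.Image.open()

for src in Path("training_data").glob("*.HEIC"):
    Image.open(src).convert("RGB").save(src.with_suffix(".png"))
```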

Logged output images:

--- Epoch 2 --- samples_scaled_gs-000500_e-000002_b-000000

--- Epoch 10 --- samples_scaled_gs-002500_e-000010_b-000000

--- Epoch 20 --- samples_scaled_gs-005000_e-000020_b-000000

--- Epoch 30 --- samples_scaled_gs-007500_e-000030_b-000000

Is there any way to make it better?

oppie85 commented 2 years ago

I believe I've read that input images from multiple angles are actually detrimental to the training process. I think you'll have more success if you take pictures from one angle but against different backdrops/surfaces.

In any case, I don't think we should expect miracles from Textual Inversion (for Stable Diffusion) right now; there's a lot of experimentation going on to find the optimal settings and get more accurate results. For some objects we may never get good results, because what Textual Inversion can produce is limited by what was in the original SD training data.

1blackbar commented 2 years ago

Well, the photos are quite bad; I couldn't make out the subject as an artist. Can you place it along straight isometric lines? Rotate the photos so the cat's face is at the top, like it's standing. I had a really hard time figuring this out as a human, so...

ThereforeGames commented 2 years ago

> The photos are quite bad; I couldn't make out the subject as an artist. Can you place it along straight isometric lines?

I think there's also some lens warping going on, which I've noticed can have a very detrimental effect on the likeness of human subjects (e.g., one warped photo of your subject and SD will try to make a completely different-looking person).

That said, @Lufffya, your results appear to be getting better after 30 epochs. The bottom-left picture is starting to look like a screen with a cat-like frame. Some of my finetuning experiments required 50 or 60 epochs before achieving a reasonable degree of fidelity.

Also, what's your init word and number of vectors?
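(For reference: both settings live in the finetune config. Below is a rough sketch of how to inspect or override them, assuming the layout of `configs/stable-diffusion/v1-finetune.yaml` in this repo; the exact key paths are my best guess, so adjust them to what your config actually contains.)

```python
# Rough sketch: read and tweak the embedding settings with OmegaConf (the
# config library this repo's YAML configs are built around). Paths and key
# names are assumptions, not verified against your checkout.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/stable-diffusion/v1-finetune.yaml")
emb = cfg.model.params.personalization_config.params

print(emb.initializer_words)      # the coarse class word(s) used to seed the embedding
print(emb.num_vectors_per_token)  # how many embedding vectors the new token gets

# A hard-to-describe object sometimes benefits from more vectors per token,
# at the cost of prompt editability:
emb.num_vectors_per_token = 2
OmegaConf.save(cfg, "configs/stable-diffusion/v1-finetune-2vec.yaml")
```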

Lufffya commented 2 years ago

> I believe I've read that input images from multiple angles are actually detrimental to the training process. I think you'll have more success if you take pictures from one angle but against different backdrops/surfaces.
>
> In any case, I don't think we should expect miracles from Textual Inversion (for Stable Diffusion) right now; there's a lot of experimentation going on to find the optimal settings and get more accurate results. For some objects we may never get good results, because what Textual Inversion can produce is limited by what was in the original SD training data.

I see. I also wondered whether it was limited by SD's training data. Later, I trained on a different common item and changed the photo backgrounds, and the results looked much better. But is it necessary to keep all the photos at the same angle? Maybe only three photos are needed. Thank you.

Lufffya commented 2 years ago

> Well, the photos are quite bad; I couldn't make out the subject as an artist. Can you place it along straight isometric lines? Rotate the photos so the cat's face is at the top, like it's standing. I had a really hard time figuring this out as a human, so...

Well, seen that way, I also find my photos are too bad. I'll take some new pictures as a training set and try to keep them at the same angle. Thank you.

Lufffya commented 2 years ago

> > The photos are quite bad; I couldn't make out the subject as an artist. Can you place it along straight isometric lines?
>
> I think there's also some lens warping going on, which I've noticed can have a very detrimental effect on the likeness of human subjects (e.g., one warped photo of your subject and SD will try to make a completely different-looking person).
>
> That said, @Lufffya, your results appear to be getting better after 30 epochs. The bottom-left picture is starting to look like a screen with a cat-like frame. Some of my finetuning experiments required 50 or 60 epochs before achieving a reasonable degree of fidelity.
>
> Also, what's your init word and number of vectors?

In fact, I trained for more than 200 epochs, but nothing was better than epoch 30. It seems to be a problem with this training set; I guess the SD model has never seen pictures like these. All parameters remain at their defaults, because I don't know how to adjust them. The init word is "tablet" (the full name is "LCD writing tablet", but it seems that multiple words cannot be set).

rinongal commented 2 years ago

@Lufffya As others have stated, I'd try to make sure the images are at roughly the same angle. Specifically, try to make sure the cat's head is facing up (as in your first image). Feeding the model images rotated by 90 degrees tends to cause a mess.
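A minimal orientation-normalizing pass might look like the sketch below (assuming PNG/JPEG copies of the photos in a hypothetical `training_data/` folder):

```python
# Sketch: make every training image "head up" on disk before training.
from pathlib import Path
from PIL import Image, ImageOps

for path in Path("training_data").glob("*.png"):
    img = ImageOps.exif_transpose(Image.open(path))  # apply the camera's EXIF rotation flag
    # For shots that are still sideways, rotate manually, e.g.:
    # img = img.rotate(-90, expand=True)
    img.save(path)
```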

Lufffya commented 2 years ago

> @Lufffya As others have stated, I'd try to make sure the images are at roughly the same angle. Specifically, try to make sure the cat's head is facing up (as in your first image). Feeding the model images rotated by 90 degrees tends to cause a mess.

Thanks for your reply, I'll try.

GucciFlipFlops1917 commented 2 years ago

To add on from my experience, it's a balancing act between supplying variation and getting coherent reconstructions. That balance involves the aforementioned elements of camera angle and background, as well as how many images you supply. In cases where the objects/styles are similar enough yet still distinct, adding more than 5 images can help.

rinongal commented 1 year ago

Closing due to lack of activity. Feel free to reopen if you still need help.