buaacyw opened 1 year ago
In the AUTOMATIC1111 webui they have a "fix faces" [ ] post-processing tick box. Maybe this will help.
Thanks for your reply! But sorry, I don't understand what the "fix faces" post-processing tick box means. Is this a text prompt related to the face?
Restore faces []
The result is not poor; it looks good. In the webui, if you use a higher resolution the face will look much better.
Thanks a lot! I will try this.
Thanks! Do you think my dataset is too small? Your human pose ControlNet was trained on a 400K-image dataset. By the way, I don't have enough computing resources to tune my hyperparameters, so could you please tell me the rough settings of your human pose training? Like learning rate, batch size, and whether SD was locked.
In your case you may need to try cropping the training samples into rectangles like 512*1024 or 512*768, with the humans occupying a larger part of the image. Note that SD can be trained at any resolution as long as each dimension is divisible by 64. Your batch size and parameters look OK to me, but consider renting better cloud machines.
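As a rough illustration of that crop, here is a minimal sketch assuming PIL; crop_to_sd_resolution is a made-up helper, and in practice you would crop around the detected person box rather than the center so the human fills more of the frame:

```python
# Minimal sketch: center-crop a sample to a portrait aspect ratio and resize
# so both dimensions are multiples of 64, which SD requires.
from PIL import Image

def crop_to_sd_resolution(img: Image.Image, target_w: int = 512, target_h: int = 768) -> Image.Image:
    assert target_w % 64 == 0 and target_h % 64 == 0, "SD needs dims divisible by 64"
    w, h = img.size
    target_aspect = target_w / target_h
    if w / h > target_aspect:
        # Image is too wide for the target: crop width around the center.
        new_w = int(h * target_aspect)
        left = (w - new_w) // 2
        img = img.crop((left, 0, left + new_w, h))
    else:
        # Image is too tall for the target: crop height around the center.
        new_h = int(w / target_aspect)
        top = (h - new_h) // 2
        img = img.crop((0, top, w, top + new_h))
    return img.resize((target_w, target_h), Image.LANCZOS)
```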
Thanks a lot! I will try. My goal is to query SD in the DeepFashion dataset domain, which is all model photos, so I thought adding a prompt like 'model' or 'fashion' would help, but it didn't. Am I adding the wrong prompt? I have also tried adding those negative prompts (longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality), but the result doesn't change a lot. Is this normal?
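(For reference, a subtle effect is normal: the negative prompt only replaces the unconditional branch of classifier-free guidance, so it steers samples away from those concepts rather than adding anything. A rough sketch of the pattern used in gradio_pose2image.py; the setup of model, ddim_sampler, control, shape, etc. is omitted:)

```python
# Sketch following gradio_pose2image.py: the positive prompt conditions the
# model, while the negative prompt is encoded as the *unconditional*
# conditioning used by classifier-free guidance.
cond = {"c_concat": [control],
        "c_crossattn": [model.get_learned_conditioning([prompt] * num_samples)]}
un_cond = {"c_concat": [control],
           "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
samples, _ = ddim_sampler.sample(ddim_steps, num_samples, shape, cond, verbose=False,
                                 unconditional_guidance_scale=scale,
                                 unconditional_conditioning=un_cond)
```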
I have tried it and the faces get better. But I failed to find the theory behind "restore face". Do you have any idea? And is it possible to use this function without the webUI?
It's possible to supersede creating a new training set just by using Stable Diffusion. If you get a great prompt you may be blown away by the results, e.g. "sexy american asian, jeans, white top, white background", then use ControlNet + the pose model and just grab an image from your training set to do style transfer. To get more fashion images in the region of your fashion net, use CLIP to describe a fashion image, then describe it in the prompt: yoga pants, black bra, etc. I think the fashion net is sub-par compared to SD (5 billion images). You can find amazing trained SD models here: https://civitai.com/
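A minimal sketch of the CLIP idea, assuming the transformers library; the candidate descriptor list and file name are made up for illustration:

```python
# Score a set of candidate fashion descriptors against an image with CLIP,
# then join the best matches into a prompt.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

candidates = ["yoga pants", "black bra", "white top", "denim jeans",
              "long-sleeve cotton shirt", "lapel neckline"]
image = Image.open("fashion_sample.jpg")  # hypothetical path

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)[0]

# Keep the top-3 descriptors and build a prompt from them.
top = [c for _, c in sorted(zip(probs.tolist(), candidates), reverse=True)[:3]]
prompt = ", ".join(top) + ", white background, fashion photo"
print(prompt)
```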
Thanks for sharing! I think it's really a good idea to use CLIP to get the best prompt. But I don't quite understand "grab an image from your training set to do style transfer". I think the "training set" refers to the fashion dataset, but why do we need to do style transfer?
By the way, where can I find more information about 'restore face'?
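(For reference on the theory: "Restore faces" in the webui runs a face-restoration network, GFPGAN or CodeFormer, over detected faces and pastes the result back, so it also works outside the webui. A minimal sketch assuming the standalone gfpgan package and a downloaded GFPGANv1.4.pth checkpoint; file names are hypothetical:)

```python
# Restore faces in a generated image with GFPGAN, outside the webui.
import cv2
from gfpgan import GFPGANer

restorer = GFPGANer(model_path="GFPGANv1.4.pth", upscale=1,
                    arch="clean", channel_multiplier=2)
img = cv2.imread("generated_sample.png")
# enhance() detects faces, restores them, and pastes them back into the image.
_, _, restored = restorer.enhance(img, has_aligned=False,
                                  only_center_face=False, paste_back=True)
cv2.imwrite("restored_sample.png", restored)
```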
not "styletransfer" - but pose transfer - just lock the image dimensions to match fashion on both target image and controlnet image - then dream up a great fashion prompt. https://www.youtube.com/watch?v=ZCJX5ZAk9SA&ab_channel=OlivioSarikas
@lllyasviel Hi! I have tried querying my model at a higher resolution. I trained at 256x512. When querying at a higher resolution like 512x1024 (in your repo's gradio_pose2image, not the webui, changing the resolution from 256 to 512), the results turned very bad (long bodies, multiple bodies). But when queried at 256x512, the results are nice. I don't understand why you said a higher resolution would give a better result.
not "styletransfer" - but pose transfer - just lock the image dimensions to match fashion on both target image and controlnet image - then dream up a great fashion prompt. https://www.youtube.com/watch?v=ZCJX5ZAk9SA&ab_channel=OlivioSarikas
Thanks! The pose transfer advice helps a lot.
Thanks for your great contribution! I'm trying to train a ControlNet conditioned on human pose (OpenPose 18 keypoints, as in your provided demo). My dataset is DeepFashion, which contains 10K full-body images with long text annotations like these:
"This guy wears a long-sleeve shirt with solid color patterns and a long trousers. The shirt is with cotton fabric. The neckline of the shirt is round. The trousers are with cotton fabric and solid color patterns." "The upper clothing has medium sleeves, cotton fabric and pure color patterns. It has a lapel neckline. The lower clothing is of long length. The fabric is cotton and it has solid color patterns."
During training, I condition the ControlNet on the text above and on pose maps like the ones below, obtained from the same OpenposeDetector used in gradio_pose2image.py.
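For concreteness, a minimal sketch of extracting the pose map the way gradio_pose2image.py does (run from the repo root; the image path is hypothetical):

```python
# Extract an 18-keypoint pose map with the repo's OpenposeDetector.
import cv2
from annotator.util import resize_image, HWC3
from annotator.openpose import OpenposeDetector

apply_openpose = OpenposeDetector()
# gradio feeds RGB arrays, so convert from cv2's BGR first.
img = cv2.cvtColor(cv2.imread("deepfashion_sample.jpg"), cv2.COLOR_BGR2RGB)
detected_map, _ = apply_openpose(resize_image(HWC3(img), 512))
cv2.imwrite("pose_condition.png", HWC3(detected_map))
```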
Starting from the original SD 2.1. Image size and pose image size are both 512x512.
Hyperparameters:
batch_size = 8
learning_rate = 1e-5
sd_locked = True
only_mid_control = False
accumulate_grad_batches = 2
precision = 32
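For reference, a rough sketch of how these settings map onto a tutorial-style training script in this repo (cf. tutorial_train_sd21.py); MyDataset here is a stand-in to be replaced with a DeepFashion dataset class:

```python
# Train a ControlNet on SD 2.1 with the hyperparameters listed above.
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from tutorial_dataset import MyDataset  # stand-in: swap for a DeepFashion dataset
from cldm.model import create_model, load_state_dict

model = create_model("./models/cldm_v21.yaml").cpu()
model.load_state_dict(load_state_dict("./models/control_sd21_ini.ckpt", location="cpu"))
model.learning_rate = 1e-5
model.sd_locked = True
model.only_mid_control = False

dataloader = DataLoader(MyDataset(), num_workers=0, batch_size=8, shuffle=True)
trainer = pl.Trainer(gpus=1, precision=32, accumulate_grad_batches=2)
trainer.fit(model, dataloader)
```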
After training 37 epochs, the result is still poor:
These results follow the conditioning pose and text nicely, but the quality of the face is not as good as in your human pose demo. I have tried using short text prompts by splitting the original text on "." and randomly choosing one part, but the results didn't get better. Could you please help me figure out the reason? Thanks!