tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
Apache License 2.0
4.5k stars 296 forks

some question about ip-adapter faceid training #255

Open Daming-TF opened 5 months ago

Daming-TF commented 5 months ago

I tried to train the SDXL IP-Adapter-FaceID on 4x A6000 GPUs with a per-GPU batch size of 4 and a learning rate of 1e-5, initializing from the pretrained model. I found that the generated images became abnormal by around step 7000. Do these settings look reasonable? Is the learning rate too high, or is the batch size too small?

1000 step result: [image]

7000 step result: [image]

xiaohu2015 commented 5 months ago

I first trained at 512x512 with lr 1e-4, then fine-tuned at 1024x1024.

Daming-TF commented 5 months ago

So do I understand correctly: in stage one you trained at 512 from scratch with batch size 64 and a constant lr of 1e-4, and in stage two you fine-tuned at 1024, also with batch size 64 and a constant lr of 1e-4, right?

Daming-TF commented 5 months ago

I don't know where the problem is. I adapted the SDXL training code from tutorial_train_faceid.py, and I've checked that the training parameters look fine. Since I'm fine-tuning from the pretrained model, I'd expect training not to collapse after a certain number of steps. Do you have any idea how this could happen?

xiaohu2015 commented 5 months ago

How many images did you use for training?

xiaohu2015 commented 5 months ago

> So do I understand correctly: in stage one you trained at 512 from scratch with batch size 64 and a constant lr of 1e-4, and in stage two you fine-tuned at 1024, also with batch size 64 and a constant lr of 1e-4, right?

512x512: batch size = 8×8, lr = 1e-4. 1024x1024: batch size = 8×4, lr = 1e-5.
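Read as a two-stage schedule, this could map onto the tutorial script roughly as follows. This is only a sketch: the `--resolution`, `--train_batch_size`, and `--learning_rate` flag names are assumed to match the repo's tutorial scripts, and all paths are placeholders.

```shell
# Stage 1 (assumed flags/paths): train at 512x512,
# per-GPU batch size 8 on 8 GPUs, lr 1e-4
accelerate launch --num_processes 8 tutorial_train_faceid.py \
  --resolution 512 --train_batch_size 8 --learning_rate 1e-4 \
  --pretrained_model_name_or_path <base_model> --output_dir stage1

# Stage 2 (assumed flags/paths): fine-tune at 1024x1024,
# per-GPU batch size 4 on 8 GPUs, lr 1e-5, starting from stage 1 weights
accelerate launch --num_processes 8 tutorial_train_faceid.py \
  --resolution 1024 --train_batch_size 4 --learning_rate 1e-5 \
  --pretrained_ip_adapter_path stage1/<checkpoint>.bin --output_dir stage2
```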

Daming-TF commented 5 months ago

> How many images did you use for training?

I used 40,000 images, with a total batch size of 16 and a learning rate of 1e-5. By 7,000 steps, each image had only been fed through the network about 3 to 4 times, yet the generated results were already abnormal.
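For reference, the number of passes over the data implied by those numbers can be checked with a quick calculation:

```python
# Rough epoch count for the run described above:
# 40,000 images, total batch size 16, 7,000 optimizer steps.
num_images = 40_000
total_batch_size = 16
steps = 7_000

epochs = steps * total_batch_size / num_images
print(epochs)  # 2.8 -> each image seen roughly 3 times
```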

xiaohu2015 commented 5 months ago

That's strange. Did it get worse as training progressed?

Daming-TF commented 5 months ago

> That's strange. Did it get worse as training progressed?

Is it because my batch size is too small or my learning rate is too high? But I tried batch size 64 with lr 1e-5 and it still got worse during training.

xiaohu2015 commented 5 months ago

> That's strange. Did it get worse as training progressed?
>
> Is it because my batch size is too small or my learning rate is too high? But I tried batch size 64 with lr 1e-5 and it still got worse during training.

A lower lr might help, but I think it may be some other problem.

Daming-TF commented 5 months ago

Thank you very much! I'll check it again.🤔

Daming-TF commented 5 months ago

Do you have any plans to release the SDXL FaceID training code? Also, I see you plan to write a blog about FaceID training strategies. When do you plan to release it?🙌

xiaohu2015 commented 5 months ago

> Do you have any plans to release the SDXL FaceID training code? Also, I see you plan to write a blog about FaceID training strategies. When do you plan to release it?🙌

this weekend

George0726 commented 5 months ago

@xiaohu2015 Thanks! Can't wait for your code! I can't get my FaceID version to train either.

xiaohu2015 commented 5 months ago

please refer to https://github.com/tencent-ailab/IP-Adapter/issues/266

Daming-TF commented 5 months ago

Hello, I set the batch size to 8×8 and lr to 1e-4 following the SD 1.5 training strategy mentioned in the wiki, but the training results still got worse. Is this related to the ID loss / GAN loss you mentioned?

Results at 2k steps: [image]

xiaohu2015 commented 5 months ago

> Hello, I set the batch size to 8×8 and lr to 1e-4 following the SD 1.5 training strategy mentioned in the wiki, but the training results still got worse. Is this related to the ID loss / GAN loss you mentioned?
>
> Results at 2k steps: [image]

In my experience, FaceID requires a relatively large training dataset.

Daming-TF commented 5 months ago

But I initialized from ip-adapter-faceid_sd15.bin and trained for 2k steps on 40,000 images, and performance still got worse, which is very strange.

xiaohu2015 commented 5 months ago

> But I initialized from ip-adapter-faceid_sd15.bin and trained for 2k steps on 40,000 images, and performance still got worse, which is very strange.

Try using FaceID Portrait?

As discussed in https://www.justinpinkney.com/blog/2024/face-mixer-diffusion/, an ID embedding is more likely to overfit when using a small dataset. (I used about 600k images; InstantID used about 30M.)

Daming-TF commented 5 months ago

You mean initializing training from the FaceID-Portrait model instead?

xiaohu2015 commented 5 months ago

> You mean initializing training from the FaceID-Portrait model instead?

yes

fkjkey commented 5 months ago

Excuse me, I need some advice.

I have a cat-face feature extractor, and I'm trying to train a "cat FaceID".

My train.json entry looks like this (the long embedding vector is abbreviated here):

```json
[
  {
    "image_file": "1.png",
    "text": "a white cat with brown eyes and a black nose",
    "id_embed_file": [-1.06619, 0.41454, -1.00672, -0.35640, -0.49904, "...", 0.15182, 0.64020]
  }
]
```
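One thing worth checking: the field is named `id_embed_file`, which suggests a file *path* rather than an inline vector; whether an inline list works depends entirely on your dataset-loading code. A hypothetical preprocessing sketch (the `embeds/` layout and the `.npy` format are my assumptions, not the repo's required format):

```python
import json
import os

import numpy as np

# Hypothetical sketch: save each extractor output to disk and reference
# it by path in train.json, instead of embedding the raw floats inline.
os.makedirs("embeds", exist_ok=True)
embedding = np.asarray([-1.06619, 0.41454, -1.00672], dtype=np.float32)  # stand-in vector
np.save("embeds/1.npy", embedding)

entries = [{
    "image_file": "1.png",
    "text": "a white cat with brown eyes and a black nose",
    "id_embed_file": "embeds/1.npy",
}]
with open("train.json", "w") as f:
    json.dump(entries, f, indent=2)
```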


```shell
accelerate launch tutorial_train_faceid.py \
  --pretrained_model_name_or_path '/home/moore_animateanyone/Moore-AnimateAnyone-master/pretrained_weights/reldogcat' \
  --pretrained_ip_adapter_path /home/re_sd/ip_adapter/ip-adapter-faceid_sd15.bin \
  --data_json_file /home/re_sd/ip_adapter/data/train.json \
  --data_root_path /home/re_sd/ip_adapter/data/images \
  --output_dir /home/re_sd/ip_adapter/data/weight \
  --mixed_precision fp16 \
  --image_encoder_path /home/lora_test/ip_adapter/clip_imgencoder \
  --save_steps 50
```

But when I ran inference, it didn't work. I used just one sample because I wanted to see the effect of overfitting.

[image]

The output does contain a cat, but I think that's only because there's a cat in my prompt.

Daming-TF commented 5 months ago

@fkjkey I have also tried overfitting on 5 images; when overfit, the result should match, or be very close to, the training samples you feed in.

fkjkey commented 5 months ago

@Daming-TF So I have to use a lot of data to see any effect, right?

Daming-TF commented 5 months ago

@fkjkey What I'm trying to say is: overfit on 1 or a few images. If you train with the prompt "1 cat", then running inference with your training reference image and "1 cat" should produce a result very similar to the ground truth. But the result you posted above is not like that, so there are two possibilities: either there's something wrong with the code, or there aren't enough training steps to overfit.

fkjkey commented 5 months ago

@Daming-TF Thank you for the suggestion. Could you tell me whether your data is cropped to show only the face?

Daming-TF commented 5 months ago

@fkjkey yes

fkjkey commented 5 months ago

@Daming-TF Hello, my data is cropped to show only the face, but now no matter how I write the prompt at inference, it can only produce a face. How did you solve this?

Daming-TF commented 5 months ago

@fkjkey Perhaps you can crop the original image at a fixed scale relative to the bbox size (e.g. so the face occupies about 2/3 of the crop), keeping some context around the face.
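One way to implement such a crop, sketched under the assumption that the face detector returns an (x0, y0, x1, y1) box; the helper name and default scale are illustrative, not from the repo:

```python
def expand_bbox(bbox, img_w, img_h, scale=1.5):
    """Enlarge a face bbox (x0, y0, x1, y1) by `scale` around its center,
    clamped to the image bounds, so the crop keeps context around the face."""
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = (x1 - x0) * scale / 2.0
    half_h = (y1 - y0) * scale / 2.0
    left = max(0, int(cx - half_w))
    top = max(0, int(cy - half_h))
    right = min(img_w, int(cx + half_w))
    bottom = min(img_h, int(cy + half_h))
    return (left, top, right, bottom)

# Usage with PIL: image.crop(expand_bbox(face_bbox, image.width, image.height))
print(expand_bbox((40, 40, 60, 60), 100, 100, scale=2.0))  # (30, 30, 70, 70)
```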