Seeking Advice on ControlNet training #5406

Closed lucasgblu closed 11 months ago

lucasgblu commented 1 year ago

Greetings,

I tried to train my own inpaint version of ControlNet on the COCO dataset several times, but found it hard to train well. Basically, I have 330k augmented samples derived from the COCO dataset; each sample has an image, a mask, and a caption. I have 8 A100 GPUs, and I train ControlNets for 30 to 90 epochs (some runs are well beyond the 50k steps the ControlNet paper recommends). After that, I select a checkpoint every 10 epochs, both manually and using the HPS (human preference score).

Here are some of my inpainting results. Please note that there are significantly more bad cases than good ones, which could be due to incorrect, insufficient, or overfitted training. [result images attached]

Would it be possible for anyone to provide some guidance on the following questions:

  1. How do I judge whether my ControlNet is overfitted, and what should the loss curve of a ControlNet look like? Throughout my whole training procedure the loss curve looks like this, bouncing up and down rather than strictly declining. [loss curve image attached]

  2. To inherently amplify my small dataset, with a 25% chance the mask is inverted (black and white parts swapped), and with a 5% chance the mask is replaced by an all-white mask, which forces the ControlNet to paint the whole area (a short sketch follows this list). Does feeding ControlNet-inpaint an all-black or all-white mask (i.e. draw nothing or draw everything) deteriorate the ControlNet?

  3. Are there any other feasible tricks I could use to boost the performance of my training?
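
A minimal sketch of the augmentation from question 2, assuming binary uint8 masks where 255 marks the area to repaint (the probabilities are the ones stated above):

import random
import numpy as np

def augment_mask(mask: np.ndarray) -> np.ndarray:
    r = random.random()
    if r < 0.25:
        return 255 - mask               # 25%: invert the mask (swap black and white)
    if r < 0.30:
        return np.full_like(mask, 255)  # 5%: all-white mask, i.e. repaint everything
    return mask                         # otherwise: keep the original mask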

I truly appreciate any assistance the community can provide.

geroldmeisinger commented 1 year ago

see my comment here https://github.com/lllyasviel/ControlNet-v1-1-nightly/issues/89#issuecomment-1763463584

please provide information on:

  1. how you generate the masks
  2. example of original image, control image+mask and generated image
  3. all the parameters used for training

Are there any other feasible tricks I could use to boost the performance of my training?

I get convergence on simpler CNs with 50000a32 samples, i.e. under 1 epoch. Why do you train for 30-90 epochs?

lucasgblu commented 1 year ago

Thanks for your reply @geroldmeisinger

  1. how you generate the masks

The COCO dataset itself has segmentation annotations for each image; we filter out the very big and very small masks and generate around 330k samples (one image may yield several samples with different mask areas).
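
For illustration, a rough sketch of that filtering step with pycocotools (the path and area thresholds are placeholders, not the exact values we used):

from pycocotools.coco import COCO

coco = COCO("annotations/instances_train2017.json")    # placeholder path
samples = []
for img_id in coco.getImgIds():
    info = coco.loadImgs(img_id)[0]
    img_area = info["height"] * info["width"]
    for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
        ratio = ann["area"] / img_area
        if 0.05 < ratio < 0.7:                          # drop very small and very big masks
            mask = coco.annToMask(ann) * 255            # HxW uint8 mask, 255 inside the object
            samples.append((info["file_name"], mask))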

  2. example of original image, control image+mask and generated image

Here are some examples. The caption is the original concise caption, e.g. "a boy is eating a cake", "a boy on the baseball court".

[example images attached]

  3. all the parameters used for training

I've finished reading your enlightening articles, and here are my detailed parameters, including some of the parameters you highlighted in your article.

If I missed anything, please remind me.

Why do you train for 30-90 epochs?

Actually, I am working on training my own inpaint version of ControlNet to repaint the background behind a commodity. I hope our model strictly inpaints inside the mask and does NOT add extra stuff along the boundary of the commodity. To do this, I use a 5-channel input as the input of the ControlNet.

Since it includes canny, I think it's better to train for a long time. It's true that this may cause overfitting and thus always produce monotonous backgrounds, even though we control the boundary.

Why don't we use the SD inpaint ControlNet + canny ControlNet? Because my UNet model works in pixel space, not latent space, so I have to train my own inpaint ControlNet. I did try training two individual ControlNets, inpaint and canny, but the generated quality of their combination is not as good as I expected. For example, it can show the over-saturation problem the Imagen paper describes, or a seesaw phenomenon where we can control the edges but the background is not OK.

Hope my reply helps.

geroldmeisinger commented 1 year ago

Okay, you clearly know more than me :) Did you run some simpler experiments, and did you get better results? Also see: https://clipdrop.co/replace-background

does NOT add extra stuff along the boundary of the commodity

Why is that so important? Can't you just composite the masked original on top of the generated image? Perfect boundaries, no changes.
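
For example, a minimal compositing sketch with PIL (the file names are placeholders):

from PIL import Image

original = Image.open("original.png").convert("RGB")
generated = Image.open("generated.png").convert("RGB")
mask = Image.open("object_mask.png").convert("L")   # white = keep the original pixels
# wherever the mask is white, take the original object; elsewhere keep the generated background
result = Image.composite(original, generated, mask)
result.save("composited.png")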

I hope our model strictly inpaints inside the mask

To be clear, you want to train an inpaint model to outpaint a background? So the background has effectively no relation to the original image? Why don't you use SAM, get the background and just generate any background you want?

I use a 5-channel input as the input of the ControlNet

Wow. I would really like to see the code for this.

seesaw phenomenon

could you elaborate on the seesaw phenomenon, please? I've never heard of it, but it sounds like the "meandering" phenomenon I noticed in edge drawing. A paper or link is enough.

Are you working for Pepsi Co?

lucasgblu commented 1 year ago

lol I don't work for Pepsi, but I like to drink it.

clipdrop's functionality is great and the generated results are good. But the cost is expensive 😢

Can't you just composite the masked original on top of the generated image?

I did try this before, but consider the example I showed: if you want to put a Pepsi on the table and you only generate the table, the model won't know where the Pepsi is. So the generated image may have the Pepsi above the table, floating in the air. In order to let the model know about the presence of the commodity, we need to inject the commodity image (as the masked image) into the model.

I hope our model strictly inpaints inside the mask

This is because most inpainting models aim to blend the inpainted area into the whole image as harmoniously as possible. To do this, in most cases there will be some extra stuff along the boundary. I hope to avoid this and get a clean boundary, like the one the SD canny ControlNet can produce.

what is a seesaw phenomenon?

The seesaw phenomenon means that the performance of one task is often improved by hurting the performance of some other task. In my case, I encounter exactly this when I use two individual ControlNets, i.e. canny and inpaint.

Also, I don't know how to distribute the weights between the two ControlNets: 0.4 vs. 0.6? 0.6 vs. 0.8? So one day I came up with the idea of the 5-channel-input ControlNet mentioned above, which concatenates the masked image, the mask, and the canny map channel-wise.
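
A hypothetical sketch of how such a 5-channel conditioning tensor could be assembled (a sketch, not the exact implementation used here; it assumes image tensors in [-1, 1] of shape (B, 3, H, W) and mask/canny maps of shape (B, 1, H, W)). The ControlNet's conditioning embedding then has to accept 5 input channels instead of the usual 3:

import torch

def build_control(image: torch.Tensor, mask: torch.Tensor, canny: torch.Tensor) -> torch.Tensor:
    # mask == 1 marks the region to repaint; zero it out to obtain the masked image
    masked_image = image * (mask < 0.5)
    # concatenate channel-wise: 3 (masked image) + 1 (mask) + 1 (canny) = 5 channels
    return torch.cat([masked_image, mask, canny], dim=1)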

geroldmeisinger commented 1 year ago

I did try this before, but consider the example I showed: if you want to put a [soda can] on the table and you only generate the table, the model won't know where the [soda can] is. So the generated image may have the [soda can] above the table, floating in the air.

I see. Would it make sense to estimate a 3D bounding box (similar to the openpose model) to guide the "perspective" and "surface" of the object? Like a depth map, but only for the object. I still think you can use canny or a segmentation map for post-processing.
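
A hypothetical sketch of the "depth map only for the object" idea, using the transformers depth-estimation pipeline (the model choice, file names, and mask convention are assumptions):

import numpy as np
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
depth = np.array(depth_estimator(Image.open("product.png"))["depth"])      # 8-bit depth map, input-sized
mask = np.array(Image.open("object_mask.png").convert("L")) > 127          # True inside the object
object_depth = np.where(mask, depth, 0).astype(np.uint8)                   # zero out the background
Image.fromarray(object_depth).save("object_depth.png")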

The seesaw phenomenon means that the performance of one task is often improved by hurting the performance of some other task.

I see. Take a look at the Composer and UniControl papers, and double control discussion here: https://github.com/lllyasviel/ControlNet/discussions/30 (I linked the papers there)

lol I don't work for Pepsi, but I like to drink it.

(To me it's just irritating to see a branded, commercial product in research but that may be a cultural thing. Are you working for Nike? Nevermind.)

lucasgblu commented 1 year ago

@geroldmeisinger

I think your suggestions are great. I DID consider using depth rather than canny, but it involves another depth-estimation model, so I lowered its priority (also, canny is very fast, so I can quickly build a dataset). I will try it in the near future, after I get my current ControlNet right.

So back to the issue, after reading all your articles, I think I should:

  1. try to find a reasonable checkpoint, maybe around where the sudden convergence happens. I think even 30 epochs may be too much and the ControlNet has already overfitted.
  2. try to use gradient accumulation steps? (maybe 4, so that the batch size is 4 × 16 × 4 = 256)
  3. according to @lllyasviel's statement below, retrain my model using a larger GAS after I find the step of the sudden convergence? Am I right?

    Because that "sudden converge" always happens, lets say "sudden converge" will happen at 3k step and our money can optimize 90k step, then we have two options: (1) train 3k steps, sudden converge, then train 87k steps. (2) 30x gradient accumulation, train 3k steps (90k real computation steps), then sudden converge.

geroldmeisinger commented 1 year ago

try to find a reasonable checkpoint, maybe around where the sudden convergence happens. I think even 30 epochs may be too much and the ControlNet has already overfitted.

I have no empirical data on this. All my models converged well before 1 epoch; I have yet to evaluate the effects of multi-epoch training. But I know some popular CNs were trained for multiple epochs (Thibaud, SargeZT), though I guess mostly to reach convergence with very high batch sizes. If you get convergence at epoch 3, another 27 epochs aren't going to make the difference.

try to use gradient accumulation steps? (maybe 4, so that the batch size is 4 × 16 × 4 = 256) according to @lllyasviel's statement below, retrain my model using a larger GAS after I find the step of the sudden convergence?

I don't know the exact difference between batch size and GAS, but the way I understand it, it's basically the same thing, and you will probably get diminishing returns from even higher effective batch sizes. Note it is also written: "But usually, if your logic batch size is already bigger than 256, then further extending the batch size is not very meaningful. In that case, perhaps a better idea is to train more steps." and "The batch size should not be reduced under any circumstances." [Ill23]

Which means: if you can afford it, compute-wise and time-wise, and have enough images, sure. But if you are still working on your concept, then once you get convergence, start evaluating and improve the concept. If you have enough VRAM, put everything into --train_batch_size, not GAS. I don't know how long 90 epochs take for you. I'm working on a 3060 with 12GB VRAM; one epoch for SD1 takes 15h and I have to be very thoughtful about it (if I had your setup, I would have solved cancer by tomorrow, purely by training CNs alone).
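
For reference, a back-of-the-envelope for the "logic batch size" being discussed (the numbers are illustrative, not the poster's exact configuration):

num_gpus = 8                       # e.g. 8x A100
train_batch_size = 8               # per-GPU batch size (--train_batch_size)
gradient_accumulation_steps = 4    # GAS
logic_batch_size = num_gpus * train_batch_size * gradient_accumulation_steps
print(logic_batch_size)            # 256 -- beyond this, [Ill23] suggests training more steps instead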

lucasgblu commented 1 year ago

90 epochs takes more than a week for me, but the performance is worse than at 30 epochs. I trained for this long just to find out whether I had reached the sudden convergence.

Now, after a pleasant conversation with you, I think I had already passed that point before 30 epochs. I will rewind and check the performance of the checkpoints before 30 epochs (I did not check them before).

I really appreciate your opinion and your kindness, thanks! @geroldmeisinger

geroldmeisinger commented 1 year ago

glad I was helpful. I think you should also run some simpler intermediate evaluations. When does canny alone converge with batch size 64, 128, 256? When does classic inpainting alone converge with a 0%, 25%, 50% mask? This should give you at least some hints about when to expect convergence with double control.

geroldmeisinger commented 1 year ago

@lucasgblu another thing I just realized: if you condition the CN to outpaint the background, then wouldn't using the caption of the original, masked-out content confuse the CN? You are effectively stating what is NOT there.

lucasgblu commented 1 year ago

@geroldmeisinger true, that's possible. I didn't change the prompt when I inverted the mask. The reason is simply that I don't know of any inpainting dataset with precise captions that only describe the masked area. Do you know one? I've read the SmartBrush paper by Adobe; they use the MSCOCO dataset, as I do, and use BLIP to caption the inpainted area. That would be time-consuming, since the dataset contains 330k samples.

geroldmeisinger commented 1 year ago

No, sorry

here's a baseline for all your inpaint models: https://github.com/lllyasviel/ControlNet/discussions/561

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

lvZic commented 9 months ago

@lucasgblu have you successfully completed the inpainting training now?

engrmusawarali commented 9 months ago

I completed the training

lucasgblu commented 9 months ago

@lvZic yes I did; some discoveries that may help:

  1. After careful checking, the monotonous backgrounds turned out to be mainly due to incorrect initialization of the model weights. I had used an old checkpoint of my ControlNet and continued training it while my base UNet had been updated. After switching to strictly copying the ControlNet encoder from the UNet encoder, the generated backgrounds are great (see the sketch after this list).
  2. After correct initialization, I did an ablation study on whether inverting the mask helps. The answer: not much, but it doesn't hurt either, so feel free to use this trick or not.
  3. Be careful and strict with your data. The dataset I use has lots of rubbish images; I spent a whole weekend labeling them and throwing them away.
  4. I chose to use a depth channel rather than a canny channel. It helps a little with spatial awareness.
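
For point 1, a minimal sketch of that initialization with diffusers, for a standard latent-space SD 1.5 setup (our model is pixel-space, so this is only illustrative):

from diffusers import ControlNetModel, UNet2DConditionModel

# copy the encoder of the *current* base UNet into a fresh ControlNet,
# instead of resuming from a ControlNet checkpoint trained against an older UNet
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
controlnet = ControlNetModel.from_unet(unet)
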
lvZic commented 9 months ago

Below are my recent generated inpaint images: "a cup of drink in the wild in spring, close-up" [images attached]

good job! I wonder whether the pretrained UNet model you used is sd-base 1.5 or sd-inpaint? And in my opinion, the ControlNet model's input channels are different from the input channels of the UNet model, so how did you initialize the weights that don't exist in the UNet model?

lucasgblu commented 9 months ago

@lvZic we use a pixel-space model (like Imagen), but sd-base 1.5 should work fine. If you use a ControlNet, you need to use sd-base 1.5; otherwise, sd-inpaint.

Yes, you are correct, the weights are different from the UNet's. However, they only differ in the first convolutional layer of the input block (since we add two additional channels). As mentioned in the GLIDE paper, section 4.3:

We modify the model architecture to have four additional input channels: a second set of RGB channels, and a mask channel. We initialize the corresponding input weights for these new channels to zero before fine-tuning. For the upsampling model, we always provide the full low-resolution image, but only provide the unmasked region of the high-resolution image

We follow the same method: the additional weights that do not exist in the UNet are initialized to zeros:

import torch as th

# `key` is the first conv of the input block; `origin_weight` is the weight copied from the UNet.
# The extra input channels that do not exist in the UNet are zero-initialized (GLIDE-style).
target_shape = list(model_state_dict[key].shape)
target_shape[1] = target_shape[1] - origin_weight.shape[1]
zeros = th.zeros(target_shape)
model_state_dict[key] = th.cat((origin_weight, zeros), dim=1)

lvZic commented 9 months ago

got it. By the way, how did you set the batch size, learning rate, and number of epochs in your experiments?