lllyasviel / ControlNet

Let us control diffusion models!
Apache License 2.0

Training datasets #93

Open thibaudart opened 1 year ago

thibaudart commented 1 year ago

First of all, Thanks a lot for your work on this amazing tool!

Could you share the datasets used for training? With them, we could run the training on SD 2.1!

sepro commented 1 year ago

Thanks for these models! The ControlNet results with my 1.5 models were awesome, but I have trained so many 2.1 embeddings that I'd love to use with this.

hatlessman commented 1 year ago

I'm sure we could pool funds to train on some A100s, but the training data is the real problem. Can the data even be released? Are there legal issues?

thibaudart commented 1 year ago

Training seems fast and not really expensive. I hope there will be an answer; the other option is to generate our own dataset (100-300k images, then use the scripts to produce the scribble, openpose, depth, etc. versions, and train afterwards). That would work, but it would be more energy efficient if we had the original data. @hatlessman you can ping me on Twitter (@thibaudz)

lllyasviel commented 1 year ago

Given the current complicated situation outside the research community, we refrain from disclosing more details about the data. Nevertheless, researchers may take a look at that dataset project everyone knows.

thibaudart commented 1 year ago

Thanks @lllyasviel for your reply.

Do you plan to train with SD 2.1?

notrydo commented 1 year ago

If I can help with funds, I'd be happy to. I'm disappointed in the current open-source landscape.

thibaudart commented 1 year ago

@notrydo the first step is having a dataset for the training. If you have 100-300K quality, varied images (512x512), that could be useful. If not, we will need to find a prompt dataset and generate them (it takes roughly 24 hours to generate 40K images, so around 10 days for the images; after that it will take a few days to BLIP-caption them and generate the preprocessed versions).
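For reference, a minimal sketch of how such a generated set could be assembled into the `prompt.json` layout described in docs/train.md; the directory names and the sidecar caption files are assumptions for illustration, not the actual pipeline:

```python
import json
from pathlib import Path

# Layout assumed from docs/train.md: training/my_dataset/{source,target}/*.png
# plus a prompt.json with one {"source", "target", "prompt"} record per line.
# BLIP captions are assumed to sit in a captions/ folder as <name>.txt files.
root = Path("training/my_dataset")

with open(root / "prompt.json", "w") as f:
    for target in sorted((root / "target").glob("*.png")):
        caption = (root / "captions" / f"{target.stem}.txt").read_text().strip()
        record = {
            "source": f"source/{target.name}",  # preprocessed map (scribble/depth/...)
            "target": f"target/{target.name}",  # the generated 512x512 image
            "prompt": caption,
        }
        f.write(json.dumps(record) + "\n")
```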

notrydo commented 1 year ago

Any additional specifications?

notrydo commented 1 year ago

If it's known where we can purchase it, drop a link.

sALTaccount commented 1 year ago

I am currently training a sketch-to-image model on Waifu Diffusion 1.5 (which uses SD 2.1 v-prediction). I made a dataset of 1 million sketch-image pairs, and I'm training with a 50% unconditional chance (like in the paper). Here are the results so far at 150k samples seen: [example images]

lllyasviel commented 1 year ago

> I am currently training a sketch-to-image model on Waifu Diffusion 1.5 (which uses SD 2.1 v-prediction). […] Here are the results so far at 150k samples seen: [example images]

Anime models need a larger batch size and a lower (or disabled) text-dropping rate because their tags are dense. Also, because of the sudden convergence phenomenon, using 10x gradient accumulation to optimize 15k steps will work better than 150k plain steps.

lllyasviel commented 1 year ago

> I am currently training a sketch-to-image model on Waifu Diffusion 1.5 (which uses SD 2.1 v-prediction). […] Here are the results so far at 150k samples seen: [example images]

See also the updated last section of https://github.com/lllyasviel/ControlNet/blob/main/docs/train.md

thibaudart commented 1 year ago

Don’t know.

thibaudart commented 1 year ago

@lllyasviel if we paid you for the training, could you do it for 2.1?

sALTaccount commented 1 year ago

> Anime models need a larger batch size and a lower (or disabled) text-dropping rate because their tags are dense. Also, because of the sudden convergence phenomenon, using 10x gradient accumulation to optimize 15k steps will work better than 150k plain steps.

@lllyasviel Yeah, I'm using as large a batch size as I can on this machine, which is 1x A40. I'll be switching to 4x A40s soon, though. I'm able to fit a batch size of 18 at 512 resolution currently. I want to try 768 resolution for the next training run with the 4 GPUs, so I'm not sure what that will look like in terms of batch size.

I'll make the changes to the unconditional dropping. I might copy over the "partial dropout" code from Waifu Diffusion training, where we train with a variable percentage of the prompt (50% chance to keep anywhere from 0% to 100% of the tags, 50% chance to keep 100%), except perhaps shifting the percentages to something like a 30% chance of partial dropout. A rough sketch of the scheme is below.
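A minimal sketch of that partial-dropout idea, assuming comma-separated booru-style tag strings; the probabilities and the function name are illustrative, not the actual Waifu Diffusion code:

```python
import random

def partial_tag_dropout(prompt: str, partial_prob: float = 0.5) -> str:
    """With probability `partial_prob`, keep only a random fraction of the
    comma-separated tags; otherwise keep the full prompt."""
    tags = [t.strip() for t in prompt.split(",") if t.strip()]
    if not tags or random.random() >= partial_prob:
        return prompt                      # keep 100% of the tags
    keep_fraction = random.random()        # anywhere from 0% to 100%
    kept = [t for t in tags if random.random() < keep_fraction]
    return ", ".join(kept)                 # may be empty (fully unconditional)
```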

Very interesting about the sudden convergence phenomenon. I've noticed it with normal Waifu Diffusion 1.5 as well. I don't quite see how changing the gradient accumulation steps helps with this, though; could you explain that part further?

Would love to talk about this more with you; is there a better way of contacting you (email, Discord)?

lllyasviel commented 1 year ago

> I don't quite see how changing the gradient accumulation steps helps with this, though; could you explain that part further?

Because that "sudden converge" always happens. Let's say the sudden converge will happen at 3k steps and our budget can optimize 90k steps; then we have two options: (1) train 3k steps, hit the sudden converge, then train another 87k steps; or (2) use 30x gradient accumulation, train 3k optimizer steps (90k real computation steps), and then hit the sudden converge.

In my experiments, (2) is usually better than (1). However, in real cases, you may need to balance the steps before and after the sudden converge on your own to find a good trade-off. The training after the sudden converge is also important.

sALTaccount commented 1 year ago

@lllyasviel I see. Just curious, do you think it would make sense to try the same technique with a normal diffusion model? Would love to talk more about this, but I'm not sure a GitHub issue about training data is the best place, lol. My Discord is salt#1111 if we could talk there, although since this is research I'm not sure if there is some requirement that the discussion has to be public. Maybe a new thread under the GitHub discussions?

Just read your edit; do you mean that after the "sudden converge", I should reduce my gradient accumulation steps?

lllyasviel commented 1 year ago

No. The batch size should not be reduced under any circumstances. In addition, we should always remember that we are not training layers from scratch; we are optimizing some projections between existing layers. We are still fine-tuning an SD model. Any bad training practice that can fail SD fine-tuning will fail ControlNet training. Feel free to open a discussion if necessary.

batrlatom commented 1 year ago

Just for the sake of reference ... is this the correct approach for grad_acc?

```python
N = 10
trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger],
                     amp_backend='apex', accumulate_grad_batches=N)
```
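For context, `accumulate_grad_batches` is PyTorch Lightning's built-in gradient accumulation option. A minimal sketch of how it could be slotted into the repo's tutorial_train.py (the checkpoint path and hyperparameters below are placeholders, not necessarily the maintainer's settings):

```python
from share import *
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from tutorial_dataset import MyDataset
from cldm.logger import ImageLogger
from cldm.model import create_model, load_state_dict

resume_path = './models/control_sd15_ini.ckpt'  # placeholder checkpoint
N = 10  # gradient accumulation steps

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict(resume_path, location='cpu'))
model.learning_rate = 1e-5
model.sd_locked = True
model.only_mid_control = False

dataloader = DataLoader(MyDataset(), num_workers=0, batch_size=4, shuffle=True)
logger = ImageLogger(batch_frequency=300)
trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger],
                     accumulate_grad_batches=N)  # effective batch = 4 * N
trainer.fit(model, dataloader)
```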

ousinkou commented 1 year ago

When modifying the batch size or gradient accumulation, should I modify the learning rate?

offchan42 commented 1 year ago

@lllyasviel can you share the hyperparameters you used for training, e.g. batch size, effective batch size, number of GPUs, number of worker nodes, learning rate, number of training steps, etc.? I saw in the paper that you mentioned a learning rate of 1e-5 with the AdamW optimizer, but I'm not sure about the other hyperparameters. I'm especially interested in the effective batch size because it affects the accuracy of the gradients.

By effective batch size, I refer to the value `batch_size_per_gpu * n_GPUs_per_worker_node * n_worker_nodes * gradient_accumulation_steps`.
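As a concrete (made-up) example of that formula:

```python
# Illustrative numbers only, not the actual training configuration.
batch_size_per_gpu = 8
n_gpus_per_worker_node = 8
n_worker_nodes = 1
gradient_accumulation_steps = 4

effective_batch_size = (batch_size_per_gpu * n_gpus_per_worker_node
                        * n_worker_nodes * gradient_accumulation_steps)
print(effective_batch_size)  # 8 * 8 * 1 * 4 = 256
```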

whydna commented 1 year ago

> I am currently training a sketch-to-image model on Waifu Diffusion 1.5 […] I'm training with a 50% unconditional chance (like in the paper). […]

Can you share a bit more about what you mean by 50% unconditional chance?

offchan42 commented 1 year ago

@whydna It means that there is a 50% chance for the text prompt to be dropped (set to an empty string) during training, so that only the control image (the sketch in this case) is used. This forces the model not to rely too much on the text and to learn to generate the entire image from the control image alone.

whydna commented 1 year ago

@off99555 thanks for the explanation - makes sense.

Is this achieved by just omitting prompts for 50% of the dataset in prompts.json? Or is there some parameter to do it in the training function?

offchan42 commented 1 year ago

> Is this achieved by just omitting prompts for 50% of the dataset in prompts.json? Or is there some parameter to do it in the training function?

It should be done dynamically in the code. Here is example code I found in another repository that drops the text prompt 10% of the time: https://github.com/timothybrooks/instruct-pix2pix/blob/0dffd1eeb02611c35088462d1df88714ce2b52f4/stable_diffusion/ldm/models/diffusion/ddpm_edit.py#L701-L707

I'm not sure where this piece of code exists in the ControlNet repo.
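If you wanted to roll it yourself, a minimal sketch could look like the following, assuming the example dataset from the repo's tutorial_dataset.py (which yields `dict(jpg=..., txt=..., hint=...)`); the wrapper class and the 0.5 rate are illustrative, not part of the repo:

```python
import random
from tutorial_dataset import MyDataset  # the repo's example dataset

class PromptDroppingDataset(MyDataset):
    """Wraps the tutorial dataset and blanks the caption with probability
    `drop_rate`, so the model sometimes sees only the control image."""
    def __init__(self, drop_rate=0.5):
        super().__init__()
        self.drop_rate = drop_rate

    def __getitem__(self, idx):
        item = super().__getitem__(idx)  # dict with 'jpg', 'txt', 'hint'
        if random.random() < self.drop_rate:
            item['txt'] = ''  # unconditional sample: empty text prompt
        return item
```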

sALTaccount commented 1 year ago

The code doesn't exist in the ControlNet repo; you have to write it yourself. Also, I talked to the author and he said that 50% is too high for sketch; it should be more like 0-10%.

lilisierrayu commented 1 year ago

> Given the current complicated situation outside the research community, we refrain from disclosing more details about the data. Nevertheless, researchers may take a look at that dataset project everyone knows.

@lllyasviel could you please share the details of each feature extractor, such as the thresholds used by canny(), mlsd(), and midas?

Luccadoremi commented 1 year ago

> I am currently training a sketch-to-image model on Waifu Diffusion 1.5 […]
>
> See also the updated last section of https://github.com/lllyasviel/ControlNet/blob/main/docs/train.md

Could you share the hyperparameters you use? What is the learning rate and the effective batch size?