thibaudart opened this issue 1 year ago
Thanks for these models! ControlNet results with my 1.5 models were awesome, but I have trained so many 2.1 embeddings I'd love to use with this.
I'm sure we could pool funds to train on some A100s, but the training data is the real problem. Can the data even be released, or are there legal issues?
Training seems fast and not really expensive. I hope there will be an answer; otherwise the other solution is to generate our own dataset (100-300k images, then use the scripts to get the scribble, openpose, depth… versions and train afterwards). It would work, but it would be more energy efficient if we had the original data. @hatlessman you can ping me on Twitter (@thibaudz)
Given the current complicated situation outside the research community, we refrain from disclosing more details about the data. Nevertheless, researchers may take a look at that dataset project everyone knows.
Thanks @lllyasviel for your reply.
Do you plan to train with SD 2.1?
If I can help with funds I'd be happy to help. I'm disappointed in current Open landscape.
@notrydo the first step is having a dataset for the training. If you have 100-300K quality, varied images (512x512), that would be useful. If not, we will need to find a prompt dataset and generate them (it takes roughly 24 hours to generate 40K images, so around 10 days to have the images; after that it will take a few days to BLIP-caption them and to generate the preprocessed versions).
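For anyone following that route, here is a minimal sketch of the caption-then-preprocess step. It assumes BLIP captioning via Hugging Face transformers, plain OpenCV Canny as a stand-in for the various detectors, and a source/target/prompt.json layout like the fill50k training tutorial; the paths and thresholds are illustrative, not official ControlNet tooling.

```python
# Hypothetical dataset-prep sketch: caption 512x512 target images with BLIP and
# write Canny "source" maps plus a prompt.json in a tutorial-style layout.
import json
from pathlib import Path

import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

root = Path("training/mydata")            # illustrative path, not from the repo
(root / "source").mkdir(parents=True, exist_ok=True)

with open(root / "prompt.json", "w") as f:
    for path in sorted((root / "target").glob("*.png")):
        # BLIP caption for the target image.
        image = Image.open(path).convert("RGB")
        inputs = processor(image, return_tensors="pt")
        caption = processor.decode(captioner.generate(**inputs)[0], skip_special_tokens=True)

        # Stand-in preprocessor: Canny edges (swap in scribble/openpose/depth detectors).
        gray = cv2.cvtColor(cv2.imread(str(path)), cv2.COLOR_BGR2GRAY)
        cv2.imwrite(str(root / "source" / path.name), cv2.Canny(gray, 100, 200))

        f.write(json.dumps({"source": f"source/{path.name}",
                            "target": f"target/{path.name}",
                            "prompt": caption}) + "\n")
```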
Any additional specifications?
If anyone knows where we can purchase such a dataset, please drop a link.
I am currently training a sketch-to-image model on Waifu Diffusion 1.5 (which uses SD 2.1 v-prediction). I made a dataset of 1 million sketch-image pairs, and I'm training with a 50% unconditional chance (as in the paper). Here are the results so far at 150k samples seen:
https://user-images.githubusercontent.com/18043686/220213337-21c349b1-c39b-4095-94df-f032ec3c3e0d.png
https://user-images.githubusercontent.com/18043686/220213345-61279016-3d6c-4220-8227-f013728b6004.png
https://user-images.githubusercontent.com/18043686/220213350-9cf593cc-d9fa-4777-92bc-3e70b0c0f909.png
https://user-images.githubusercontent.com/18043686/220213353-0780274f-2bdf-44fd-a3ff-3a76b6d8c0d8.png
Anime models need a larger batch size and a lower (or disabled) text-dropping rate because their tags are dense. Also, because of the sudden-convergence phenomenon, using 10x gradient accumulation to optimize 15k steps will be better than 150k steps.
See also updated last section of https://github.com/lllyasviel/ControlNet/blob/main/docs/train.md
Don’t know.
@lllyasviel if we paid you for the training, could you do it for 2.1?
@lllyasviel Yeah, I'm using as large a batch size as I can on this machine, which is 1x A40. Going to be switching to 4x A40s soon though. I'm able to fit a batch size of 18 at 512 resolution currently. I want to try 768 resolution for the next training run with the 4 GPUs, so I'm not sure what that will look like in terms of batch size.
I'll make the changes to the unconditional dropping. I might copy over the "partial dropout" code from Waifu Diffusion training, where we train with a variable percentage of the prompt (50% chance to keep anywhere from 0% to 100% of the tags, 50% chance to keep 100%), except maybe shifting the percentages so there's about a 30% chance of partial dropout (see the sketch after this comment).
Very interesting about the sudden-convergence phenomenon. I've noticed it with normal Waifu Diffusion 1.5 as well. I don't quite see how changing the gradient accumulation steps helps with this, though; could you explain that part further?
Would love to talk about this more with you; is there a better way of contacting you (email, Discord)?
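For reference, a minimal sketch of that "partial dropout" idea, assuming comma-separated tag prompts; the helper name and the 30% figure come from the comment above, not from actual Waifu Diffusion training code.

```python
import random

def partial_tag_dropout(prompt: str, partial_chance: float = 0.3) -> str:
    """With probability partial_chance, keep a random fraction (0-100%) of the
    comma-separated tags; otherwise return the full prompt unchanged."""
    if random.random() >= partial_chance:
        return prompt
    tags = [t.strip() for t in prompt.split(",") if t.strip()]
    # Choose how many tags to keep, then keep them in their original order.
    keep = set(random.sample(range(len(tags)), random.randint(0, len(tags))))
    return ", ".join(t for i, t in enumerate(tags) if i in keep)
```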
Because that "sudden converge" always happens, lets say "sudden converge" will happen at 3k step and our money can optimize 90k step, then we have two options: (1) train 3k steps, sudden converge, then train 87k steps. (2) 30x gradient accumulation, train 3k steps (90k real computation steps), then sudden converge.
In my experiments, (2) is usually better than (1). However, in real cases, perhaps you may need to balance the steps before and after the "sudden converge" on your own to find a balance. The training after "sudden converge" is also important.
@lllyasviel I see. Just curious, do you think it would make sense to try the same technique with a normal diffusion model? Would love to talk more about this, but I'm not sure a GitHub issue about training data is the best place lol. My Discord is salt#1111 if we could talk there, although since this is research I'm not sure if there is some requirement that it has to be discussed in public. Maybe a new thread under the GitHub discussions?
Just read your edit: do you mean that after the sudden convergence I should reduce my gradient accumulation steps?
No. The batch size should not be reduced under any circumstances. In addition, we should always remember that we are not training layers from scratch; we are optimizing some projections between existing layers. We are still fine-tuning an SD model. Any bad training practice that can make SD fine-tuning fail will also make ControlNet training fail. Feel free to open a discussion if necessary.
Just for the sake of reference: is this the correct approach for grad_acc? With N = 10: trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger], amp_backend='apex', accumulate_grad_batches=N)
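Roughly, yes: accumulate_grad_batches is the standard PyTorch Lightning knob for gradient accumulation. Below is a minimal sketch, assuming model, dataloader, and the logger callback are built as in tutorial_train.py (an assumption, not copied from the repo); the factor of 10 mirrors the 10x suggestion above.

```python
# Sketch only: model, dataloader and logger are assumed to be constructed exactly
# as in tutorial_train.py; the only change is accumulate_grad_batches.
import pytorch_lightning as pl

N = 10  # gradients from N micro-batches are summed before each optimizer step,
        # so the effective batch size becomes batch_size * N with no extra VRAM

trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger],
                     accumulate_grad_batches=N)
trainer.fit(model, dataloader)
```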
When modifying the batch size or gradient accumulation, should I modify the learning rate?
@lllyasviel can you share the hyperparameters you used for training, e.g. batch size, effective batch size, number of GPUs, number of worker nodes, learning rate, number of training steps, etc.? I saw in the paper you mentioned a learning rate of 1e-5 with the AdamW optimizer, but I'm not sure about the other hyperparameters. I'm especially interested in the effective batch size because it affects the accuracy of the gradients.
By effective batch size, I refer to this value: batch_size_per_gpu * n_GPUs_per_worker_node * n_worker_nodes * gradient_accumulation_steps
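For example (purely illustrative numbers): a per-GPU batch size of 4 on a single 8-GPU node with 8 gradient accumulation steps gives an effective batch size of 4 * 8 * 1 * 8 = 256.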
Can you share a bit more about what you mean by 50% unconditional chance?
@whydna it means that there is a 50% chance for the text prompt input to be dropped (set to an empty string) when training the model, so that only the control image (the sketch in this case) is used. This forces the model not to rely too much on the text and to try to generate the entire image from the control image alone.
@off99555 thanks for the explanation - makes sense.
Is this achieved by just omitting prompts for 50% of the data set in prompts.json? Or is there some param to do it in the training function?
It should be done dynamically in the code. Here is example code I found in another repository that drops the text prompt 10% of the time: https://github.com/timothybrooks/instruct-pix2pix/blob/0dffd1eeb02611c35088462d1df88714ce2b52f4/stable_diffusion/ldm/models/diffusion/ddpm_edit.py#L701-L707
I'm not sure where this piece of code would live in the ControlNet repo.
The code doesn't exist in the ControlNet repo; you have to write it yourself. Also, I talked to the author, and he said that 50% is too high for sketch; it should be more like 0-10%.
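For anyone writing it themselves, here is a minimal sketch of dynamic prompt dropping against the tutorial_dataset.py layout; the 10% rate follows the 0-10% suggestion above, and the surrounding dataset code is assumed rather than copied from the repo.

```python
import random

DROP_RATE = 0.1  # probability of replacing the caption with an empty string

def maybe_drop_prompt(prompt: str, drop_rate: float = DROP_RATE) -> str:
    """Classifier-free-guidance-style dropping: return "" with probability drop_rate."""
    return "" if random.random() < drop_rate else prompt

# Inside the dataset's __getitem__ (the tutorial dataset returns jpg/txt/hint),
# apply it to the caption before returning the item, e.g.:
#   return dict(jpg=target, txt=maybe_drop_prompt(prompt), hint=source)
```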
@lllyasviel could you please share the details of each feature extractor, such as the thresholds used by canny(), mlsd() and midas?
Could you share the hyperparameters you use? What learning rate and effective batch size?
First of all, thanks a lot for your work on this amazing tool!
Could you share the datasets used for training? With them we could do the training on SD 2.1!