hollowstrawberry / kohya-colab

Accessible Google Colab notebooks for Stable Diffusion Lora training, based on the work of kohya-ss and Linaqruf
GNU General Public License v3.0

Problems with Pony V6 diffusers model #148

Open gwhitez opened 1 month ago

gwhitez commented 1 month ago

When trying to train a LoRA with Pony V6, I get an error saying that the repository was not found.


Lunar2k21 commented 1 month ago

Same issue here. First I started getting NaNs at the end of training, so I went to restart it, only to be greeted by this.

dazaibsd commented 1 month ago

Same here! And pasting the link from Civitai gives me "CUDA out of memory".

Ariiio commented 1 month ago

you can use this link if you'd like to, I uploaded the model to huggingface :)

gwhitez commented 1 month ago

you can use this link if you'd like to, I uploaded the model to huggingface :)

thank you so much!!! 👍👍

AIuser0101 commented 1 month ago

How do you use the 'optional_custom_training_model_url:' field in this notebook?

I want to upload my own custom checkpoints. Can someone please point me to how I would set up a Hugging Face model page to do so? I somehow don't think just uploading the safetensors file would work. Thanks.

gwhitez commented 1 month ago

How do you use the 'optional_custom_training_model_url:' field in this notebook?

I want to upload my own custom checkpoints. Can someone please point me to how I would set up a Hugging Face model page to do so? I somehow don't think just uploading the safetensors file would work. Thanks.

In the custom URL field you place this link, or any other model that is in diffusers format. If you have Colab Pro, you can place a link to a Civitai model in safetensors format.
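To illustrate the distinction between the two kinds of link, here is a rough sketch that guesses which kind a URL is. The helper `guess_model_source` and its return labels are hypothetical, the rules are only heuristics, and this is not how the notebook itself decides. The example URLs are placeholders, not real models.

```python
from urllib.parse import urlparse


def guess_model_source(url: str) -> str:
    """Hypothetical helper: guess what kind of model a URL points to.

    Returns "single_file", "hf_repo", or "unknown". Heuristic only.
    """
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    path_parts = [part for part in parsed.path.split("/") if part]

    # A direct link to one checkpoint file, wherever it is hosted.
    if parsed.path.lower().endswith((".safetensors", ".ckpt")):
        return "single_file"

    # Assumption: Civitai download links serve a single checkpoint file.
    if host == "civitai.com" or host.endswith(".civitai.com"):
        return "single_file"

    # huggingface.co/<user>/<repo> with nothing after it looks like a repo page.
    # Whether that repo is really in diffusers format is not checked here.
    if host == "huggingface.co" and len(path_parts) == 2:
        return "hf_repo"

    return "unknown"


# Placeholder URLs, for illustration only.
print(guess_model_source("https://huggingface.co/example-user/example-model"))
print(guess_model_source("https://civitai.com/api/download/models/12345"))
```

A single-file checkpoint is the case described above as needing Colab Pro, so a check like this could at least warn before a run starts.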

LegendaryNoCon commented 1 month ago

How do you use the 'optional_custom_training_model_url:' field in this notebook? I want to upload my own custom checkpoints. Can someone please point me to how I would set up a Hugging Face model page to do so? I somehow don't think just uploading the safetensors file would work. Thanks.

In the custom URL field you place this link, or any other model that is in diffusers format. If you have Colab Pro, you can place a link to a Civitai model in safetensors format.

So I'm trying it right now, and literally all of my LoRA trainings are either stopping before they even get to the epochs part, or the ones that make it to the epochs part are sitting at 0% and not moving. I don't know enough about this stuff to tell whether I'm the one doing something wrong. Is it because I'm training 7 at once? I didn't think that would affect it, since it didn't when using this Colab's default Pony model, but again, I don't know enough to say for sure that it wouldn't affect custom URL models.

Edit: I managed to get it to work by starting each training one at a time and waiting for it to start, but now it ends the training after a single epoch is saved, even though I've set it to make more than one.

hollowstrawberry commented 1 month ago

you can use this link if you'd like to, I uploaded the model to huggingface :)

Thank you, I'll put this link in the colab if you don't mind. Sad that the previous one got deleted.

AIuser0101 commented 1 month ago

How do you use the 'optional_custom_training_model_url:' field in this notebook? I want to upload my own custom checkpoints. Can someone please point me to how I would set up a Hugging Face model page to do so? I somehow don't think just uploading the safetensors file would work. Thanks.

In the custom URL field you place this link, or any other model that is in diffusers format. If you have Colab Pro, you can place a link to a Civitai model in safetensors format.

Thank you, but I was wondering about models besides Pony V6 (V7 is coming out soon, from what I hear). For example, what if I wanted to train on the autismSDXL Pony model, or some other Pony or XL model?

Just pasting the Civitai download link doesn't work. If anyone knows, please let me know. Thanks.

P.S. Thanks for the working Pony V6 link, Ariiio! I think hollowstrawberry will probably update the notebook. I'm not sure how diffusers models work; I'm used to just linking the whole 6-7 GB safetensors model, but it looks like the model on the Hugging Face page is less than 1 GB. I was just curious how it pulls the data from the large 6-7 GB checkpoint.
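On the size question: a diffusers-format model is usually stored as several components in subfolders (for SDXL-based models, typically `unet`, `vae`, `text_encoder` and `text_encoder_2`, plus small config and tokenizer files), so the size of any one file is not the size of the whole model. Whether that explains the number seen on this particular page is only a guess. The sketch below adds up file sizes per top-level folder of a model that has already been downloaded; `size_by_component` and the example path are hypothetical names used for illustration.

```python
import os
from collections import defaultdict


def size_by_component(model_dir: str) -> dict:
    """Sum file sizes in bytes per top-level entry of a local model folder."""
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(model_dir):
        rel = os.path.relpath(root, model_dir)
        # Files directly inside model_dir are grouped under ".".
        component = "." if rel == "." else rel.split(os.sep)[0]
        for name in files:
            totals[component] += os.path.getsize(os.path.join(root, name))
    return dict(totals)


# Usage, assuming a model was downloaded to this hypothetical path:
# for component, size in size_by_component("/content/my_model").items():
#     print(f"{component}: {size / 1e9:.2f} GB")
```

Summing the values of the returned dictionary gives the total download size, which is the number to compare against a single-file checkpoint.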

gwhitez commented 1 month ago

How do you use the 'optional_custom_training_model_url:' field in this notebook? I want to upload my own custom checkpoints. Can someone please point me to how I would set up a Hugging Face model page to do so? I somehow don't think just uploading the safetensors file would work. Thanks.

In the custom URL field you place this link, or any other model that is in diffusers format. If you have Colab Pro, you can place a link to a Civitai model in safetensors format.

So I'm trying it right now, and literally all of my LoRA trainings are either stopping before they even get to the epochs part, or the ones that make it to the epochs part are sitting at 0% and not moving. I don't know enough about this stuff to tell whether I'm the one doing something wrong. Is it because I'm training 7 at once? I didn't think that would affect it, since it didn't when using this Colab's default Pony model, but again, I don't know enough to say for sure that it wouldn't affect custom URL models.

Edit: I managed to get it to work by starting each training one at a time and waiting for it to start, but now it ends the training after a single epoch is saved, even though I've set it to make more than one.

I'm not sure, but it could be the optimizer. It happened to me with AdamW8bit and I didn't get past the first epoch, but when using Prodigy it worked fine.

LegendaryNoCon commented 1 month ago

How do you use the 'optional_custom_training_model_url:' field in this notebook? I want to upload my own custom checkpoints. Can someone please point me to how I would set up a Hugging Face model page to do so? I somehow don't think just uploading the safetensors file would work. Thanks.

In the custom URL field you place this link, or any other model that is in diffusers format. If you have Colab Pro, you can place a link to a Civitai model in safetensors format.

So I'm trying it right now, and literally all of my LoRA trainings are either stopping before they even get to the epochs part, or the ones that make it to the epochs part are sitting at 0% and not moving. I don't know enough about this stuff to tell whether I'm the one doing something wrong. Is it because I'm training 7 at once? I didn't think that would affect it, since it didn't when using this Colab's default Pony model, but again, I don't know enough to say for sure that it wouldn't affect custom URL models.

Edit: I managed to get it to work by starting each training one at a time and waiting for it to start, but now it ends the training after a single epoch is saved, even though I've set it to make more than one.

I'm not sure, but it could be the optimizer. It happened to me with AdamW8bit and I didn't get past the first epoch, but when using Prodigy it worked fine.

I'm trying it right now and it's still doing more or less the same thing. I've started a training using Prodigy and it's been stuck at 12% for a long time. It just randomly starts and slows and stalls, and this never happened before, I've been training using this for the past couple weeks and this has never been a problem.

gwhitez commented 1 month ago

I have tried it, and with Prodigy it does train. Have you set the recommended Prodigy settings? That could be a solution.

LegendaryNoCon commented 1 month ago

I have tried it, and with Prodigy it does train. Have you set the recommended Prodigy settings? That could be a solution.

Yeah, I replaced the "args" with what's in the "recommended args for prodigy" and it DOES train, it's just going soooo much slower than it ever has before. And since I don't have Colab Premium, I can't exactly wait around for it to just sit and stall all the time.

dazaibsd commented 1 month ago

you can use this link if you'd like to, I uploaded the model to huggingface :)

Thank you, I'll put this link in the colab if you don't mind. Sad that the previous one got deleted.

It doesn't work properly with that link, it gets stuck at 2%.

gwhitez commented 1 month ago

you can use this link if you'd like to, I uploaded the model to huggingface :)

Thank you, I'll put this link in the colab if you don't mind. Sad that the previous one got deleted.

It doesn't work properly with that link, it gets stuck at 2%.

And with this one? It's the same model, but I cloned it, and I trained a LoRA with it before Hollow fixed the notebook.

LegendaryNoCon commented 1 month ago

you can use this link if you'd like to, I uploaded the model to huggingface :)

Thank you, I'll put this link in the colab if you don't mind. Sad that the previous one got deleted.

It doesn't work properly with that link, it gets stuck at 2%.

And with this one? It's the same model, but I cloned it, and I trained a LoRA with it before Hollow fixed the notebook.

I just tried it on that one and it got stuck for over half an hour before I manually stopped it. It got stuck at ten percent, and upon restarting with the fixed notebook's Pony model, it once again stuck at ten percent.

fumize commented 1 month ago

I used today's updated trainer, and this shows up:

"/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. warnings.warn("

gwhitez commented 1 month ago

Could it be that the model is not in fp16? I remember that the previous model weighed 5 GB, and this one weighs 10 GB, which could be why the training is slower.

hollowstrawberry commented 1 month ago

I used today's updated trainer, and this shows up:

"/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. warnings.warn("

This warning is not relevant, you can ignore it.
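If the warning clutters the log, one option is to filter it with Python's standard `warnings` module. This is only a sketch: the helper name `silence_reentrant_warning` is made up for illustration and is not part of the notebook, and the pattern assumes the message starts with the same words as the log line quoted above.

```python
import warnings


def silence_reentrant_warning() -> None:
    """Hide only the use_reentrant UserWarning; other warnings still show."""
    warnings.filterwarnings(
        "ignore",
        # The pattern is a regular expression matched against the start
        # of the warning message, so the dots are escaped.
        message=r"torch\.utils\.checkpoint: please pass in use_reentrant",
        category=UserWarning,
    )


# Call this before training starts.
silence_reentrant_warning()
```

Filtering on the message text rather than ignoring every `UserWarning` keeps other, possibly relevant, warnings visible.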

hollowstrawberry commented 1 month ago

Could it be that the model is not in fp16? I remember that the previous model weighed 5 GB, and this one weighs 10 GB, which could be why the training is slower.

Thank you. Indeed. A few minutes ago I updated the colab to use my own version of Pony V6 in diffusers fp16 format. That should fix the training stalling and crashing due to memory constraints.
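The 5 GB versus 10 GB difference is what you would expect from precision alone: fp32 stores each weight in 4 bytes and fp16 in 2, so the same model is roughly half the size in fp16. A small sketch of that arithmetic follows. The parameter count is a made-up round number chosen to match the sizes mentioned in this thread, not a measured count for Pony V6.

```python
# Bytes used to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def estimated_size_gb(num_params: int, dtype: str) -> float:
    """Rough weight-storage size in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Hypothetical parameter count, for illustration only.
num_params = 2_500_000_000

print(estimated_size_gb(num_params, "fp32"))  # 10.0
print(estimated_size_gb(num_params, "fp16"))  # 5.0
```

The same factor of two applies to the memory the weights occupy once loaded, which fits the stalls and out-of-memory errors reported above on the free Colab tier.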

gwhitez commented 1 month ago

It trains correctly now.


Ariiio commented 1 month ago

Could it be that the model is not in fp16? I remember that the previous model weighed 5 GB, and this one weighs 10 GB, which could be why the training is slower.

Damn, sorry. I didn't realize.

NazarlusMon commented 3 weeks ago

Same issue here. First I started getting NaNs at the end of training, so I went to restart it, only to be greeted by this.

At the beginning of May, I also had "NaNs", but they disappeared for a while... now they have appeared again.