hollowstrawberry / kohya-colab

Accessible Google Colab notebooks for Stable Diffusion Lora training, based on the work of kohya-ss and Linaqruf
GNU General Public License v3.0

LoRA trained with this comes out broken. #98

Closed ddddfre closed 6 months ago

ddddfre commented 6 months ago

The LoRA I trained with this comes out broken.

ddddfre commented 6 months ago

I checked and found that "loss=nan" keeps coming up.

hollowstrawberry commented 6 months ago

Try lowering both learning rates, something like 3e-4 unet and 3e-5 tenc
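
For reference, a minimal sketch of what that change could look like in the notebook's settings cell. The variable names below (unet_lr, text_encoder_lr) are assumptions, not taken from the actual colab, so match them to whatever the cell really calls them:

```python
# Hypothetical settings-cell values -- names are assumed, adjust to the real notebook.
# Lowering both learning rates is the workaround suggested above for loss=nan.
unet_lr = 3e-4          # lowered from the default reported later in this thread (5e-4)
text_encoder_lr = 3e-5  # lowered from the reported default (1e-4)
```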

ddddfre commented 6 months ago

I did what you said, but 'loss=nan' comes up again. I think the error starts at epoch 2.

dazaibsd commented 6 months ago

Same! I trained two LoRAs just now, and both showed the following error in SD Webui: 'NansException: A tensor with all NaNs was produced in Unet. This could be either because there's not enough precision to represent the picture, or because your video card does not support half type. Try setting the 'Upcast cross attention layer to float32' option in Settings > Stable Diffusion or using the --no-half command line argument to fix this. Use --disable-nan-check command line argument to disable this check.'

I tried the 'Upcast cross attention layer to float32' option and it didn't work.

I also tried with a LoRA that I had trained a week ago, and that one works perfectly.

ErichEisner commented 6 months ago

same here.

hollowstrawberry commented 6 months ago

I'll have to take a look. Maybe updating the kohya version broke it.

mintkoko commented 6 months ago

I also trained with the same settings as before, but the training results are not coming out right.

ShinobiiSpartan commented 6 months ago

The same thing is happening to me. I've tried changing the unet each time to see if that's the issue, but it always either still shows the nan message or the LoRA comes out as nothing but random artifacts.

ArmyOfPun1776 commented 6 months ago

I'll have to take a look. Maybe updating the kohya version broke it.

Absolutely love this resource! I've trained all my LoRa Models on it. I (and I'm sure many others) really appreciate your time and dedication to keeping the resource up and running. Keep up the great work! I, for one, will be waiting patiently for it to be fixed.

Just thought I'd throw you some words of encouragement. :)

ShinobiiSpartan commented 6 months ago

I'll have to take a look. Maybe updating the kohya version broke it.

Absolutely love this resource! I've trained all my LoRa Models on it. I (and I'm sure many others) really appreciate your time and dedication to keeping the resource up and running. Keep up the great work! I, for one, will be waiting patiently for it to be fixed.

Just thought I'd throw you some words of encouragement. :)

Perfectly said man. His work is so good and I too have managed to make so many loras with it.

DASDAWDDWADSADSA commented 6 months ago

Yeah, sorry, but how long do you think it will take to fix?

stadiffs commented 6 months ago

I was training several LoRAs, but it looks like something is wrong with the new notebook. I notice that the quality is very poor and the LoRA produces unreadable images. Can you tell me what is wrong? Is it something related to the trainer or perhaps the kohya repo? It's clear that something happened. (images attached)

Here are some images that I trained before (attached).

junwoochoi2 commented 6 months ago

network_alpha = 1

TanvirHafiz commented 6 months ago

network_alpha = 1

It did work, thanks. But I'm wondering, why is this so?

ErichEisner commented 6 months ago

Thanks for the help ... do we only change the network_alpha (from 8 to 1 for example), or also the network_dim? And does it do something to the quality of the LoRA?

junwoochoi2 commented 6 months ago

network_alpha = 1

It did work, thanks. But I'm wondering, why is this so?

I don't know. I just tried it and it worked.

junwoochoi2 commented 6 months ago

Thanks for the help ... do we only change the network_alpha (from 8 to 1 for example), or also the network_dim? And does it do something to the quality of the LoRA?

I'm not sure. I'd appreciate it if you could give it a try and let me know the result.

ErichEisner commented 6 months ago

a) It works b) no need to change network_dim c) quality seems to be good
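
To summarize the workaround as it stands at this point in the thread, here is a rough sketch of the two settings being discussed. The variable names are assumed to match the notebook's network_dim / network_alpha fields, and the dim value is only an example:

```python
# Workaround reported above -- not a real fix, just avoids loss=nan for some users.
network_dim = 16    # example rank; reports above say it does not need to change
network_alpha = 1   # lowered (e.g. from 8) to dodge the NaN, at some cost to quality
```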

ddddfre commented 6 months ago

I have the same problem even if I set alpha to 1.

ballenvironment commented 6 months ago

You should reference a specific github commit for kohya instead of main.

The recent changes they made broke this.
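
For anyone running their own copy of the notebook, a rough sketch of what pinning kohya's sd-scripts to a fixed commit could look like in a Colab cell. The commit hash below is a placeholder, not a known-good value, and the exact clone path depends on how the notebook is set up:

```python
import subprocess

# Placeholder -- substitute a commit hash that is known to work with the notebook.
KOHYA_COMMIT = "0000000000000000000000000000000000000000"

# Clone kohya-ss/sd-scripts and check out the pinned commit instead of whatever main is today.
subprocess.run(["git", "clone", "https://github.com/kohya-ss/sd-scripts.git"], check=True)
subprocess.run(["git", "-C", "sd-scripts", "checkout", KOHYA_COMMIT], check=True)
```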

Ischafak commented 6 months ago

Hi, if I change "network_alpha = 1" on the main branch, will it work?

@junwoochoi2 can you give me any recommendation on how I can make this work?

gwhitez commented 6 months ago

I tried network alpha at 1 and it works, but when I go past epoch 7 it burns and the images come out black, so I used 5 epochs and it worked correctly.

lucaswalkeryoung commented 6 months ago

So... what's the word everyone? Is the trainer still busted?

stadiffs commented 6 months ago

I used alpha 1, no changes, still have issues. Can anyone tell me the settings for anime and people? Thanks.

ArmyOfPun1776 commented 6 months ago

So... what's the word everyone? Is the trainer still busted?

As far as I know, yes. There seem to be some workarounds in place at the moment, but we haven't gotten official word from hollowstrawberry yet.

I'm sure he will let everyone know when it's fixed though. Just be patient. I think the latest Kohya_ss update wrecked it.

gwhitez commented 6 months ago

I used alpha 1, no changes, still have issues. Can anyone tell me the settings for anime and people? Thanks.

use 5 epochs, that's just how it works

ArmyOfPun1776 commented 6 months ago

use 5 epochs, that's just how it works

Right. So I tested this, and while you are correct (the trainer does indeed work at 5 epochs and a low alpha), I highly doubt that this is the intended behavior of the trainer from this point forward. Which, intentional or not, is what your comment makes it sound like.

The creator, @hollowstrawberry , has not said any such thing, and they have always been really good about interacting with us here. I'm sure they're hard at work, when they can be, to fix the issue.

Let's not forget that these creators have lives outside of their projects. They may have a family, an irl job, and/or normal life stuff going on.

If you want to use the trainer at its current capacity, that is your prerogative. But you didn't write, build, or post this project. Responding like you did above in such a "matter of fact" way undermines the creator's hard work and could turn new and current users, who don't know any better, away.

For those of you asking how to get the trainer to work: it is broken at the moment. As is evident from this issue thread still being OPEN, it has been broken for at least a couple of days.

Over the past 48 hours the trainer has gone from completely unusable to its current state (somewhat usable). This is progress. AGAIN, I'm 100% sure that they will update this thread and let us know when it is fixed.

They have expressed their appreciation for our use of their project by being kind enough not only to provide it to us free of charge, but also to maintain it. Let's not inundate them with the same comments over and over again. Just stay calm and generate with what you've got in the meantime.

Check your email for updates to this thread as often as you feel like. And for goodness' sake, read the previous comments before posting a new one. That's, quite literally, all the majority of us can do at this time.

Fin

Ischafak commented 6 months ago

I used 5 epochs and network_alpha 1 and it's still not working right, btw. The LoRA just gives me geometric symbols.

DASDAWDDWADSADSA commented 6 months ago

use 5 epochs, that's just how it works

Right. So I tested this, and while you are correct (the trainer does indeed work at 5 epochs and a low alpha), I highly doubt that this is the intended behavior of the trainer from this point forward. Which, intentional or not, is what your comment makes it sound like.

The creator, @hollowstrawberry , has not said any such thing, and they have always been really good about interacting with us here. I'm sure they're hard at work, when they can be, to fix the issue.

Let's not forget that these creators have lives outside of their projects. They may have a family, an irl job, and/or normal life stuff going on.

If you want to use the trainer at its current capacity, that is your prerogative. But you didn't write, build, or post this project. Responding like you did above in such a "matter of fact" way undermines the creator's hard work and could turn new and current users, who don't know any better, away.

For those of you asking how to get the trainer to work: it is broken at the moment. As is evident from this issue thread still being OPEN, it has been broken for at least a couple of days.

Over the past 48 hours the trainer has gone from completely unusable to its current state (somewhat usable). This is progress. AGAIN, I'm 100% sure that they will update this thread and let us know when it is fixed.

They have expressed their appreciation for our use of their project by being kind enough not only to provide it to us free of charge, but also to maintain it. Let's not inundate them with the same comments over and over again. Just stay calm and generate with what you've got in the meantime.

Check your email for updates to this thread as often as you feel like. And for goodness' sake, read the previous comments before posting a new one. That's, quite literally, all the majority of us can do at this time.

Fin

True, and it's not the devs' fault, it's the big companies.

ArmyOfPun1776 commented 6 months ago

I used 5 epochs and network_alpha 1 and it's still not working right, btw. The LoRA just gives me geometric symbols.

If you want to use the trainer at its current capacity, that is your prerogative.

For those of you asking how to get the trainer to work: it is broken at the moment. As is evident from this issue thread still being OPEN, it has been broken for at least a couple of days.

Over the past 48 hours the trainer has gone from completely unusable to its current state (somewhat usable). This is progress. AGAIN, I'm 100% sure that they will update this thread and let us know when it is fixed.

willian1986 commented 6 months ago

For me, setting "network_alpha = 1" fixed the issue.

ErichEisner commented 6 months ago

That's what I thought, too (see my post above)... but after testing the LoRA, I'd say that the quality is not as good as the quality of the LoRAs I trained before the problems here.

ArmyOfPun1776 commented 6 months ago

For me, setting "network_alpha = 1" fixed the issue.

By setting the network alpha to 1 you are not fixing the issue. You're just setting the alpha so low that you don't experience NaN loss. That's it.

Unless you are training a style or a super generic character, you're going to have a really poor quality LoRA at network alpha 1.

Network rank and network alpha play an important role in how the LoRA cooks. Network alpha should be at least half of the network rank (e.g. 32R-16A, 64R-32A, 128R-64A). Sacrificing either of them is not recommended in the slightest.

Just wait for the actual fix from the Dev/Creator.
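
One way to see why alpha 1 changes things so much: in kohya-style LoRA training the alpha acts as a scaling factor on the learned update, commonly scale = network_alpha / network_dim, so alpha 1 at rank 32 trains with far weaker updates than alpha 16 at the same rank (which is also why it can sidestep the NaN). A tiny illustration, assuming that convention:

```python
def lora_scale(network_dim: int, network_alpha: float) -> float:
    """Effective LoRA scaling factor under the common alpha/dim convention."""
    return network_alpha / network_dim

print(lora_scale(32, 16))  # 0.5     -- the "32R-16A" pairing recommended above
print(lora_scale(32, 1))   # 0.03125 -- the alpha-1 workaround, ~16x weaker updates
```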

ArmyOfPun1776 commented 6 months ago

I feel like I need to prove a point here. So here is an example for those telling people to use net alpha 1. These are "bare generations" (meaning , Activator).

Here are the first 3 bare generations with a LoRA trained at net rank 32, net alpha 1:

(images attached)

Here are the first 3 bare generations of my first LoRA ever. It is by far the worst quality LoRA I have ever made without trying. Net rank 32, net alpha 16:

(images attached)

DON'T TRAIN AT NET ALPHA 1.

hollowstrawberry commented 6 months ago

Hey everyone, I've gone back to a stable kohya version (I was using an XL fork previously, which broke it) and fixed the dependencies. It seems it works now. Let me know how it goes.

hollowstrawberry commented 6 months ago

Dammit I still get nan. I don't get it, we're back to the previous torch version and the previous kohya version...

githubnoot commented 6 months ago

Dammit I still get nan. I don't get it, we're back to the previous torch version and the previous kohya version...

I've been bouncing between previous and newer versions. I did train a LoRa today with success, then trained another a couple hours later and got the CalledProcessError / returned non-zero exit status 1 error again.

Currently training the same error'd LoRa but on a previous version and it's working, unsure of the visual result though since it's still going.

ArmyOfPun1776 commented 6 months ago

Dammit I still get nan. I don't get it, we're back to the previous torch version and the previous kohya version...

I just started messing around with Python about 2 months ago. Wish I knew enough to help. You'll figure it out though.

nothelloearth1 commented 6 months ago

Dammit I still get nan. I don't get it, we're back to the previous torch version and the previous kohya version...

Don't know if you have tested it but the colab default learning rate (5e-4 unet and 1e-4 tenc) will result in nan. Interestingly, though, these values can be used just fine if training locally. Still, I'm not a dev so I don't know if anything goes wrong anywhere.

TanvirHafiz commented 6 months ago

The author probably needs to go through it again, I guess. Everybody, wait for the problems to be resolved!

hollowstrawberry commented 6 months ago

Can someone confirm if the XL trainer is working correctly?

DASDAWDDWADSADSA commented 6 months ago

Did it stop working again?

gwhitez commented 6 months ago

Can someone confirm if the XL trainer is working correctly?

I am using it right now and so far it is training well.

TanvirHafiz commented 6 months ago

You know, the process ran fine for me, I got my 10 epochs. But the result of the LoRA is a black screen, so the code and everything is fine; it probably needs a settings adjustment. Strangely enough, network alpha = 1 works, so it's just settings. I guess if people tried various settings, someone is bound to hit the jackpot!

hollowstrawberry commented 6 months ago

The same settings worked before, so something must be wrong.

ArmyOfPun1776 commented 6 months ago

I've attempted to train multiple times today using different settings to see what the threshold might be. Hoping that it might give a hint as to what's happening. Alas, the only variable that seems to allow the training to continue to the end for me is lowering the Net Alpha to 1.

Lowering the learning rate got me farther with traditional settings but all resulted in Nan Loss after Epoch 2 or 3 out of 10.

Forgive me if I'm wrong, but isn't NaN loss usually associated with memory usage? Could there be some sort of memory leak issue? I watched for NaN specifically while training, and the interesting thing to me was that on the epoch before going into NaN loss, the loss was 0.08 to 0.1, which is great. But the next epoch is immediate NaN loss. It's just weird. I usually see gradual decreases before hitting NaN, but this seems to be instant. That's what's making me think memory leak. I'm probably wrong, but thought I'd mention it just in case.

Betamarrajr commented 6 months ago

The same settings worked before, so something must be wrong.

I've attempted to train multiple times today using different settings to see what the threshold might be. Hoping that it might give a hint as to what's happening. Alas, the only variable that seems to allow the training to continue to the end for me is lowering the Net Alpha to 1.

Lowering the learning rate got me farther with traditional settings but all resulted in Nan Loss after Epoch 2 or 3 out of 10.

Forgive me if I'm wrong, but isn't NaN loss usually associated with memory usage? Could there be some sort of memory leak issue? I watched for NaN specifically while training, and the interesting thing to me was that on the epoch before going into NaN loss, the loss was 0.08 to 0.1, which is great. But the next epoch is immediate NaN loss. It's just weird. I usually see gradual decreases before hitting NaN, but this seems to be instant. That's what's making me think memory leak. I'm probably wrong, but thought I'd mention it just in case.

I'm not a developer, so what I'm about to write could for sure be a dumb thing, but: flagging xformers as false, there will be no NaN steps in training, BUT the LoRA will still be broken :(

ArmyOfPun1776 commented 6 months ago

I'm not a developer, so what I'm about to write could for sure be a dumb thing, but: flagging xformers as false, there will be no NaN steps in training, BUT the LoRA will still be broken :(

I know what you're talking about. Just so you know though: you don't have to flag xformers. You can actually flag the NaN check itself. But doing this, in most cases, will not give different results when generating with the NaN-loss'd LoRA. Both of those command args also give a big hit to memory usage and performance, as you're effectively disabling optimizations while using them. That's my experience anyway.

DASDAWDDWADSADSA commented 6 months ago

The output is looking fine. (image attached)

nothelloearth1 commented 6 months ago

I've attempted to train multiple times today using different settings to see what the threshold might be. Hoping that it might give a hint as to what's happening. Alas, the only variable that seems to allow the training to continue to the end for me is lowering the Net Alpha to 1.

I can confirm that dim=alpha (and potentially alpha=1/2dim with low enough lr?) works just fine.

Lowering the learning rate got me farther with traditional settings but all resulted in Nan Loss after Epoch 2 or 3 out of 10.

1e-4 unet and 5e-5 text encoder works for me. I haven't tested yet but my hypothesis is anything higher than 3e-4 will result in nan.

Forgive me if I'm wrong, but isn't NaN loss usually associated with memory usage? Could there be some sort of memory leak issue? I watched for NaN specifically while training, and the interesting thing to me was that on the epoch before going into NaN loss, the loss was 0.08 to 0.1, which is great. But the next epoch is immediate NaN loss. It's just weird. I usually see gradual decreases before hitting NaN, but this seems to be instant. That's what's making me think memory leak. I'm probably wrong, but thought I'd mention it just in case.

No, not really. NaN is either because of the checkpoint/VAE itself, and/or the learning rate being too high.
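
Pulling together the values commenters report as working at this point, here is a sketch of one stopgap combination. All numbers come from comments above, none of this is an official recommendation, and the variable names are assumed to match the notebook:

```python
# Stopgap settings reported in this thread while the underlying bug is investigated.
network_dim = 16        # example rank
network_alpha = 16      # dim == alpha reported to work (alpha = dim/2 may also work with a low lr)
unet_lr = 1e-4          # rates above ~3e-4 reportedly produce nan on colab
text_encoder_lr = 5e-5
```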