Closed ddddfre closed 6 months ago
I checked and found that "loss=nan" keeps coming up.
Try lowering both learning rates, something like 3e-4 unet and 3e-5 tenc
I did what you said, but 'loss=nan' comes up again. I think there's an error from epoch 2.
Same! I trained two LoRAs just now, and both showed the following error in SD Webui: 'NansException: A tensor with all NaNs was produced in Unet. This could be either because there's not enough precision to represent the picture, or because your video card does not support half type. Try setting the 'Upcast cross attention layer to float32' option in Settings > Stable Diffusion or using the --no-half command line argument to fix this. Use --disable-nan-check command line argument to disable this check.'
I tried the 'Upcast cross attention layer to float32' option and it didn't work.
I also tried with a LoRA that I had trained a week ago, and that one works perfectly.
same here.
I'll have to take a look. Maybe updating the kohya version broke it.
I also trained with the same settings as before, but the training results are not coming out right.
The same thing is happening to me. I've tried changing the unet each time to see if that's the issue, but it either still shows the NaN message or the LoRA comes out as nothing but random artifacts.
Absolutely love this resource! I've trained all my LoRA models on it. I (and I'm sure many others) really appreciate your time and dedication to keeping the resource up and running. Keep up the great work! I, for one, will be waiting patiently for it to be fixed.
Just thought I'd throw you some words of encouragement. :)
Perfectly said man. His work is so good and I too have managed to make so many loras with it.
Yeah, sorry, but how long do you think it will take to fix?
I was training several LoRAs, but it looks like something is wrong with the new notebook. I noticed the quality is very poor; the LoRA produces unreadable images. Can you tell me what is wrong? Is it something related to the trainer, or perhaps the kohya repo? It's clear that something happened.
Here are some images that I trained before:
network_alpha = 1
It did work, thanks. But I'm wondering why that is?
Thanks for the help ... do we only change the network_alpha (from 8 to 1, for example), or also the network_dim? And does it affect the quality of the LoRA?
I don't know. I just tried it and it worked.
I'm not sure. I'd appreciate it if you could give it a try and let me know the result.
a) It works b) no need to change network_dim c) quality seems to be good
I have the same problem even if I set alpha to 1.
You should reference a specific github commit for kohya instead of main.
The recent changes they made broke this.
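Pinning kohya to a specific commit instead of main, as suggested above, is usually just a clone plus a checkout in the notebook. A minimal sketch of building those commands in Python (the commit hash here is a placeholder, not a verified known-good revision):

```python
# Sketch: build the git commands needed to pin sd-scripts to a fixed commit.
# KNOWN_GOOD is a PLACEHOLDER, not a real verified revision.
KNOWN_GOOD = "<known-good-commit-hash>"
REPO = "https://github.com/kohya-ss/sd-scripts"

def pin_commands(repo: str, commit: str) -> list[list[str]]:
    """Commands to clone a repo and then check out a specific commit."""
    return [
        ["git", "clone", repo, "sd-scripts"],
        ["git", "-C", "sd-scripts", "checkout", commit],
    ]

for cmd in pin_commands(REPO, KNOWN_GOOD):
    print(" ".join(cmd))
```

These could be run with `subprocess.run` or as `!git ...` cells in colab; the point is just that the checkout targets a fixed hash rather than whatever main currently is.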
Hi, if I change "network_alpha = 1" on the main branch, will it work?
@junwoochoi2 can you give me any recommendation on how I can make this work?
I tried net alpha at 1 and it works, but when I go past epoch 7 it burns and the images are black, so I used 5 epochs and it worked correctly.
So... what's the word everyone? Is the trainer still busted?
I used alpha 1 with no other changes and still have issues. Can anyone tell me the settings for anime and people? Thanks.
As far as I know, yes. There seem to be some workarounds in place atm, but we haven't gotten official word from HollowStrawberry yet.
I'm sure he will let everyone know when it's fixed though. Just be patient. I think the latest Kohya_ss update wrecked it.
use 5 epochs, that's just how it works
Right, so I tested this, and while you are correct (the trainer does indeed work at 5 epochs and a low alpha), I highly doubt that this is the intended behavior of the trainer from this point forward. Which, intentional or not, is what your comment makes it sound like.
The creator, @hollowstrawberry, has not said any such thing, and they have always been really good about interacting with us here. I'm sure they're hard at work, when they can be, to fix the issue.
Let's not forget that these creators have lives outside of their projects. They may have a family, an irl job, and/or normal life stuff going on.
If you want to use the trainer at its current capacity, that is your prerogative. But you didn't write, build, or post this project. Responding like you did above in such a "matter of fact" way undermines the creator's hard work and could turn new and current users, who don't know any better, away.
For those of you asking how to get the trainer to work: it is broken at the moment. As is evident from this issue thread still being OPEN, it has been broken for at least a couple of days.
Over the past 48 hours the trainer has gone from completely unusable to its current state (somewhat usable). This is progress. AGAIN, I'm 100% sure that they will update this thread and let us know when it is fixed.
They have expressed their appreciation for our use of their project by being kind enough to not only provide it to us free of charge, but also to maintain it. Let's not inundate them with the same comments over and over again. Just stay calm and generate with what you've got in the meantime.
Check your email for updates to this thread as often as you feel like it. And for goodness' sake: read the previous comments before posting a new one. That's, quite literally, all the majority of us can do at this time.
Fin
I used 5 epochs and network_alpha 1 and it's still not working right, btw. The LoRA just gives me geometric symbols.
True, and it's not the dev's fault, it's the big companies.
For me, setting "network_alpha = 1" fixed the issue.
That's what I thought, too (see my post above) ... but after testing the LoRA, I'd say the quality is not as good as that of the LoRAs I trained before the problems here.
By setting the network alpha to 1 you are not fixing the issue. You're just setting the alpha so low that you don't experience NaN loss. That's it.
Unless you are training a style or a super generic character, you're going to have a really poor quality LoRA at network alpha 1.
Network rank and network alpha play an important role in how the LoRA cooks. Network alpha should be at least half of the network rank (e.g. 32R-16A, 64R-32A, 128R-64A). Sacrificing on either of them is not recommended in the slightest.
Just wait for the actual fix from the Dev/Creator.
I feel like I need to prove a point here. So here is an example for those telling people to use Net Alpha 1.
These are "bare generations" (meaning …).
Here are the first 3 bare generations with a LoRA trained at Net Rank 32, Net Alpha 1:
Here are the first 3 bare generations of my first LoRA ever. It is, by far, the worst quality LoRA I have ever made without trying. Net Rank 32, Net Alpha 16:
DON'T TRAIN AT NET ALPHA 1.
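For what it's worth, the rank/alpha relationship discussed above can be illustrated with a quick calculation. In kohya-style LoRA training, the weight delta is scaled by network_alpha / network_dim, so alpha = 1 at rank 32 shrinks every update by 32x compared to alpha = 32. This is only a sketch of that ratio, not the actual trainer code:

```python
# Sketch: effective LoRA scaling factor as used in kohya-style trainers,
# scale = network_alpha / network_dim. Illustrative only.

def lora_scale(network_dim: int, network_alpha: float) -> float:
    """Scaling applied to the LoRA weight delta: alpha / rank."""
    return network_alpha / network_dim

# Recommended pairs from the comment above (alpha = rank / 2): scale is 0.5.
for rank, alpha in [(32, 16), (64, 32), (128, 64)]:
    print(rank, alpha, lora_scale(rank, alpha))  # 0.5 in each case

# The alpha = 1 "workaround": updates are scaled down by 1/rank,
# which suppresses the NaN but also weakens what the LoRA learns.
print(lora_scale(32, 1))  # 0.03125
```

That 1/32 scaling is consistent with the weak, washed-out results people report when using the alpha = 1 workaround.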
Hey everyone, I've gone back to a stable kohya version (I was using an XL fork previously, which broke it) and fixed the dependencies. It seems it works now. Let me know how it goes.
Dammit I still get nan. I don't get it, we're back to the previous torch version and the previous kohya version...
I've been bouncing between previous and newer versions. I did train a LoRA today with success, then trained another a couple of hours later and got the CalledProcessError / returned non-zero exit status 1 error again.
Currently training the same errored LoRA on a previous version and it's working; unsure of the visual result though, since it's still going.
I just started messing around with Python about 2 months ago. Wish I knew enough to help. You'll figure it out though.
Don't know if you have tested it, but the colab default learning rates (5e-4 unet and 1e-4 tenc) will result in NaN. Interestingly, though, these values can be used just fine when training locally. Still, I'm not a dev, so I don't know if anything goes wrong anywhere.
The author probably needs to go through it again, I guess. Everybody, wait for the problems to be resolved!
Can someone confirm if the XL trainer is working correctly?
Did it stop working again?
I am using it right now and so far it is training well.
You know, the process ran fine for me, and I got my 10 epochs, but the resulting LoRA is a black screen. So the code and everything is fine; it probably just needs a settings adjustment. Strangely enough, network alpha = 1 works, so it's just settings. I guess if people try various settings, someone is bound to hit the jackpot!
The same settings worked before, so something must be wrong.
I've attempted to train multiple times today using different settings to see what the threshold might be, hoping that it might give a hint as to what's happening. Alas, the only variable that seems to allow the training to continue to the end for me is lowering the net alpha to 1.
Lowering the learning rate got me farther with traditional settings, but all runs resulted in NaN loss after epoch 2 or 3 out of 10.
Forgive me if I'm wrong, but isn't NaN loss usually associated with memory usage? Could there be some sort of memory leak issue? I watched for NaN specifically while training, and the interesting thing to me was that on the epoch before going into NaN loss, the loss was 0.08 to 0.1, which is great. But the next epoch is immediate NaN loss. It's just weird. I usually see gradual decreases before hitting NaN, but this seems to be instant. That's what's making me think memory leak. I'm probably wrong, but thought I'd mention it just in case.
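On watching the loss go from ~0.08 straight to NaN: the first bad epoch can also be spotted programmatically from a loss log, which makes it easier to compare runs. A minimal stdlib sketch (the loss values below are made up to mirror the behavior described, healthy then instant NaN):

```python
import math

def first_nan_epoch(losses):
    """Return the 1-based epoch index of the first NaN loss, or None."""
    for epoch, loss in enumerate(losses, start=1):
        if math.isnan(loss):
            return epoch
    return None

# Hypothetical per-epoch loss log matching the reported pattern.
log = [0.12, 0.09, 0.08, float("nan"), float("nan")]
print(first_nan_epoch(log))  # 4
```

Logging this per run for each setting you vary (alpha, learning rate) would show exactly which knob moves the NaN onset.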
I'm not a developer, so what I'm about to write could well be a dumb thing, but: with xformers flagged off there are no NaN steps in training, BUT the LoRA still comes out broken :(
I know what you're talking about. Just so you know, though: you don't have to flag xformers. You can actually flag the NaN check itself. But doing this, in most cases, will not give different results when generating with the NaN-lossed LoRA. Both those command args also give a big hit to memory usage and performance, as you're effectively disabling optimizations while using them. That's my experience, anyway.
the output is looking fine
I can confirm that dim=alpha (and potentially alpha=1/2dim with low enough lr?) works just fine.
1e-4 unet and 5e-5 text encoder works for me. I haven't tested yet, but my hypothesis is that anything higher than 3e-4 will result in NaN.
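If that hypothesis holds (anything above ~3e-4 unet LR producing NaN on colab), a small sanity check before launching training could catch risky settings. Purely a sketch built from the numbers reported in this thread, not an official limit from kohya:

```python
# Sketch: warn about learning rates above the empirically-reported safe values.
# The 3e-4 threshold comes from anecdotes in this thread, not from kohya docs.
SAFE_UNET_LR = 3e-4

def lr_warnings(unet_lr, tenc_lr):
    """Return a list of warnings for a proposed learning-rate pair."""
    warnings = []
    if unet_lr > SAFE_UNET_LR:
        warnings.append(f"unet_lr={unet_lr:g} above reported-safe {SAFE_UNET_LR:g}")
    if tenc_lr > unet_lr:
        warnings.append("text encoder LR is usually set below the unet LR")
    return warnings

print(lr_warnings(5e-4, 1e-4))  # flags the colab defaults mentioned above
print(lr_warnings(1e-4, 5e-5))  # []
```

The colab defaults (5e-4 / 1e-4) trip the check, while the values reported to work (1e-4 / 5e-5) pass cleanly.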
Not really; NaN loss isn't usually a memory issue. It's either because of the checkpoint/VAE itself, and/or a learning rate that's too high.
LoRAs trained with this come out broken.