Resume Training - GitHub Issues

davi-min commented 2 months ago

Can you resume training? I noticed there are options to save state, but I'm not sure how to go about resuming the training.

bananasss00 commented 2 months ago

You can add the parameter network_weights=path_to_lora here. I haven't tested it, but it should work.

kijai commented 2 months ago

I can add an easier-to-use node for it, but I don't currently have time to test the actual functionality.

kijai commented 2 months ago

It looks like the current kohya code can't load the saved state :(

davi-min commented 2 months ago

That's strange. I just tried the option offered by bananasss00 and I thought it had worked. The results are noticeably better. I did change some parameters, though, so now I'm not sure if it just looks better because of the changed parameters or if it actually worked.

kijai commented 2 months ago

When I try, it refuses to even load the saved state due to a key mismatch error. Can you point me to the code that you say worked?

pondloso commented 2 months ago

I think he means continuing training from a finished LoRA, not from a saved state.
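
To make the distinction concrete, here is a minimal sketch of the two ways to continue. `network_weights` is the parameter named earlier in this thread; `resume` is the kohya sd-scripts option for loading a saved state, and it is an assumption that the node passes it through under that name. The helper `continuation_options` and both paths are hypothetical.

```python
def continuation_options(lora_path=None, state_dir=None):
    """Return the extra training option for one of the two continuation modes.

    Hypothetical helper for illustration only; it is not part of
    ComfyUI-FluxTrainer or kohya sd-scripts.
    """
    if (lora_path is None) == (state_dir is None):
        raise ValueError("pass exactly one of lora_path or state_dir")
    if lora_path is not None:
        # Starts a new run from existing LoRA weights. Only the weights are
        # loaded, so the optimizer state and step count start fresh.
        return {"network_weights": lora_path}
    # Resumes from a folder written by the save-state option, which is meant
    # to restore optimizer state and step count as well (assumed option name).
    return {"resume": state_dir}


# Hypothetical paths, for illustration only.
print(continuation_options(lora_path="output/my_lora.safetensors"))
print(continuation_options(state_dir="output/my_lora-state"))
```

Because the first mode restarts the optimizer and scheduler, its results can differ from a true resume even when the weights load correctly.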

kijai commented 1 month ago

Can you try this: in the saved state folder there are model.safetensors and model_1.safetensors. Remove, rename, or move model.safetensors, rename model_1.safetensors to model.safetensors, then try to resume from that folder. I think it's mistakenly saving the full model and then, when resuming, trying to load it as a LoRA.

gloobnib commented 1 month ago

Not the original poster, but this solved it for me. Or more specifically, I had to rename model_2.safetensors to model.safetensors. I also had a model_1.safetensors in the state folder that was also almost the same size as the original model.safetensors. The model_2.safetensors was around 250MB, whereas the other two were around 6GB in size.
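
The rename workaround above could be scripted roughly as follows. This is a sketch that assumes, based on the file sizes reported in this comment, that the LoRA weights are the smallest model*.safetensors file in the state folder. The function name `promote_lora_weights` and the backup file name are made up for illustration, and the full-size file is moved aside instead of deleted.

```python
from pathlib import Path


def promote_lora_weights(state_dir: str) -> Path:
    """Rename the smallest model*.safetensors in state_dir to model.safetensors.

    Assumes the smallest file holds the LoRA weights and the larger ones are
    full-model copies saved by mistake. Hypothetical helper, not part of any
    library.
    """
    state = Path(state_dir)
    candidates = sorted(
        state.glob("model*.safetensors"), key=lambda p: p.stat().st_size
    )
    if not candidates:
        raise FileNotFoundError(f"no model*.safetensors files in {state}")

    smallest = candidates[0]
    target = state / "model.safetensors"
    if smallest == target:
        return target  # already in the expected place, nothing to do

    if target.exists():
        backup = state / "model.safetensors.full_backup"
        if backup.exists():
            raise FileExistsError(f"{backup} already exists")
        # Keep the large file under a name the glob above no longer matches.
        target.rename(backup)

    smallest.rename(target)
    return target
```

Check the file sizes by hand before running something like this on a real state folder, since the size heuristic is only a guess about which file is which.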

kijai commented 1 month ago

It should no longer save the whole model in the future. I've yet to check whether the actual resuming works result-wise, though. Did you?

JohnnyJae commented 1 month ago

I tried it, then compared the results from the first run (2000 steps) with the run that resumed from 2000 steps. The images were different, but it was hard to tell if it worked, since it was a style LoRA and the images were too similar. My guess is that it didn't work, but I suck at this so I can't tell.

RaySteve312 commented 1 month ago

How about the loss graph? It should tell you something.
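
One rough way to read the loss graph is sketched below. It assumes you can export the per-step loss of both runs as plain lists of floats; the function name `resume_looks_continuous` and the window size are made up for illustration. The idea is that a working resume should start near the loss level where the first run ended, not near where it began. Diffusion training loss is noisy, so treat the result as a hint only.

```python
from statistics import mean


def resume_looks_continuous(first_run, resumed_run, window=50):
    """Return True if the resumed run starts closer to the end of the first
    run than to its beginning, comparing averages over `window` steps.

    Hypothetical heuristic for illustration, not a definitive test.
    """
    window = min(window, len(first_run), len(resumed_run))
    if window == 0:
        raise ValueError("both runs need at least one loss value")

    start_level = mean(first_run[:window])     # loss at the start of run 1
    end_level = mean(first_run[-window:])      # loss at the end of run 1
    resumed_level = mean(resumed_run[:window])  # loss right after resuming

    return abs(resumed_level - end_level) < abs(resumed_level - start_level)
```

If the loss barely drops over the first run, the start and end levels are almost the same and this check cannot tell the two cases apart.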

kijai / ComfyUI-FluxTrainer

Resume Training #17