hollowstrawberry / kohya-colab

Accessible Google Colab notebooks for Stable Diffusion Lora training, based on the work of kohya-ss and Linaqruf
GNU General Public License v3.0
599 stars 86 forks

Resume lora training #134

Closed aa1212a312312312312 closed 1 month ago

aa1212a312312312312 commented 5 months ago

For the Lora trainers (XL Lora Trainer, and maybe also Lora Trainer): while training, Google Colab can terminate the session (probably due to GPU usage, since after reconnecting the GPU is no longer available for use).

Can you implement continuing from a previous execution, so that training resumes when Google Colab provides a GPU again? Right now, the previous execution's results are deleted/ignored.

So a START (from scratch) option and a CONTINUE (from previous execution) option need to be provided.

Also, when executing WITHOUT a GPU, the execution terminates and the OUTPUT does not indicate any failure or termination. Can you implement a very noticeable error message stating that TERMINATION happened because no GPU was available? (A minimal check is sketched at the end of this comment.)

Also, it takes several minutes from pressing START to training actually getting underway (due to the setup). The setup does not REQUIRE a GPU, therefore can you implement SETUP and TRAINING as 2 different STEPS? SETUP could then be performed without a GPU, and the actual TRAINING step executed once a GPU is available; this would free up several extra minutes of GPU time.
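For illustration, the no-GPU case could be surfaced with a check like the one below at the top of the TRAINING step. This is only a sketch (it assumes torch is importable once setup has run), not the notebook's actual code:

```python
# Sketch only: fail loudly instead of terminating silently when no GPU exists.
# Assumes torch is already installed by the setup step.
import torch

if not torch.cuda.is_available():
    raise RuntimeError(
        "TRAINING ABORTED: no GPU available in this runtime.\n"
        "Select Runtime > Change runtime type > GPU, then run this cell again."
    )
print("GPU detected:", torch.cuda.get_device_name(0))
```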

ArmyOfPun1776 commented 5 months ago

Wait... You got the trainer itself to run with only the CPU runtime? It always errors out for me when I forget to change it back to the GPU runtime before the actual training... How'd you pull that off?

aa1212a312312312312 commented 5 months ago

No, not the TRAINER. EVERYTHING before TRAINING actually STARTS, i.e.:

- Downloading of everything
- Installation of everything
- Bucket creation
- etc.


femscat commented 5 months ago

The continue part would be great, you never know when Colab is going to terminate your session. Some months ago a reCAPTCHA challenge meant that you'd already used half your time, but nowadays it means termination is only a few minutes away. Saving the training data on Drive would enable a later resume (rough sketch below).

For the CPU-then-GPU part... I believe changing runtimes means closing the session and deleting all its contents, so it may not be possible.
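Regarding resuming: the underlying kohya sd-scripts already seem to have flags for saving and restoring full training state. A rough sketch of how a run and a later resume might look (the script name is kohya's SDXL trainer, but every path here is a made-up placeholder):

```python
import subprocess

# Rough sketch, not the notebook's code. kohya's sd-scripts accept
# --save_state / --resume; all paths below are placeholders.

# First run: save full training state (optimizer, scheduler, etc.) at each
# save point, alongside the LoRA file, into a Drive folder.
subprocess.run([
    "accelerate", "launch", "sdxl_train_network.py",
    "--save_state",
    "--output_dir", "/content/drive/MyDrive/lora_project/output",
    # ...plus all the arguments the notebook normally passes
], check=True)

# Later session: resume from the state folder the interrupted run left behind.
subprocess.run([
    "accelerate", "launch", "sdxl_train_network.py",
    "--resume", "/content/drive/MyDrive/lora_project/output/my_lora-000005-state",
    # ...same arguments as before
], check=True)
```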

hollowstrawberry commented 5 months ago

Okay so here's the problem. When I added this in 1.5, it worked terribly.

- Continuing from a safetensors gave broken results. It was unusable. I don't know if it was ever fixed.
- Saving the training data each epoch to resume later fills up your Google Drive incredibly quickly, as all the temporary files go into the trash can. It's not viable, especially if you're not paying for Google Drive space.

So I'm hesitant to add it back now.

SmSmKun commented 5 months ago

Most of the time, the first couple of months of Google Drive are pretty cheap; many people can make new accounts and new cheap trials just for this. The space issue is not really a problem (the cheap trial is 100 GB, though).

Concerning it breaking with 1.5: have you tried it with SDXL? Sadly I'm a bit biased here as I only use SDXL, but it failing in 1.5 may not mean it will also fail in SDXL.

aa1212a312312312312 commented 5 months ago

I WAS referring to SDXL as the default (and I GUESS it also happens in SD).

I also use SDXL only, whenever I can.


aa1212a312312312312 commented 5 months ago

> Saving the training data each epoch to resume

Training data? Is the only training data required just:

- Config files
- Dataset
- Buckets (but these can be re-generated, to save the disk space of storing them)
- Generated saved checkpoints (filename1.safetensors)

All of these do not take up much space. Each checkpoint can be, say, 50MB, and typically only 10 at most are generated (if selected to save this way); the amount generated can be modified/reduced. Also, is the FINAL checkpoint not the only one REQUIRED to continue again?
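If only the final .safetensors were truly needed, I guess it would map to sd-scripts' --network_weights option, which initializes the LoRA from an existing file. A hypothetical sketch (made-up path), though this may be exactly the approach that gave broken results in 1.5:

```python
import subprocess

# Hypothetical sketch: restart training with the LoRA initialized from the
# last saved .safetensors instead of from scratch. This carries no optimizer
# or scheduler state, which could be why resuming this way broke in 1.5.
subprocess.run([
    "accelerate", "launch", "sdxl_train_network.py",
    "--network_weights", "/content/drive/MyDrive/lora_project/output/my_lora-000007.safetensors",
    # ...remaining arguments as in the original run
], check=True)
```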


hollowstrawberry commented 5 months ago

First, stop replying via email, it screws up your message and makes it very confusing.

The training data consists of several gigabytes. Probably even more in XL. It is the information that is being processed during the training and is needed to correctly resume as if nothing happened.

I just don't think it's worth the effort of trying if it has such a high likelihood of being unusable.

It could work though. I would like to hear from someone who has tried it before in XL (not necessarily on colab).
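For anyone who does try it: a quick way to see how much Drive space a saved state folder would actually take (the path is a placeholder):

```python
from pathlib import Path

# Placeholder path: point this at a state folder produced by --save_state.
state_dir = Path("/content/drive/MyDrive/lora_project/output/my_lora-000005-state")

total_bytes = sum(f.stat().st_size for f in state_dir.rglob("*") if f.is_file())
print(f"State folder size: {total_bytes / 2**30:.2f} GiB")
```

If size is the main blocker, sd-scripts also appear to have a --save_last_n_epochs_state option to keep only the most recent state folders, which would bound the Drive usage (though anything deleted still lands in the Drive trash).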