Closed — aa1212a312312312312 closed this issue 1 month ago
Wait... You got the trainer itself to run with only the CPU Runtime? It always errors out for me when I forget to change it back to the GPU Runtime before the actual training... How'd you pull that off?
No, not the TRAINER, EVERYTHING before TRAINING actually STARTS, i.e.:
Downloading of everything
Installation of everything
Bucket creation
etc.
Sent: Sunday, April 14, 2024 at 3:35 AM From: "DogFace" @.> To: "hollowstrawberry/kohya-colab" @.> Cc: "aa1212a312312312312" @.>, "Author" @.> Subject: Re: [hollowstrawberry/kohya-colab] CONTINUE from a PREVIOUS unfinished execution (Issue #134)
The continue part would be great; you never know when Colab is going to terminate your session. Some months ago a recaptcha challenge meant that you'd already used half your time, but nowadays it means termination is only a few minutes away. Saving the training state on Drive would enable a later resume.
For the CPU then GPU part... I believe changing runtimes means closing the session and deleting all contents, so maybe it's not possible.
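For what it's worth, the underlying kohya sd-scripts trainer already exposes flags for exactly this kind of resume: `--save_state` writes the full optimizer/scheduler state alongside each save, and `--resume` restores it. A sketch of how a cell might point those at Drive (the project name and paths below are hypothetical, and the other training arguments are elided):

```shell
# Sketch only: --save_state / --resume are real sd-scripts flags,
# but the paths and project name here are made-up examples.
accelerate launch sdxl_train_network.py \
  --save_state \
  --output_dir "/content/drive/MyDrive/loras/my_project" \
  --resume "/content/drive/MyDrive/loras/my_project/my_project-000004-state"
  # ...plus the usual dataset/network arguments
```

Because `--save_state` stores optimizer state (not just weights), resuming this way should behave differently from restarting off a bare safetensors file.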
Okay so here's the problem. When I added this in 1.5, it worked terribly.
So I'm hesitant to add it back now.
Most of the time, the first couple of months of Google Drive are pretty cheap; many people can make new accounts and new cheap trials just for this. The space issue is not really a problem (the cheap trial is 100 GB though).
As for it breaking in 1.5, have you tried it with SDXL? Sadly I'm a bit biased here as I only use SDXL, but failing in 1.5 may not mean it will also fail in SDXL.
I WAS referring to SDXL as default (and I GUESS it also happens in SD)
I also use SDXL only, whenever I can
Saving the training data each epoch to resume
Training data? Is the only training data required just:
Config files
Dataset
Buckets (but these can be re-generated, to save disk space on storing them)
Generated saved checkpoints (filename1.safetensors)
All of these do not take up much space. Each checkpoint can be, say, 50 MB, and typically only 10 at most are generated (if selected to save this way). The amount generated can be modified/reduced. Also, is the FINAL checkpoint not the only one REQUIRED to continue again?
Continuing from a safetensors file gave broken results. It was unusable. I don't know if it was ever fixed.
Saving the training data each epoch to resume later fills up your Google Drive incredibly quickly, as all the temporary files go into the trash can. It's not viable, especially if you're not paying for Google Drive space.
First, stop replying via email; it screws up your message and makes it very confusing.
The training data consists of several gigabytes, probably even more in XL. It is the information being processed during training, and it is needed to correctly resume as if nothing happened.
I just don't think it's worth the effort of trying if it has such a high likelihood of being unusable.
It could work though. I would like to hear from someone who has tried it before in XL (not necessarily on colab).
For the Lora trainers (XL Lora Trainer, and maybe Lora Trainer too): while training, Google Colab can terminate the session (probably due to GPU usage, as when reconnecting, the GPU is no longer available for use).
Can you implement CONTINUING from a previous execution, so that training resumes (when Google Colab provides a GPU again)? As it is right now, the previous execution's results are deleted/ignored.
So a START (from scratch) or a CONTINUE (from previous execution) option needs to be provided.
Also, if executed WITHOUT a GPU, the execution terminates and the OUTPUT does not indicate any failure or termination. Can you implement a very noticeable error message stating that TERMINATION has happened because no GPU is available?
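A loud check like this is cheap to add at the top of the training cell. This is only a sketch, relying on the fact that Colab GPU runtimes ship the `nvidia-smi` binary while CPU runtimes don't; the wording of the banner is made up:

```python
import shutil
import subprocess

def require_gpu() -> None:
    """Raise a very visible error when no NVIDIA GPU is attached.

    Sketch: assumes nvidia-smi is present and working exactly when a
    GPU runtime is attached (true on Colab GPU runtimes).
    """
    has_gpu = (
        shutil.which("nvidia-smi") is not None
        and subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0
    )
    if not has_gpu:
        banner = "!" * 62
        raise RuntimeError(
            f"\n{banner}\n"
            "TERMINATION: no GPU is available in this runtime.\n"
            "Switch via Runtime > Change runtime type > GPU, then rerun.\n"
            f"{banner}"
        )
```

Calling `require_gpu()` before launching the trainer would make the failure obvious instead of silent.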
Also, it takes several minutes from pressing START to getting training underway (due to the setup). The setup does not REQUIRE a GPU, therefore can you implement SETUP and TRAINING as 2 different STEPS? SETUP could then be performed without a GPU, and when a GPU becomes available the actual TRAINING step can be executed; this would free up several additional minutes of GPU time.
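As a sketch of that split (the function names, step labels, and messages here are made up, not the notebook's actual API): the setup step runs on any runtime, while the training step is gated on a GPU being present:

```python
import shutil

def gpu_available() -> bool:
    # Colab GPU runtimes expose the nvidia-smi binary; CPU runtimes don't.
    return shutil.which("nvidia-smi") is not None

def run_step(step: str) -> str:
    """Run one notebook step. SETUP (downloads, installs, bucket
    creation) needs no GPU, so it can run ahead of time on a cheap
    CPU runtime; TRAINING refuses to start without a GPU."""
    if step == "SETUP":
        # ...download models, install dependencies, build buckets...
        return "setup complete"
    if step == "TRAINING":
        if not gpu_available():
            return "TRAINING SKIPPED: no GPU in this runtime"
        # ...launch the actual trainer here...
        return "training started"
    raise ValueError(f"unknown step: {step}")
```

With this shape, a user could run SETUP on the CPU runtime, switch to a GPU runtime, and spend the GPU minutes only on TRAINING (subject to Holo's caveat above that switching runtimes may wipe the session contents).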