Yo1up opened this issue 9 months ago
It's probably a memory leak: some GPU memory gets swapped out into "shared GPU memory" (i.e. main memory), which slows down training considerably, and noticeably even when only a small amount is swapped. I have found that each time a sample is generated, a bit of GPU memory is released, and most crucially, memory that had been swapped out to main memory seems to move back into dedicated GPU memory.
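If you want to confirm whether usage is actually drifting upward, you can log the allocator's numbers during training. A minimal PyTorch sketch (the function name and call site are my own, not anything in kohya_ss):

```python
import torch

def log_gpu_memory(step: int, every_n_steps: int = 100) -> None:
    """Print CUDA memory stats every `every_n_steps` training steps."""
    if step % every_n_steps != 0:
        return
    # Memory actually held by live tensors vs. memory reserved by PyTorch's caching allocator.
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"step {step}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
```

If `allocated` keeps creeping up over time, something is holding tensor references; if it stays flat while the driver still reports spill into shared memory, it's more likely allocator/fragmentation behaviour than a leak in the training code itself.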
In my setup I don't use the samples generated by kohya_ss at all (I have a second GPU I can run inference on, which produces much higher-quality results through the Fooocus web UI), but I still let kohya_ss produce a dummy 512x512 sample every 500 steps for exactly this reason. This is on Windows 10 though, so YMMV on other operating systems.
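For reference, the sampling setup looks roughly like this (flag names as I remember them from sd-scripts; double-check the README before copying):

```
--sample_every_n_steps=500 --sample_prompts=sample_prompts.txt --sample_sampler=euler_a
```

with a `sample_prompts.txt` containing a single dummy prompt, where `--w`/`--h`/`--s` set width, height, and sampling steps:

```
a photograph of a cat --w 512 --h 512 --s 20
```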
I would also recommend not using the main GPU for anything else during training (i.e. connect your monitor to either an integrated GPU or a cheap secondary GPU) to limit the chance of other applications allocating VRAM on the GPU you're training on.
I'm running the SDXL LoRA training script on a dataset of ~50,000 images and I have noticed that training slows down considerably after several thousand training steps.
I have observed this behaviour before; however, it was never an issue because I could get through multiple epochs in a couple of thousand training steps.
Training starts at 3.16 s/it and begins slowing down around step 6000. So far I have seen it slow to 5.32 s/it, but I have also observed the training script halting completely after enough time.
I am using gradient accumulation of 5, a rank 8 LoRA with a network alpha of 4, a target resolution of 1024x1024, and the Prodigy optimizer. I'm running on Ubuntu Server with a 3090 Ti.
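For concreteness, those settings correspond to roughly the following sd-scripts flags (illustrative shorthand only; my actual flags are pasted below):

```
--network_module=networks.lora --network_dim=8 --network_alpha=4 \
--gradient_accumulation_steps=5 --resolution=1024,1024 --optimizer_type=prodigy
```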
Here is the output from nvidia-smi:
My system RAM usage has been at 11.61 GB of 12 GB since the start of training. No programs are running aside from Tailscale and system applications.
Here are my flags:
And here is the settings section of the launch script:
Aaaand I just noticed my learning rate is set incorrectly, but fixing that won't resolve this issue.