kohya-ss / sd-scripts

Apache License 2.0

Feature Request: Save N epoch and N steps, comma separated #1727

Open FurkanGozukara opened 3 hours ago

FurkanGozukara commented 3 hours ago

We need to be able to save only certain epochs and steps.

For example, save epochs 30,35,40,45 and no others, or save steps 300,400,500 and no others.

Can you please add this option? Thank you @kohya-ss

This became super important for FLUX training, since each checkpoint is 24 GB.

This is for saving checkpoints, but an option to save state this way would be nice as well.
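
The request above could be sketched roughly as follows. This is only an illustration of the idea, not sd-scripts code: `parse_save_points` and `should_save` are hypothetical helper names, and no such comma-separated option is confirmed to exist in this thread.

```python
def parse_save_points(spec: str) -> set[int]:
    """Parse a comma-separated string such as "30,35,40,45" into a set of ints.

    Hypothetical helper: the name and the input format are assumptions.
    """
    points = set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate trailing commas and stray spaces
        value = int(token)  # raises ValueError on non-numeric input
        if value <= 0:
            raise ValueError(f"save point must be positive, got {value}")
        points.add(value)
    return points


def should_save(current: int, points: set[int]) -> bool:
    """Return True only when the current epoch (or step) was explicitly requested."""
    return current in points


save_epochs = parse_save_points("30,35,40,45")
saved = [epoch for epoch in range(1, 51) if should_save(epoch, save_epochs)]
print(saved)  # [30, 35, 40, 45]
```

The same two helpers would work for steps, for example `parse_save_points("300,400,500")`.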

kohya-ss commented 2 hours ago

I think that the functionality is sufficient if we combine options --save_every_n_epochs and --save_last_n_epochs. Saving checkpoints does take time, but if there is a problem and training ends midway, it would be more of a problem if the checkpoints were not saved.

FurkanGozukara commented 2 hours ago

@kohya-ss that is still not exactly the same.

Let's say I wanted to save epochs 30, 50 and 55. Currently this is not possible.

Also, the last time I tested --save_last_n_epochs it didn't work :D I had it set to 3. It tried to save the 4th checkpoint and only after that tried to delete an old one, so I got an out-of-space error.

But I am going to test again. I think it should delete the old one first and then save the next one, thus fully utilizing the space.
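
The delete-before-save ordering suggested above might look like the sketch below. It only plans the order of operations and touches no files. `plan_rotation` is a hypothetical name, and the sketch does not describe how sd-scripts actually rotates checkpoints.

```python
def plan_rotation(existing: list[str], new_name: str, keep_last: int) -> list[tuple[str, str]]:
    """Return ordered (action, name) pairs for adding one checkpoint.

    `existing` is ordered oldest first. Old files are deleted BEFORE the new
    one is saved, so the file count on disk never exceeds `keep_last`.
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    actions = []
    remaining = list(existing)  # copy so the caller's list is not modified
    # Free space first: drop the oldest files until there is room for one more.
    while len(remaining) >= keep_last:
        actions.append(("delete", remaining.pop(0)))
    actions.append(("save", new_name))
    return actions


plan = plan_rotation(["epoch-10", "epoch-20", "epoch-30"], "epoch-40", keep_last=3)
print(plan)  # [('delete', 'epoch-10'), ('save', 'epoch-40')]
```

The trade-off is that if the save fails after the delete, one fewer checkpoint is left on disk than with save-then-delete.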

dsienra commented 1 hour ago


I set "Save last N epochs state" to 2. My intention was to keep just the last 2 or 3 safetensors checkpoints because of disk space restrictions. I am saving every 25 epochs. Should I set "Save last N epochs state" to 50 if I want to keep the last 2, or to 75 to keep the last 3? Or doesn't it work this way?
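
The two readings of the setting in this question can be compared with a small simulation. Which reading sd-scripts implements is not confirmed in this thread. `kept_checkpoints` and its `unit` parameter are hypothetical and exist only to make the difference concrete.

```python
def kept_checkpoints(total_epochs: int, save_every: int, last_n: int, unit: str) -> list[int]:
    """Return the epochs whose checkpoints remain after training finishes.

    unit="checkpoints": last_n counts saved files (keep the newest last_n).
    unit="epochs":      last_n counts epochs (keep files from the final last_n epochs).
    Both readings are assumptions made for illustration.
    """
    if last_n < 1:
        raise ValueError("last_n must be at least 1")
    saved = list(range(save_every, total_epochs + 1, save_every))
    if unit == "checkpoints":
        return saved[-last_n:]
    if unit == "epochs":
        cutoff = total_epochs - last_n
        return [epoch for epoch in saved if epoch > cutoff]
    raise ValueError(f"unknown unit: {unit}")


# 100 epochs in total, saving every 25 epochs.
print(kept_checkpoints(100, 25, 2, "checkpoints"))  # [75, 100]
print(kept_checkpoints(100, 25, 2, "epochs"))       # [100]
print(kept_checkpoints(100, 25, 50, "epochs"))      # [75, 100]
```

Under the "checkpoints" reading a value of 2 keeps the last two files. Under the "epochs" reading the value would have to be 50 to get the same result.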