Closed SunQpark closed 5 years ago
Perhaps the timestamp could be added to the checkpoint folder name? The `-f` cleanup might accidentally delete well-trained checkpoints.
Hi,
I really agree with the need for a 'force' option. I think `--force` is a good name for it.
Questions regarding `--force`: should it only remove the checkpoint folder, or also check for and remove the TensorBoard files? For the latter, I guess it would have to read the TensorBoard save path from the config of the 'forced' run?
This sounds like good functionality.
I think it is OK to give the user 'full power and full responsibility' for this. There would then be fewer things to check, and it would be easier to implement.
Do you agree?
Functionality questions:
@amjltc295: I think the default status of 'force' would be off.
That said, you raise a good point, and maybe timestamped folder names could be an additional option in argparse?
I think timestamped folders would be nice: one less parameter to change between runs (and if we use the American date format "MMDD-time", there's sequential sorting).
Yes, @amjltc295, the `--force` option I am currently thinking of will delete files, so it can accidentally delete good files too. But I think that level of risk is acceptable for users who run a training without changing the training name, knowing that `--force` may delete files. We can display a warning.
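A minimal sketch of what such a flag could look like. The function name, the `saved/` base directory, and the overall layout are assumptions for illustration, not the template's actual code:

```python
import argparse
import shutil
from pathlib import Path


def maybe_clean_checkpoints(checkpoint_dir, argv=None):
    """Delete an existing checkpoint folder only when --force is given.

    Hypothetical helper: warns before deleting, and refuses to proceed
    without --force so well-trained checkpoints are not silently lost.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', '--force', action='store_true',
                        help='delete existing checkpoints before training')
    args = parser.parse_args(argv)

    path = Path(checkpoint_dir)
    if path.exists():
        if not args.force:
            raise SystemExit(
                f'{path} already exists; rerun with --force to overwrite')
        print(f'WARNING: deleting existing checkpoints in {path}')
        shutil.rmtree(path)
```

The warning-plus-abort default keeps the 'full responsibility' with the user: nothing is deleted unless the flag is passed explicitly.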
Another option is to just save checkpoints into the same folder, without deleting anything. In this case, only the `model_best` checkpoint would be overwritten. Or maybe in the future we can change the folder tree somehow so they are saved in separate folders.
@borgesa I am not thinking of deleting the TensorBoard log file at the same time, because deleting it while TensorBoard is running would cause a crash.
Regarding `-c` and `-r`: yes, I agree. Maybe we can first load both the saved and the given config files, compare them to find differences, and display a warning or interrupt training if there is a critical difference, such as in the architecture.
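A rough sketch of that comparison, assuming configs are plain dicts. The key names (`arch` as the critical one) and the function name are assumptions for illustration:

```python
def check_config_compatibility(saved_cfg, given_cfg, critical_keys=('arch',)):
    """Compare the config stored with a checkpoint against the one given
    on the command line: abort on critical mismatches, warn on the rest.

    Hypothetical helper; which keys count as 'critical' would need to be
    decided per project.
    """
    for key in sorted(set(saved_cfg) | set(given_cfg)):
        if saved_cfg.get(key) == given_cfg.get(key):
            continue
        if key in critical_keys:
            raise ValueError(
                f'critical config mismatch on {key!r}: '
                f'{saved_cfg.get(key)!r} != {given_cfg.get(key)!r}')
        print(f'warning: config differs on {key!r}; using the given value')
```

Non-critical differences (e.g. learning rate) only warn, which is exactly what a fine-tuning run needs; an architecture change interrupts training.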
Maybe making a properly organized checkpoint structure and naming scheme first would really help us do this. I agree that timestamped folders would be good to use here.
Adding `-f` is done. Checkpoints are now saved under `saved/training_name/timestamp/checkpoint_epoch_n`, with the timestamp in `mmdd_HHMMSS` format. With this structure, checkpoints no longer need to be deleted with the `-f` flag.
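Building such a path is a one-liner with `datetime.strftime`; a minimal sketch, where the helper name and the `saved` base directory are assumptions:

```python
from datetime import datetime
from pathlib import Path


def checkpoint_dir(training_name, base='saved', now=None):
    """Return saved/<training_name>/<mmdd_HHMMSS> (hypothetical layout).

    The timestamp makes each run's folder unique and sorts sequentially
    within a year, so old checkpoints are never overwritten.
    """
    stamp = (now or datetime.now()).strftime('%m%d_%H%M%S')
    return Path(base) / training_name / stamp
```

For example, a run started on Dec 25 at 13:05:09 would land in `saved/Mnist/1225_130509`.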
The combination of `-c` and `-r` will also be updated soon.
I'm considering adding more command line options to this project. For now, what we have are:

- `--config` to start a new training
- `--resume` to continue training from a saved checkpoint

In the current setup, I feel the following problems arise. Hence, the options I suggest are as follows:

- a `-f`/`--force` option to clean up previous checkpoints if they already exist
- allowing `-c` and `-r` at the same time, loading the checkpoint but using the given config

What I want most is to add the first, the `-f` option, which seems not that hard to implement and should not clash with other parts of the project. The second option looks quite difficult to add, since the differences between config files must be handled carefully, but we need it to enable a fine-tuning process with this template. Are there any other suggestions or opinions?