victoresque / pytorch-template

PyTorch deep learning projects made easy.

Command line options #25

Closed SunQpark closed 5 years ago

SunQpark commented 5 years ago

I'm considering adding more command line options to this project. For now, what we have are the two options below (a minimal argparse sketch follows the list):

  1. --config to start a new training run
  2. --resume to continue training from a saved checkpoint
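
For reference, this is roughly how those two options are wired up with argparse (a sketch based on the template's current CLI; the description string is illustrative):

```python
import argparse

# the two existing options: start from a config, or resume from a checkpoint
parser = argparse.ArgumentParser(description='PyTorch Template')
parser.add_argument('-c', '--config', default=None, type=str,
                    help='config file path (default: None)')
parser.add_argument('-r', '--resume', default=None, type=str,
                    help='path to latest checkpoint (default: None)')
args = parser.parse_args()
```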

In the current setup, I see the following problems:

  1. the checkpoint folder must be deleted manually when cancelling a training run and starting again
  2. the config is overridden when a checkpoint is loaded

Hence, the options I suggest are the following:

  1. a -f, --force option to clean up previous checkpoints if they already exist
  2. allowing -c and -r at the same time, to load a checkpoint but use the given config

What I want most is the first one, the -f option, which seems not that hard to implement and should not clash with other parts of the project. The second option looks quite difficult to add, since the differences between config files would have to be handled carefully, but we need it to enable a fine-tuning workflow with this template.
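
For the -f option, something like this minimal sketch is what I have in mind (prepare_checkpoint_dir is a hypothetical helper, not part of the template yet):

```python
import os
import shutil

def prepare_checkpoint_dir(checkpoint_dir, force=False):
    """Create the checkpoint directory, clearing a previous run only if forced."""
    if os.path.exists(checkpoint_dir):
        if not force:
            raise FileExistsError(
                f"'{checkpoint_dir}' already exists; pass -f/--force to overwrite")
        print(f"WARNING: removing existing checkpoints under '{checkpoint_dir}'")
        shutil.rmtree(checkpoint_dir)
    os.makedirs(checkpoint_dir)
```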

Are there any other suggestions or opinions?

amjltc295 commented 5 years ago

Perhaps a timestamp could be added to the checkpoint folder name? The -f cleanup might accidentally delete well-trained checkpoints.

borgesa commented 5 years ago

Hi,

Force

I really agree with the need for a 'force' option. I think '--force' is a good name for it.

Questions regarding --force: Should it only remove the checkpoint folder, or also check for and remove the TensorBoard files? For the latter, I guess it would have to read the TensorBoard save path from the config file of the 'forced' run?

Combining 'c' and 'r'

This sounds like good functionality.

I think it is ok to give the user 'full power and full responsibility' for this. Then there would at least be fewer things to check, and it would be easier to implement.

Do you agree?


Timestamping folder names

@amjltc295: I think 'force' should be off by default.

That said, you raise a good point, and maybe timestamped folder names could be an additional argparse option?

I think timestamped folders would be nice: one less parameter to change between runs (and if we use the American date format "MMDD-time", the folders sort chronologically).
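
For example, such a stamp could be produced with strftime (the folder name below is illustrative):

```python
from datetime import datetime

# "MMDD-time" style stamp, e.g. '0107-153012'; lexicographic order matches
# chronological order within a single year
timestamp = datetime.now().strftime('%m%d-%H%M%S')
save_dir = f'saved/my_experiment/{timestamp}'
```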

SunQpark commented 5 years ago

About the --force option

Yes, @amjltc295, the --force option I am currently thinking of will delete files, so it could accidentally delete good checkpoints too. But I think that level of risk is acceptable: a user who reruns training without changing the training name and passes --force should expect files to be deleted. We can also display a warning.

Another option is to just save checkpoints into the same folder without deleting anything. In that case, only the model_best checkpoint would be overridden. Or maybe in the future we can change the folder tree so that each run is saved in a separate folder.

@borgesa I am not thinking of deleting the TensorBoard log files at the same time, because deleting them while TensorBoard is running would make it crash.

Combining -c and -r

Yes, I agree. Maybe we can start by loading both the saved and the given config files, comparing them to find the differences, and displaying a warning or interrupting training if there is a critical difference, such as in the architecture.
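
A rough sketch of that comparison, assuming JSON configs like the template's and treating only the 'arch' entry as critical (the helper name and the critical-key set are placeholders):

```python
import json

CRITICAL_KEYS = {'arch'}  # differences here should interrupt training

def check_config_compatibility(saved_cfg_path, given_cfg_path):
    with open(saved_cfg_path) as f:
        saved = json.load(f)
    with open(given_cfg_path) as f:
        given = json.load(f)
    for key in sorted(set(saved) | set(given)):
        if saved.get(key) != given.get(key):
            if key in CRITICAL_KEYS:
                raise ValueError(f"critical difference in config entry '{key}'")
            print(f"WARNING: config entry '{key}' differs from the saved config")
```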

Maybe establishing a properly organized checkpoint structure and naming scheme first would really help us do this. I agree that timestamped folders would be good to use here.

SunQpark commented 5 years ago

Adding -f is done. Checkpoints are now saved under saved/training_name/timestamp/checkpoint_epoch_n, with the timestamp in mmdd_HHMMSS format. With this structure, old checkpoints no longer need to be deleted with the -f flag. The combination of -c and -r will be updated soon as well.
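
In code, the new layout corresponds to something like this sketch (the training name is illustrative; it normally comes from the config):

```python
import os
from datetime import datetime

training_name = 'Mnist_LeNet'  # normally taken from config['name']
timestamp = datetime.now().strftime('%m%d_%H%M%S')  # e.g. '0107_153012'
checkpoint_dir = os.path.join('saved', training_name, timestamp)
os.makedirs(checkpoint_dir, exist_ok=True)
# epoch checkpoints then land at saved/<name>/<timestamp>/checkpoint_epoch_n.pth
```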