omertov / encoder4editing

Official implementation of "Designing an Encoder for StyleGAN Image Manipulation" (SIGGRAPH 2021) https://arxiv.org/abs/2102.02766
MIT License

Save Interval Checkpoints? #45

Closed. aferriss closed this issue 3 years ago.

aferriss commented 3 years ago

Hi there! I'm having trouble getting the --save_interval flag to work. I set it to 1000 as a test for my last round of training, but no checkpoints were saved out. Am I right in assuming the repo is set up to only save a single checkpoint at the end of max_steps?

omertov commented 3 years ago

Hi @aferriss! The --save_interval flag defaults to the last training step, but when specified, checkpoints should be saved at the given interval.

If you continue training from a previous checkpoint, note that the value you specify may be overridden by the opts stored in the checkpoint being resumed. If that is the case, you can add --update_param_list save_interval to your command args, which prevents the stored opts from overriding the save_interval flag.
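
Roughly, the resume behavior works like this (a minimal sketch of what is described above, not the repo's exact code): the opts saved in the checkpoint replace the command-line opts, except for keys explicitly listed in --update_param_list, which keep their newly passed values.

def merge_resumed_opts(cli_opts, ckpt_opts, update_param_list=None):
    # Start from the opts stored in the resumed checkpoint...
    merged = dict(ckpt_opts)
    # ...but keep the freshly supplied value for any key named in
    # --update_param_list (e.g. ['save_interval']).
    for key in (update_param_list or []):
        merged[key] = cli_opts[key]
    return merged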

Does the correct value show up when printing the opts dict at the start of training?

Best, Omer

aferriss commented 3 years ago

Hi @omertov, thanks for the quick response!

I ran the same command I ran yesterday again today, and this time it did save out checkpoints every 1000 steps. I'm not sure what happened when I ran it before; maybe it was a little glitch in Colab. Anyhow, I'm happy that it's working now!

This is what my script and opts look like for reference:

!python scripts/train.py \
--dataset_type my_data_encode \
--exp_dir /content/drive/MyDrive/e4e \
--start_from_latent_avg \
--use_w_pool \
--w_discriminator_lambda 0.1 \
--progressive_start 20000 \
--id_lambda 0.5 \
--val_interval 10000 \
--max_steps 15000 \
--stylegan_size 1024 \
--stylegan_weights /content/drive/MyDrive/pytorch-models/model.pt \
--workers 8 \
--batch_size 8 \
--test_batch_size 4 \
--test_workers 4 \
--save_interval 1000
{'batch_size': 8,
 'board_interval': 50,
 'checkpoint_path': None,
 'd_reg_every': 16,
 'dataset_type': 'cartoon_encode',
 'delta_norm': 2,
 'delta_norm_lambda': 0.0002,
 'encoder_type': 'Encoder4Editing',
 'exp_dir': '/content/drive/MyDrive/e4e/new7',
 'id_lambda': 0.5,
 'image_interval': 100,
 'keep_optimizer': False,
 'l2_lambda': 1.0,
 'learning_rate': 0.0001,
 'lpips_lambda': 0.8,
 'lpips_type': 'alex',
 'max_steps': 15000,
 'optim_name': 'ranger',
 'progressive_start': 20000,
 'progressive_step_every': 2000,
 'progressive_steps': [0,
                       20000,
                       22000,
                       24000,
                       26000,
                       28000,
                       30000,
                       32000,
                       34000,
                       36000,
                       38000,
                       40000,
                       42000,
                       44000,
                       46000,
                       48000,
                       50000,
                       52000],
 'r1': 10,
 'resume_training_from_ckpt': None,
 'save_interval': 1000,
 'save_training_data': False,
 'start_from_latent_avg': True,
 'stylegan_size': 1024,
 'stylegan_weights': '/content/drive/MyDrive/pytorch-models/cartoon-gan-108.pt',
 'sub_exp_dir': None,
 'test_batch_size': 4,
 'test_workers': 4,
 'train_decoder': False,
 'update_param_list': None,
 'use_w_pool': True,
 'val_interval': 10000,
 'w_discriminator_lambda': 0.1,
 'w_discriminator_lr': 2e-05,
 'w_pool_size': 50,
 'workers': 8}
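
As a quick double-check, the save_interval that actually took effect can also be read back from a saved checkpoint. This is a sketch, assuming the checkpoint stores its training options under an 'opts' key and is written under the experiment's checkpoints folder; the filename below is hypothetical.

import pprint
import torch

# Load a checkpoint saved during training and print the opts it recorded.
ckpt = torch.load('/content/drive/MyDrive/e4e/checkpoints/iteration_1000.pt',
                  map_location='cpu')
pprint.pprint(ckpt.get('opts', {}))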