pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai
BSD 3-Clause "New" or "Revised" License

How to resume learning? #2569

Open kazuma0606 opened 2 years ago

kazuma0606 commented 2 years ago

❓ Questions/Help/Support

Hi, support team. This is my first time asking a question. I believe the following code loads the checkpoints:

checkpoint_path = "/tmp/cycle_gan_checkpoints/checkpoint_26500.pt"

# let's save this checkpoint to W&B
if wb_logger is not None:
    wb_logger.save(checkpoint_path)

Since training takes a long time, it may be interrupted along the way. In such a case, what code can I use to resume training? I look forward to hearing from you. Regards.

vfdev-5 commented 2 years ago

Hi @kazuma0606, please check the following docs and examples. In a few lines of code, you can do the following:

https://github.com/pytorch/ignite/blob/315b6b98012f636034453beb8c3c334229575918/examples/contrib/cifar10/main.py#L334

https://github.com/pytorch/ignite/blob/315b6b98012f636034453beb8c3c334229575918/examples/contrib/cifar10/main.py#L351-L357
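Concretely, resuming boils down to loading the saved objects back before calling trainer.run again. A minimal sketch, assuming model, optimizer and trainer already exist and that the keys in to_load match the to_save dict used when the checkpoint was written (the path below is the one from the question above):

import torch
from ignite.handlers import Checkpoint

# keys must match those used in `to_save` when the checkpoint was created
to_load = {"model": model, "optimizer": optimizer, "trainer": trainer}

checkpoint_fp = "/tmp/cycle_gan_checkpoints/checkpoint_26500.pt"
checkpoint = torch.load(checkpoint_fp, map_location="cpu")
Checkpoint.load_objects(to_load=to_load, checkpoint=checkpoint)

# trainer.run(...) then continues from the restored epoch/iteration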

HTH

kazuma0606 commented 2 years ago

Thanks for the reply. I tried the following link: https://github.com/pytorch/ignite/blob/master/examples/notebooks/CycleGAN_with_torch_cuda_amp.ipynb

Am I correct in assuming that the code below does not save the epoch and loss information needed to resume training? Regards.

from ignite.handlers import ModelCheckpoint, TerminateOnNan

checkpoint_handler = ModelCheckpoint(
    dirname="/content/drive/My Drive/Colab Notebooks/CycleGAN_Project/pytorch-CycleGAN-and-pix2pix/datasets/T1W2T2W/cpk",
    filename_prefix="",
    require_empty=False,
)

to_save = {
    "generator_A2B": generator_A2B,
    "discriminator_B": discriminator_B,
    "generator_B2A": generator_B2A,
    "discriminator_A": discriminator_A,

    "optimizer_G": optimizer_G,
    "optimizer_D": optimizer_D,
}

trainer.add_event_handler(Events.ITERATION_COMPLETED(every=500), checkpoint_handler, to_save)
trainer.add_event_handler(Events.ITERATION_COMPLETED, TerminateOnNan())

vfdev-5 commented 2 years ago

@kazuma0606 Yes, you are correct. In order to save the epoch and iteration, we need to save the trainer as well:

to_save = {
    "generator_A2B": generator_A2B,
    "discriminator_B": discriminator_B,
    "generator_B2A": generator_B2A,
    "discriminator_A": discriminator_A,

    "optimizer_G": optimizer_G,
    "optimizer_D": optimizer_D,

    "trainer": trainer
}

As for the batch loss, there is no need to save it: once the models are restored, they will give similar batch loss values. As for running average losses (RunningAverage), unfortunately, they can't be restored. It's still a feature we would like to have.
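For context, a running average loss is typically attached like this (a sketch; the output key here is an assumption based on the loss names shown later in this thread):

from ignite.metrics import RunningAverage

# the internal state of this metric is not part of what the checkpoint handler
# saves, so after resuming it starts averaging from scratch
RunningAverage(output_transform=lambda out: out["loss_generators"]).attach(trainer, "loss_generators")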

kazuma0606 commented 2 years ago

Hi @vfdev-5, thanks for the reply. I was able to resume training without incident. I am a little curious: is the following feature valuable from an academic point of view?

As for running average losses (RunningAverage), unfortunately, they can't be restored. It's still a feature we would like to have.

Since this topic is not really related to this issue, could you please share your e-mail or a social networking account?

Regards.

vfdev-5 commented 2 years ago

Hi @kazuma0606

I am a little curious: is the following feature valuable from an academic point of view?

I'm not sure about the academic point of view, but if it is about deterministic training and reproducibility while resuming from a checkpoint, there are a few things to take into account.

More info: https://pytorch.org/ignite/engine.html#deterministic-training
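For example, a minimal sketch of the setup that page describes, assuming your update function is called train_step:

from ignite.engine import DeterministicEngine
from ignite.utils import manual_seed

manual_seed(42)  # seeds random, torch and numpy in one call

def train_step(engine, batch):
    # ... the usual update logic, same signature as with a plain Engine ...
    return {}

# DeterministicEngine makes the data flow reproducible, so resuming from a
# checkpointed iteration replays the same sequence of batches
trainer = DeterministicEngine(train_step)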

Since this topic is not really related to this issue, could you please share your e-mail or a social networking account?

You can reach the team through the channels listed here: https://github.com/pytorch/ignite#communication

As for average running losses RunningAverage, unfortunately, they can't be restored. It's still a feature we would like to have.

We can try to prioritize this feature. A related issue is already open: https://github.com/pytorch/ignite/issues/966

kazuma0606 commented 2 years ago

Hi @vfdev-5, sorry for the delay in responding. Thank you for the contact information; I will reach out separately about topics not related to this issue. By the way, regarding the CycleGAN_with_torch_cuda_amp.ipynb notebook, I have a question about the following functions:

@trainer.on(Events.EPOCH_STARTED)
def run_evaluation(engine):
    evaluator.run(eval_train_loader)
    evaluator.run(eval_test_loader)

def log_generated_images(engine, logger, event_name):

The functions run_evaluation() and log_generated_images() are called automatically at the start of training and can capture variables, like lambda expressions do. Am I correct in my understanding?

vfdev-5 commented 2 years ago

Hi @kazuma0606

Functions run_evaluation() and log_generated_images() are called automatically at the start of training

The complete code is the following:

@trainer.on(Events.EPOCH_STARTED)
def run_evaluation(engine):
    evaluator.run(eval_train_loader)
    evaluator.run(eval_test_loader)


def log_generated_images(engine, logger, event_name):
    ...  # body omitted in the notebook excerpt

tb_logger.attach(
    evaluator,
    log_handler=log_generated_images,
    event_name=Events.COMPLETED,
)

As you can see, run_evaluation is attached to the trainer on EPOCH_STARTED, so it is executed every time an epoch starts. Then tb_logger attaches log_generated_images on COMPLETED of the evaluator engine. Thus, the trainer calls run_evaluation, the evaluator runs, and once it is done (completed), tb_logger calls log_generated_images.

can capture variables, like lambda expressions do.

Yes, you can use any variables from your global scope in these functions. If you want to pass an argument explicitly, you can do something like:

another_lambda = lambda: "check another lambda"

@trainer.on(Events.EPOCH_STARTED, lambda: "check lambda")
def run_evaluation(engine, fn):
    print(fn(), another_lambda())

kazuma0606 commented 2 years ago

Hi @vfdev-5, thank you for the very clear explanation. By the way, training was interrupted partway through with the following message:

2022-05-18 01:25:03,792 ignite.handlers.terminate_on_nan.TerminateOnNan WARNING: TerminateOnNan: Output '{'loss_generators': nan, 'loss_generator_a2b': nan, 'loss_generator_b2a': 1.1836220026016235, 'loss_discriminators': 0.06603499501943588, 'loss_discriminator_a': 0.10396641492843628, 'loss_discriminator_b': 0.028103578835725784}' contains NaN or Inf. Stop training
State:
    iteration: 1878924
    epoch: 81
    epoch_length: 23250
    max_epochs: 200
    max_iters: <class 'NoneType'>
    output: <class 'dict'>
    batch: <class 'dict'>
    metrics: <class 'dict'>
    dataloader: <class 'torch.utils.data.dataloader.DataLoader'>
    seed: <class 'NoneType'>
    times: <class 'dict'>

By the way, is TerminateOnNan a function to suppress overfitting? Also, if this handler is triggered, is there any point in training further? Or, if I increase the amount of training data, will I be able to run more epochs? Does mixed-precision training also have any positive effects?

Sorry for all the questions. Regards.

kazuma0606 commented 2 years ago

I don't know if this is relevant, but I had to prepare the training dataset on my own. I was training on brain MRI images, and it seems that the types of images used for training and testing were different. Is it possible that training stops early in such cases?

vfdev-5 commented 2 years ago

Hi @kazuma0606

By the way, is TerminateOnNan a function to suppress overfitting?

When the loss goes NaN, learning is no longer possible because the weights become NaN as well, so continuing just wastes resources. The TerminateOnNan handler stops the training as soon as a NaN is encountered, so it is not about overfitting.

The loss can go NaN in various cases:

Or, if I increase the amount of training data, will I be able to run more epochs?

I'm not sure I understand your point here, sorry.

Does mixed-precision training also have any positive effects?

Yes: less GPU memory usage and faster training on NVIDIA GPUs with Tensor Cores (such as Turing-generation cards).
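For reference, a mixed-precision training step usually looks roughly like this (a generic torch.cuda.amp sketch, not the exact notebook code; model, optimizer and criterion are assumed to exist already):

import torch
from torch.cuda.amp import autocast, GradScaler
from ignite.engine import Engine

scaler = GradScaler()

def train_step(engine, batch):
    model.train()
    optimizer.zero_grad()
    x, y = batch  # assumed batch structure
    with autocast():
        y_pred = model(x)
        loss = criterion(y_pred, y)
    # the scaler rescales the loss so fp16 gradients do not underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

trainer = Engine(train_step)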

I was training on brain MRI images, and it seems that the types of images used for training and testing were different. Is it possible that training stops early in such cases?

I do not think your data is responsible for the NaN; try the two points above first and see if it helps.