kazuma0606 opened this issue 2 years ago
Hi @kazuma0606, please check the following docs and examples:
In a few lines of code, you can do the following:
HTH
Thanks for the reply. I tried the following link: https://github.com/pytorch/ignite/blob/master/examples/notebooks/CycleGAN_with_torch_cuda_amp.ipynb
Am I correct in assuming that the code below does not include the epoch and loss information needed to resume training? Regards.
from ignite.handlers import ModelCheckpoint, TerminateOnNan

checkpoint_handler = ModelCheckpoint(
    dirname="/content/drive/My Drive/Colab Notebooks/CycleGAN_Project/pytorch-CycleGAN-and-pix2pix/datasets/T1W2T2W/cpk",
    filename_prefix="",
    require_empty=False,
)
to_save = {
    "generator_A2B": generator_A2B,
    "discriminator_B": discriminator_B,
    "generator_B2A": generator_B2A,
    "discriminator_A": discriminator_A,
    "optimizer_G": optimizer_G,
    "optimizer_D": optimizer_D,
}
trainer.add_event_handler(Events.ITERATION_COMPLETED(every=500), checkpoint_handler, to_save)
trainer.add_event_handler(Events.ITERATION_COMPLETED, TerminateOnNan())
@kazuma0606 Yes, you are correct. In order to save the epoch and iteration, we need to save the trainer as well:
to_save = {
    "generator_A2B": generator_A2B,
    "discriminator_B": discriminator_B,
    "generator_B2A": generator_B2A,
    "discriminator_A": discriminator_A,
    "optimizer_G": optimizer_G,
    "optimizer_D": optimizer_D,
    "trainer": trainer,
}
As for the batch loss, there is no need to save it: once the models are restored, they will give similar batch loss values.
As for running average losses (RunningAverage), unfortunately they can't be restored. It's still a feature we would like to have.
Hi @vfdev-5, thanks for the reply. I was able to resume training without incident. I am a little curious: is the following feature valuable from an academic point of view?
As for running average losses (RunningAverage), unfortunately they can't be restored. It's still a feature we would like to have.
Since this is a topic not really related to this issue, could you share an e-mail address or a social networking account where I can reach you?
Regards.
Hi @kazuma0606
I am a little curious, is the following feature valuable from an academic point of view?
I'm not sure about the academic point of view, but if it is about deterministic training and reproducibility while resuming from a checkpoint, there are a few things to take into account:
More info: https://pytorch.org/ignite/engine.html#deterministic-training
Since this is a topic not really related to this issue, could you share an e-mail address or a social networking account?
Please reach us through the team's communication channels.
See also: https://github.com/pytorch/ignite#communication
As for running average losses (RunningAverage), unfortunately they can't be restored. It's still a feature we would like to have.
We can try to prioritize this feature. There is already a related issue open: https://github.com/pytorch/ignite/issues/966
Hi @vfdev-5, sorry for the delay in responding. Thank you for your contact information; I will e-mail you separately about topics not related to this issue. By the way, regarding the notebook CycleGAN_with_torch_cuda_amp.ipynb, I have a question about the following functions:
@trainer.on(Events.EPOCH_STARTED)
def run_evaluation(engine):
    evaluator.run(eval_train_loader)
    evaluator.run(eval_test_loader)

def log_generated_images(engine, logger, event_name):
    # ...
The functions run_evaluation() and log_generated_images() are called automatically at the start of training and can capture variables like lambda expressions. Is my understanding correct?
Hi @kazuma0606
Functions run_evaluation() and log_generated_images() are called automatically at the start of training
Complete code is the following:
@trainer.on(Events.EPOCH_STARTED)
def run_evaluation(engine):
    evaluator.run(eval_train_loader)
    evaluator.run(eval_test_loader)

def log_generated_images(engine, logger, event_name):
    # ...

tb_logger.attach(evaluator,
                 log_handler=log_generated_images,
                 event_name=Events.COMPLETED)
As you can see, trainer has run_evaluation attached on EPOCH_STARTED, so at every epoch start it will execute run_evaluation. Then you see that tb_logger attaches log_generated_images on COMPLETED for the evaluator engine. Thus, trainer calls run_evaluation, where evaluator runs, and once it is done (completed) it calls log_generated_images via tb_logger.
can capture variables like lambda expressions.
Yes, I think you can use any variables in these functions from your global scope. If you want to pass an argument explicitly, you can do something like:
another_lambda = lambda: "check another lambda"

@trainer.on(Events.EPOCH_STARTED, lambda: "check lambda")
def run_evaluation(engine, fn):
    print(fn(), another_lambda())
Hi @vfdev-5, thank you for the very clear explanation. By the way, training was interrupted in the middle by the following message:
2022-05-18 01:25:03,792 ignite.handlers.terminate_on_nan.TerminateOnNan WARNING: TerminateOnNan: Output '{'loss_generators': nan, 'loss_generator_a2b': nan, 'loss_generator_b2a': 1.1836220026016235, 'loss_discriminators': 0.06603499501943588, 'loss_discriminator_a': 0.10396641492843628, 'loss_discriminator_b': 0.028103578835725784}' contains NaN or Inf. Stop training
State:
iteration: 1878924
epoch: 81
epoch_length: 23250
max_epochs: 200
max_iters: <class 'NoneType'>
output: <class 'dict'>
batch: <class 'dict'>
metrics: <class 'dict'>
dataloader: <class 'torch.utils.data.dataloader.DataLoader'>
seed: <class 'NoneType'>
times: <class 'dict'>
By the way, is the TerminateOnNan flag meant to suppress overfitting? Also, once this flag is triggered, is there any point in training further? Or, if I increase the number of cases, will I be able to run more epochs? Does mixed-precision training also have any positive effect?
Sorry for all the questions. Regards.
I don't know if this is relevant, but I had to prepare the dataset and train on my own. I was training on brain MRI images, but it seems the types of images in the training and test sets were different. Is it possible that training stops early in such cases?
Hi @kazuma0606
By the way, is the TerminateOnNan flag meant to suppress overfitting?
When the loss becomes NaN, learning is not possible anymore, as the weights become NaN as well and we just waste resources. The TerminateOnNan handler helps stop the training as soon as a NaN is encountered.
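Conceptually, the handler applies a check like this to each iteration's output dict (a stdlib-only sketch, not TerminateOnNan's actual implementation):

```python
import math

def contains_nan_or_inf(output):
    # True if any scalar loss in the output dict is NaN or +/-Inf.
    return any(not math.isfinite(value) for value in output.values())

# Values taken from the log message above.
output = {
    "loss_generator_a2b": float("nan"),
    "loss_discriminator_a": 0.10396641492843628,
}
print(contains_nan_or_inf(output))  # True -> training would be stopped
```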
The loss can become NaN in various cases:
Or, if I increase the number of cases, will I be able to run more epochs?
I'm not sure I understand your point here, sorry.
Does mixed-precision training also have any positive effect?
Yes: less GPU memory usage and faster training on Nvidia GPUs with Tensor Cores.
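A minimal mixed-precision training step with torch.cuda.amp, as used in the CycleGAN notebook (a sketch; the tiny model and random data are illustrative, and it falls back to plain fp32 when no GPU is available):

```python
import torch
from torch import nn

use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

model = nn.Linear(4, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# GradScaler rescales the loss to avoid fp16 underflow; it is a no-op
# when disabled, so the same code also runs on CPU.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(8, 4, device=device)
y = torch.randn(8, 1, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, enabled=use_amp):
    # On GPU the forward pass runs in float16, saving memory and time.
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
print(torch.isfinite(loss).item())
```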
I was training on brain MRI images, but it seems the types of images in the training and test sets were different. Is it possible that training stops early in such cases?
I do not think your data is responsible for the NaN; try the two points above first and see if it helps.
❓ Questions/Help/Support
Hi, support team. This is my first time asking a question. I believe the following code will load the checkpoint:

checkpoint_path = "/tmp/cycle_gan_checkpoints/checkpoint_26500.pt"

If training takes a long time, there will be interruptions along the way. In such a case, what code can I use to resume training? We look forward to hearing from you. Regards.