pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai
BSD 3-Clause "New" or "Revised" License
4.52k stars 617 forks source link

Improvements for resuming from terminated engine #2669

Open vfdev-5 opened 2 years ago

vfdev-5 commented 2 years ago

Current Engine's behaviour when we resume the run from terminated state is the following:

from ignite.engine import Engine, Events
from ignite.utils import setup_logger, logging

def func(engine, batch):
    print(engine.state.epoch, engine.state.iteration, " | ", batch)

max_epochs = 4
data = range(10)
engine = Engine(func)
engine.logger = setup_logger("engine", level=logging.DEBUG)

@engine.on(Events.ITERATION_COMPLETED(once=14))
def terminate():
    print("-> terminate")
    engine.terminate()

engine.run(data, max_epochs=max_epochs)
engine.run(data, max_epochs=max_epochs)
print(engine.state.epoch, engine.state.iteration)
Output: ``` 2022-08-23 15:42:07,681 engine INFO: Engine run starting with max_epochs=4. 2022-08-23 15:42:07,682 engine DEBUG: 0 | 0, Firing handlers for event Events.STARTED 2022-08-23 15:42:07,683 engine DEBUG: 1 | 0, Firing handlers for event Events.EPOCH_STARTED 2022-08-23 15:42:07,683 engine DEBUG: 1 | 0, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,685 engine DEBUG: 1 | 0, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,687 engine DEBUG: 1 | 1, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,688 engine DEBUG: 1 | 1, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,688 engine DEBUG: 1 | 1, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,689 engine DEBUG: 1 | 1, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,690 engine DEBUG: 1 | 2, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,691 engine DEBUG: 1 | 2, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,692 engine DEBUG: 1 | 2, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,692 engine DEBUG: 1 | 2, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,693 engine DEBUG: 1 | 3, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,693 engine DEBUG: 1 | 3, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,695 engine DEBUG: 1 | 3, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,696 engine DEBUG: 1 | 3, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,697 engine DEBUG: 1 | 4, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,698 engine DEBUG: 1 | 4, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,698 engine DEBUG: 1 | 4, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,700 engine DEBUG: 1 | 4, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,701 engine DEBUG: 1 | 5, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,702 engine DEBUG: 1 | 5, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,703 engine DEBUG: 1 | 5, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,704 engine DEBUG: 1 | 5, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,706 engine DEBUG: 1 | 6, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,707 engine DEBUG: 1 | 6, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,708 engine DEBUG: 1 | 6, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,708 engine DEBUG: 1 | 6, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,709 engine DEBUG: 1 | 7, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,711 engine DEBUG: 1 | 7, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,712 engine DEBUG: 1 | 7, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,714 engine DEBUG: 1 | 7, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,714 engine DEBUG: 1 | 8, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,716 engine DEBUG: 1 | 8, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,717 engine DEBUG: 1 | 8, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,717 engine DEBUG: 1 | 8, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,719 engine DEBUG: 1 | 9, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,719 engine DEBUG: 1 | 9, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,720 engine DEBUG: 1 | 9, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,720 engine DEBUG: 1 | 9, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,721 engine DEBUG: 1 | 10, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,722 engine DEBUG: 1 | 10, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,722 engine DEBUG: 1 | 10, Firing handlers for event Events.EPOCH_COMPLETED 2022-08-23 15:42:07,723 engine INFO: Epoch[1] Complete. Time taken: 00:00:00.039 2022-08-23 15:42:07,723 engine DEBUG: 2 | 10, Firing handlers for event Events.EPOCH_STARTED 2022-08-23 15:42:07,724 engine DEBUG: 2 | 10, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,724 engine DEBUG: 2 | 10, Firing handlers for event Events.DATALOADER_STOP_ITERATION 2022-08-23 15:42:07,725 engine DEBUG: 2 | 10, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,725 engine DEBUG: 2 | 11, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,726 engine DEBUG: 2 | 11, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,727 engine DEBUG: 2 | 11, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,727 engine DEBUG: 2 | 11, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,728 engine DEBUG: 2 | 12, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,728 engine DEBUG: 2 | 12, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,729 engine DEBUG: 2 | 12, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,729 engine DEBUG: 2 | 12, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,730 engine DEBUG: 2 | 13, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,733 engine DEBUG: 2 | 13, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,734 engine DEBUG: 2 | 13, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,734 engine DEBUG: 2 | 13, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,735 engine DEBUG: 2 | 14, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,736 engine DEBUG: 2 | 14, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,737 engine INFO: Terminate signaled. Engine will stop after current iteration is finished. 2022-08-23 15:42:07,737 engine DEBUG: 2 | 14, Firing handlers for event Events.TERMINATE_SINGLE_EPOCH 2022-08-23 15:42:07,738 engine DEBUG: 2 | 14, Firing handlers for event Events.TERMINATE 2022-08-23 15:42:07,739 engine INFO: Epoch[2] Complete. Time taken: 00:00:00.015 2022-08-23 15:42:07,739 engine DEBUG: 2 | 14, Firing handlers for event Events.COMPLETED 2022-08-23 15:42:07,740 engine INFO: Engine run complete. Time taken: 00:00:00.058 2022-08-23 15:42:07,741 engine INFO: Engine run resuming from iteration 14, epoch 2 until 4 epochs 2022-08-23 15:42:07,741 engine DEBUG: 2 | 14, Firing handlers for event Events.STARTED 2022-08-23 15:42:07,742 engine DEBUG: 3 | 14, Firing handlers for event Events.EPOCH_STARTED 2022-08-23 15:42:07,743 engine DEBUG: 3 | 14, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,743 engine DEBUG: 3 | 14, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,744 engine DEBUG: 3 | 15, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,745 engine DEBUG: 3 | 15, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,746 engine DEBUG: 3 | 15, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,746 engine DEBUG: 3 | 15, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,747 engine DEBUG: 3 | 16, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,748 engine DEBUG: 3 | 16, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,748 engine DEBUG: 3 | 16, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,749 engine DEBUG: 3 | 16, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,749 engine DEBUG: 3 | 17, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,750 engine DEBUG: 3 | 17, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,750 engine DEBUG: 3 | 17, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,751 engine DEBUG: 3 | 17, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,751 engine DEBUG: 3 | 18, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,754 engine DEBUG: 3 | 18, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,755 engine DEBUG: 3 | 18, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,755 engine DEBUG: 3 | 18, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,756 engine DEBUG: 3 | 19, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,757 engine DEBUG: 3 | 19, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,758 engine DEBUG: 3 | 19, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,758 engine DEBUG: 3 | 19, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,759 engine DEBUG: 3 | 20, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,759 engine DEBUG: 3 | 20, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,761 engine DEBUG: 3 | 20, Firing handlers for event Events.EPOCH_COMPLETED 2022-08-23 15:42:07,761 engine INFO: Epoch[3] Complete. Time taken: 00:00:00.018 2022-08-23 15:42:07,762 engine DEBUG: 4 | 20, Firing handlers for event Events.EPOCH_STARTED 2022-08-23 15:42:07,762 engine DEBUG: 4 | 20, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,763 engine DEBUG: 4 | 20, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,763 engine DEBUG: 4 | 21, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,764 engine DEBUG: 4 | 21, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,764 engine DEBUG: 4 | 21, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,765 engine DEBUG: 4 | 21, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,766 engine DEBUG: 4 | 22, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,766 engine DEBUG: 4 | 22, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,767 engine DEBUG: 4 | 22, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,768 engine DEBUG: 4 | 22, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,769 engine DEBUG: 4 | 23, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,769 engine DEBUG: 4 | 23, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,770 engine DEBUG: 4 | 23, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,771 engine DEBUG: 4 | 23, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,771 engine DEBUG: 4 | 24, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,772 engine DEBUG: 4 | 24, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,773 engine DEBUG: 4 | 24, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,774 engine DEBUG: 4 | 24, Firing handlers for event Events.DATALOADER_STOP_ITERATION 2022-08-23 15:42:07,774 engine DEBUG: 4 | 24, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,775 engine DEBUG: 4 | 25, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,776 engine DEBUG: 4 | 25, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,776 engine DEBUG: 4 | 25, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,777 engine DEBUG: 4 | 25, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,778 engine DEBUG: 4 | 26, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,779 engine DEBUG: 4 | 26, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,780 engine DEBUG: 4 | 26, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,781 engine DEBUG: 4 | 26, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,781 engine DEBUG: 4 | 27, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,782 engine DEBUG: 4 | 27, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,783 engine DEBUG: 4 | 27, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,783 engine DEBUG: 4 | 27, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,784 engine DEBUG: 4 | 28, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,784 engine DEBUG: 4 | 28, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,785 engine DEBUG: 4 | 28, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,785 engine DEBUG: 4 | 28, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,786 engine DEBUG: 4 | 29, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,787 engine DEBUG: 4 | 29, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,787 engine DEBUG: 4 | 29, Firing handlers for event Events.GET_BATCH_STARTED 2022-08-23 15:42:07,788 engine DEBUG: 4 | 29, Firing handlers for event Events.GET_BATCH_COMPLETED 2022-08-23 15:42:07,788 engine DEBUG: 4 | 30, Firing handlers for event Events.ITERATION_STARTED 2022-08-23 15:42:07,789 engine DEBUG: 4 | 30, Firing handlers for event Events.ITERATION_COMPLETED 2022-08-23 15:42:07,790 engine DEBUG: 4 | 30, Firing handlers for event Events.EPOCH_COMPLETED 2022-08-23 15:42:07,791 engine INFO: Epoch[4] Complete. Time taken: 00:00:00.029 2022-08-23 15:42:07,791 engine DEBUG: 4 | 30, Firing handlers for event Events.COMPLETED 2022-08-23 15:42:07,792 engine INFO: Engine run complete. Time taken: 00:00:00.051 1 1 | 0 1 2 | 1 1 3 | 2 1 4 | 3 1 5 | 4 1 6 | 5 1 7 | 6 1 8 | 7 1 9 | 8 1 10 | 9 2 11 | 0 2 12 | 1 2 13 | 2 2 14 | 3 -> terminate 3 15 | 0 3 16 | 1 3 17 | 2 3 18 | 3 3 19 | 4 3 20 | 5 4 21 | 6 4 22 | 7 4 23 | 8 4 24 | 9 4 25 | 0 4 26 | 1 4 27 | 2 4 28 | 3 4 29 | 4 4 30 | 5 (4, 30) ```

We can see that the engine stopped at epoch 2 and when engine is resumed, epochs are counted from 3. The epoch 3 does not count 10 iterations and ends earlier (after 6 iterations).

Maybe, an expected behaviour while resuming could be to count epochs from 2 and count 10 iterations before switching to the epoch 4 ?

@sadra-barikbin what do you think ?

EDIT: a suggestion from https://github.com/pytorch/ignite/pull/2645#discussion_r952675098

It's due to iter_counter local variable (in run_once_on_dataset) which has been initialized as state.iteration % epoch_length (3) and causes engine to stop 7 batches later when iter_counter == epoch_length.

Hence, iter_counter should start with zero.

vfdev-5 commented 2 years ago

I agree with Sadra's comment here: https://github.com/pytorch/ignite/pull/2645#discussion_r952231156 that final iteration should be 34 == epoch_length * (max_epochs - 1) + (iteration_to_stop % epoch_length)