pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai

Epoch is implicitly incremented if terminated on iteration #1386

Open vfdev-5 opened 4 years ago

vfdev-5 commented 4 years ago

🐛 Bug description

The code below reproduces the problem:

from ignite.engine import Engine, Events
from ignite.utils import setup_logger

stop_iter = 50
epoch_length = 100
max_epochs = 2

trainer = Engine(lambda e, b: print(b, end=" "))
trainer.logger = setup_logger("trainer")
state = trainer.state

@trainer.on(Events.ITERATION_COMPLETED(every=stop_iter))
def stop():
    print("--> stop at {}".format(trainer.state.iteration))
    trainer.terminate()

data = list(range(epoch_length))

print("- Start from {} iteration".format(state.iteration))
state = trainer.run(data, max_epochs=max_epochs, epoch_length=epoch_length)
print("- Ended on {} iteration | {} epoch".format(state.iteration, state.epoch))

print("-- Do something else")

print("- Continue from {} iteration".format(state.iteration))
state = trainer.run(data, max_epochs=max_epochs, epoch_length=epoch_length)
print("- Ended on {} iteration | {} epoch".format(state.iteration, state.epoch))

print("-- Do something else")

print("- Continue from {} iteration".format(state.iteration))
state = trainer.run(data, max_epochs=max_epochs, epoch_length=epoch_length)
print("- Ended on {} iteration | {} epoch".format(state.iteration, state.epoch))

print("-- Do something else")

print("- Continue from {} iteration".format(state.iteration))
state = trainer.run(data, max_epochs=max_epochs, epoch_length=epoch_length)
print("- Ended on {} iteration | {} epoch".format(state.iteration, state.epoch))

print("-- Do something else")

print("- Continue from {} iteration".format(state.iteration))
state = trainer.run(data, max_epochs=max_epochs, epoch_length=epoch_length)
print("- Ended on {} iteration | {} epoch".format(state.iteration, state.epoch))

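The bug is that iteration and epoch fall out of sync: each time the run is terminated mid-epoch and then resumed, the epoch counter is incremented even though a full epoch of iterations has not been processed.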
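To make the mismatch concrete, here is a minimal sketch of the relation one would expect between the two counters. It assumes the convention that `epoch` is the epoch currently in progress (or just completed), so `epoch == ceil(iteration / epoch_length)`. The helpers `expected_epoch` and `is_consistent` are hypothetical names used only for illustration; they are not part of ignite.

```python
def expected_epoch(iteration: int, epoch_length: int) -> int:
    """Epoch implied by the iteration counter (hypothetical helper, not ignite API).

    Assumes epoch 0 before any iteration has run, and epoch k while
    iterations (k - 1) * epoch_length + 1 .. k * epoch_length are processed.
    """
    if iteration < 0 or epoch_length <= 0:
        raise ValueError("iteration must be >= 0 and epoch_length must be > 0")
    # Integer ceiling division: ceil(iteration / epoch_length)
    return (iteration + epoch_length - 1) // epoch_length


def is_consistent(iteration: int, epoch: int, epoch_length: int) -> bool:
    """True if the reported epoch matches the one implied by the iteration."""
    return epoch == expected_epoch(iteration, epoch_length)


# With epoch_length = 100, iteration 50 belongs to epoch 1,
# and so does iteration 100 (the last iteration of epoch 1).
print(expected_epoch(50, 100), expected_epoch(100, 100))
# A state reporting iteration 100 together with epoch 2 is therefore inconsistent.
print(is_consistent(100, 2, 100))
```

Under this assumption, a state reporting iteration 100 together with epoch 2 (with `epoch_length = 100`) would be flagged as inconsistent.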

Environment

sparkingdark commented 3 years ago

@vfdev-5 is this solved? What is the error? This is what I got when running it:

$ python test.py
- Start from 0 iteration
2021-02-16 17:33:38,589 trainer INFO: Engine run starting with max_epochs=2.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 --> stop at 50
2021-02-16 17:33:38,591 trainer INFO: Terminate signaled. Engine will stop after current iteration is finished.
2021-02-16 17:33:38,591 trainer INFO: Epoch[1] Complete. Time taken: 00:00:00
2021-02-16 17:33:38,591 trainer INFO: Engine run complete. Time taken: 00:00:00
- Ended on 50 iteration | 1 epoch
-- Do something else
- Continue from 50 iteration
2021-02-16 17:33:38,591 trainer INFO: Engine run resuming from iteration 50, epoch 1 until 2 epochs
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 --> stop at 100
2021-02-16 17:33:38,593 trainer INFO: Terminate signaled. Engine will stop after current iteration is finished.
2021-02-16 17:33:38,593 trainer INFO: Epoch[2] Complete. Time taken: 00:00:00
2021-02-16 17:33:38,593 trainer INFO: Engine run complete. Time taken: 00:00:00
- Ended on 100 iteration | 2 epoch
-- Do something else
- Continue from 100 iteration
2021-02-16 17:33:38,593 trainer INFO: Engine run resuming from iteration 100, epoch 2 until 2 epochs
2021-02-16 17:33:38,593 trainer INFO: Engine run complete. Time taken: 00:00:00
- Ended on 100 iteration | 2 epoch
-- Do something else
- Continue from 100 iteration
2021-02-16 17:33:38,593 trainer INFO: Engine run resuming from iteration 100, epoch 2 until 2 epochs
2021-02-16 17:33:38,593 trainer INFO: Engine run complete. Time taken: 00:00:00
- Ended on 100 iteration | 2 epoch
-- Do something else
- Continue from 100 iteration
2021-02-16 17:33:38,593 trainer INFO: Engine run resuming from iteration 100, epoch 2 until 2 epochs
2021-02-16 17:33:38,593 trainer INFO: Engine run complete. Time taken: 00:00:00
- Ended on 100 iteration | 2 epoch
vfdev-5 commented 3 years ago

@sparkingdark no explicit error is raised here, but the epoch value is wrong. Here is a snippet with a more explicit epoch check:

from ignite.engine import Engine, Events
from ignite.utils import setup_logger

stop_iter = 2
epoch_length = 15
max_epochs = 5

trainer = Engine(lambda e, b: print(b, end=" "))
trainer.logger = setup_logger("trainer")
state = trainer.state

@trainer.on(Events.ITERATION_COMPLETED(every=stop_iter))
def stop():
    print("--> stop at {}".format(trainer.state.iteration))
    trainer.terminate()

data = list(range(epoch_length))

print("- Start from {} iteration".format(state.iteration))
state = trainer.run(data, max_epochs=max_epochs, epoch_length=epoch_length)
print("- Ended on {} iteration | {} epoch".format(state.iteration, state.epoch))

print("-- Do something else")

print("- Continue from {} iteration".format(state.iteration))
state = trainer.run(data, max_epochs=max_epochs, epoch_length=epoch_length)
print("- Ended on {} iteration | {} epoch".format(state.iteration, state.epoch))

print("-- Do something else")

print("- Continue from {} iteration".format(state.iteration))
state = trainer.run(data, max_epochs=max_epochs, epoch_length=epoch_length)
print("- Ended on {} iteration | {} epoch".format(state.iteration, state.epoch))

print("-- Do something else")

assert state.epoch == 1, state.epoch

Also note that on resume we do not continue iterating over the data from where we stopped; we restart from the first samples, which is wrong as well.
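As an illustration of the expected behaviour, the following sketch computes which samples of the current epoch are still unprocessed after a mid-epoch stop. It is not ignite's implementation: `remaining_batches` is a hypothetical helper, and it assumes the data is a re-iterable sequence with a deterministic order.

```python
from itertools import islice


def remaining_batches(data, iteration: int, epoch_length: int) -> list:
    """Batches of the current epoch not yet processed (hypothetical helper).

    Assumes `data` yields batches in the same order every time it is iterated.
    """
    if iteration < 0 or epoch_length <= 0:
        raise ValueError("iteration must be >= 0 and epoch_length must be > 0")
    # Position inside the current epoch; 0 means a new epoch is starting.
    offset = iteration % epoch_length
    return list(islice(data, offset, epoch_length))


data = list(range(15))
# After stopping at iteration 2 with epoch_length = 15, a resumed run
# would be expected to start from sample 2 rather than from sample 0.
print(remaining_batches(data, iteration=2, epoch_length=15))
```

With shuffled or streaming data this simple offset is not enough, since the order or position of the underlying iterator would also have to be restored.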

sparkingdark commented 3 years ago

Okay, so we need a fix that resumes from the current iteration and position in the data. Am I correct?

vfdev-5 commented 3 years ago

Well, this is a bit complicated to fix as is. I think it will be addressed by the Engine refactor that I started some time ago...

sparkingdark commented 3 years ago

Okay, so should I try to solve it or look into other issues, @vfdev-5?

vfdev-5 commented 3 years ago

I'd suggest looking at other "help wanted" issues: https://github.com/pytorch/ignite/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22