pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai
BSD 3-Clause "New" or "Revised" License
4.55k stars 619 forks source link

Engine - Continue run throws error because epoch_length is None #1331

Open alxlampe opened 4 years ago

alxlampe commented 4 years ago

🐛 Bug description

I am using an engine such that it runs a previously undefined number of iterations with an infinite iterator. Epoch will be terminated in the process function with engine.terminate().

I am always running the engine for only one epoch, because I need to do some calculations outside the engines framework and then continue the engine to run one further epoch with engine.run(itertools.counter(start=0), max_epochs=engine.state.epoch + 1).

I get an error in engine._is_done because engine.state.epoch_length is None. Is this the intended behavior? Do I have to set epoch_length explicitly?

Here is a minimum example to reproduce the error:

import itertools

from ignite.engine import Engine

def _run_iteration(engine, counter):
    if counter > 2:
        # same behavior observed for engine.terminate_epoch()
        engine.terminate()

if __name__ == '__main__':
    engine = Engine(_run_iteration)
    # run one epoch
    engine.run(itertools.count(start=0), max_epochs=engine.state.epoch + 1)
    # epoch length is None after run:
    print(engine.state.epoch_length)
    # running another epoch throws the error:
    engine.run(itertools.count(start=0), max_epochs=engine.state.epoch + 1)

Thanks in advance!

Environment

vfdev-5 commented 4 years ago

@alxlampe thanks for reporting. I think the issue is related to https://github.com/pytorch/ignite/issues/1259 and if you set max_epochs=None between runs you can restart the engine:

    engine = Engine(_run_iteration)
    # run one epoch
    engine.run(itertools.count(start=0), max_epochs=engine.state.epoch + 1)
    # epoch length is None after run:
    print(engine.state)
    # running another epoch throws the error:
    engine.state.max_epochs = None
    engine.run(itertools.count(start=0), max_epochs=engine.state.epoch + 1)

I agree that the issue we see here is a bug and we should handle it in a user friendly way.

I was wondering why you have to restart the engine like that ? Is it possible to use engine.terminate_epoch and execute "some calculations outside the engines framework" in a handler attached to TERMINATE_SINGLE_EPOCH event ?

alxlampe commented 4 years ago

@vfdev-5 Thanks for your fast reply and your proposal.

First to answer your questions.

I was wondering why you have to restart the engine like that ?

I am using nested engines. One engine drives the execution of multiple engines (tasks). In my case, data that is passed to my process function is a list of tasks. Some details:

In the optimal case, Engine would provide me some interface to split the run method into subcommands, e.g.

I was already thinking about creating a new issue to ask if it is possible to separate the event handler mechanics from the engine and build the engine on top it. That would allow to create fully customized engines, e.g. engines that only require an iteration loop but no epoch loop. But I still have to dive deeper into the code to check if this makes any sense and if so, would create an issue with an initial proposal :)

Is it possible to use engine.terminate_epoch and execute "some calculations outside the engines framework" in a handler attached to TERMINATE_SINGLE_EPOCH event ?

In my case, I don't see the benefit of using TERMINATE_SINGLE_EPOCH instead of EPOCH_COMPLETED, because it doesn't matter, if engine.terminate_epoch was called or if the epoch ends without this call. I just recently switched to a newer version of ignite and the event wasn't present before. Perhaps I am missing something.

Coming back to your proposal to set engine.state.max_epochs = None before calling engine.run again. It's working but resets the epoch counter which I wanted to use for counting how often the task was executed. When I set engine.state.max_epochs = None I have to use max_epochs=1 for each call of engine.run otherwise for each run, the engine runs 1 epoch at the first call, 2 epochs at the second call, 3 epochs at the third cann etc. See example below:

    engine = Engine(_run_iteration)
    engine.state.persistent_across_runs = 0
    # run one epoch
    engine.run(itertools.count(start=0), max_epochs=engine.state.epoch + 1)
    # outputs "1"
    print(engine.state.persistent_across_runs)

    # run another epoch
    engine.state.max_epochs = None
    engine.run(itertools.count(start=0), max_epochs=engine.state.epoch + 1)
    # outputs "3" since engine runs 2 epochs in the second run
    print(engine.state.persistent_across_runs)

Since I am subclassing Engine I can simply override the _is_done method to only be taken into account if epoch_length is set.

vfdev-5 commented 4 years ago

@alxlampe thanks for the details !

Yes, that's true that engine.state.max_epochs = None is to restart the Engine from the begining and not continue. Currently, if we would like to continue, epoch_length is sort of implicitly required. It is a bug as in your case user can stop the execution during the first epoch of a run with epoch_length=None and epoch_length is not set... Let me think what can be done for that. Maybe, a solution is to set unknown epoch_length on engine.terminate_epoch() or engine.terminate()...

I was already thinking about creating a new issue to ask if it is possible to separate the event handler mechanics from the engine and build the engine on top it. That would allow to create fully customized engines, e.g. engines that only require an iteration loop but no epoch loop. But I still have to dive deeper into the code to check if this makes any sense and if so, would create an issue with an initial proposal :)

Yes, this is something to explore

In the optimal case, Engine would provide me some interface to split the run method into subcommands ...

A refactor like that may also make sense... Let's see what is more interesting: this point or above one.

vfdev-5 commented 4 years ago

@fco-dv could you please take a look at this issue and propose a way to fix it along the way with https://github.com/pytorch/ignite/pull/1333 which actually shouldn't introduce state.restart as it was discussed in the corresponding issue.

vfdev-5 commented 4 years ago

@fco-dv any thoughts/updates on this ?