pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai
BSD 3-Clause "New" or "Revised" License

INFO: Resuming from iteration for provided data will fetch data until required iteration ... #3247

Open H4dr1en opened 2 months ago

H4dr1en commented 2 months ago

❓ Questions/Help/Support

I noticed a message in the training logs that I don't understand. Could you please clarify what happens here and why?

INFO: Resuming from iteration for provided data will fetch data until required iteration ...

This happens for all validation engines I have, that I create as follows:

    # Run each validation engine once after every training epoch.
    for valid_dataset_name, valid_engine in valid_engines.items():
        valid_loader = valid_loaders[valid_dataset_name]
        train_engine.add_event_handler(Events.EPOCH_COMPLETED, partial(valid_engine.run, data=valid_loader))

Note: I use DeterministicEngine for all engines (training and validation).

vfdev-5 commented 2 months ago

The message means that the deterministic engine is trying to resume the run from a non-zero iteration. For deterministic engines, we have to rewind the dataloader up to the resuming iteration; otherwise the random state won't be fully respected (there may be more context here: https://pytorch.org/ignite/engine.html#dataflow-synchronization).
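To make the "rewind" idea concrete, here is a toy sketch in plain Python. It is not ignite's implementation; `batches` and `resume_from` are hypothetical names used only for illustration. It assumes the batch order is fully determined by a seed, so resuming at iteration k means restarting the seeded stream and discarding the first k batches.

```python
import random


def batches(data, batch_size, seed):
    """Yield shuffled batches; the order depends only on `seed`."""
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


def resume_from(data, batch_size, seed, iteration):
    """Restart the seeded stream and discard the first `iteration` batches."""
    stream = batches(data, batch_size, seed)
    for _ in range(iteration):
        next(stream)  # fetched and dropped so the stream reaches the resume point
    return stream


data = list(range(10))
full_run = list(batches(data, batch_size=2, seed=0))
resumed = list(resume_from(data, batch_size=2, seed=0, iteration=3))

# The resumed stream continues exactly where an uninterrupted run would be.
print(resumed == full_run[3:])
```

The discarded batches are the cost of this approach: data has to be fetched up to the required iteration before the run can continue, which matches the wording of the log message.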

Given the code you provided, I would say this looks more like a bug. Probably, valid_engine was stopped at some point before completing a full epoch and was then called to run again...

H4dr1en commented 2 months ago

Given the code you provided, I would say this looks more like a bug.

Do you mean a bug in ignite or in my code?

Probably, valid_engine was stopped at some point before completing a full epoch and was then called to run again...

I am not stopping any of the valid_engines; they all run for a single full epoch of validation after each training epoch.

vfdev-5 commented 2 months ago

Do you mean a bug in ignite or in my code?

Difficult to say from this alone. Could you provide more code to reproduce the issue?
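While putting a repro together, it may help to log where each validation engine stands just before it is run again. Below is a minimal sketch of such a check. `describe_progress` is a hypothetical helper, and it assumes the state object exposes `iteration`, `epoch`, `epoch_length` and `max_epochs` attributes; a `SimpleNamespace` stands in for `valid_engine.state` so the snippet runs on its own.

```python
from types import SimpleNamespace


def describe_progress(state):
    """Summarize where an engine stands, given its `state` object."""
    epoch_length = state.epoch_length
    if not epoch_length:
        return f"iteration={state.iteration}, epoch={state.epoch}, epoch_length unknown"
    leftover = state.iteration % epoch_length
    status = "mid-epoch" if leftover else "at an epoch boundary"
    return (
        f"iteration={state.iteration}, epoch={state.epoch}, "
        f"max_epochs={state.max_epochs}, {status}"
    )


# Stand-in for `valid_engine.state`, used only to keep the sketch self-contained.
fake_state = SimpleNamespace(iteration=7, epoch=0, epoch_length=10, max_epochs=1)
print(describe_progress(fake_state))
```

Calling something like this on each validation engine's state right before `valid_engine.run(...)` would show whether an engine was left in the middle of an epoch, which is the situation suspected above.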