pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai

Training State Centered Framework vs. Engine Centered Architecture #810

Open DrStoop opened 4 years ago

DrStoop commented 4 years ago

Hi!

I was thinking for some time about the current architecture of Ignite and how to improve the workflow, both during application development (writing training scripts) and during feature development. When writing training scripts I could not code the individual training I wanted, and when trying to code the feature for my training script I ended up writing infrastructure/architecture code instead of implementing the feature. There were some kinds of restrictions I at first couldn't identify...

Nevertheless, after a while I found 2 twists, actually nothing big... so here I want to come up with an under-the-hood framework to integrate into Ignite that solved all my problems and many open issues! Nice, right?

With this framework integrated into Ignite you get an extremely nice overview during debugging, turn Ignite into a rapid feature-dev tool, can handle far more complex (individual) use cases while achieving a higher degree of automation at the same time, and get quite some new features and many more possibilities for syntactic sugar.

But now comes the... BUT: as I tried to fix it, I unfortunately had to realize it won't work without major revisions. So I went a long way to really provide proof and facts - something you can play with to make up your opinion - before daring to suggest a major revision...

So, if you're interested in an up-and-running "what-if-when"-Ignite version with the 2 twists below untwisted, please have a look at the repository and the documentation and leave me feedback - I'd really like to know your opinion.

In case enough of you like it and could imagine integrating the framework into Ignite, I could open a pull request with the code on an experimental branch and we see how it goes from there. (Note: I just pushed it to another repository because, as far as I know, you cannot open a pull request for a new branch - which this definitely needs.)

Everything else you need to know you will find in the Ignite Framework repo and the docs. For bugs & questions, let me know, thx!

So, set up your first coffee & enjoy playing!

Teaser from the documentation

Two issues

I am a fan of Ignite and that's why I'm trying to contribute, but I discovered 2 shortcomings in the architecture and the implementation that caused me quite some restrictions and had me coding infrastructure instead of programming new features (what I actually wanted to do). The issues are:

Improvements from an underlying framework

You will experience the improvements given by the framework when working on all 3 levels: application implementation, feature development and framework development. The separation of these working areas is already the first improvement. Try out the benefits in detail & hands-on for the first two levels in the Quickunderstanding Application and Quickunderstanding Feature Dev.

Before you go through the theoretically described enhancements, these few no-comment teasers of the training state in the debugger will give you nice insights into what's ahead. They show the Ignite example mnist_with_tensorboard.py transferred to the framework architecture, just before the engines are started:

teaser_state_in_debugger

And here the engine state.engines.trainer unfolded:

teaser_engine_in_debugger

Or setting up all the below Tensorboard charts with these two simple commands:

# Automatically identify and generate metric charts comparing the different engines
EnginesMetricsComparisonCharts(x_axis_ref=state.trainer.n_samples_ref, n_identical_metric_name_suffixes=1)
# Automatically generate for each engine a summary of all metric charts
EnginesMetricsCharts(x_axes_refs=state.trainer.n_samples_ref, n_identical_metric_name_suffixes=1)

By the way, if you had set up 10x more metrics and some more engines, these two commands would not change and would still provide all comparative and single metric charts of all engines.

teaser_tensorboard

Soooo, if you're interested, then grab a coffee and press >>>PLAY<<<!

justusschock commented 4 years ago

Hi,

First of all: thank you very much for your work, this looks awesome!

I had a look at your code and what you did there is impressive. I have a few things to comment on:

1.) You are right that it has some restrictions using a transient state, and I often had to implement workarounds as well, but on the other hand, I really don't like the idea of a state that contains virtually everything. That way, your scope would be way too global and it will become hard to track which variable is used and set where (for the devs as well as for the users). So IMO we should opt for more entries in the state, but it should definitely not contain everything.

2.) Regarding the events: You're right, basically every value change could trigger some event. This would, however, require lots of implicit events, and for most cases you probably don't need them. In your training loop you usually just have a few necessary events (Start, End, Epoch Start/End, Iteration Start/End...) which are already predefined (a minimal sketch follows below). I agree that one could probably add some more, but for 99% of the cases these are enough. And this also follows Python's core philosophy that explicit is better than implicit.

3.) The engine is supposed to be the trainer and validator (with the respective arguments) but should not hold them. It is supposed to be a kind of general-purpose interface with a tested way to do the looping.
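For reference, a minimal sketch of the predefined-events API point 2 refers to, using the standard ignite Engine (the dummy train_step and the toy data are made up for illustration):

```python
from ignite.engine import Engine, Events

def train_step(engine, batch):
    return 0.0  # dummy update step standing in for a real training iteration

trainer = Engine(train_step)

@trainer.on(Events.EPOCH_COMPLETED)  # one of the few predefined events mentioned above
def log_epoch(engine):
    print(f"epoch {engine.state.epoch} finished at iteration {engine.state.iteration}")

trainer.run([0, 1, 2], max_epochs=2)
```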

In general I really like your points (maybe not each part of the implementation, but that's okay I guess). So what I would ultimately propose:

We wait for the opinions of the other guys (cc @vfdev-5 @ykumards @anmolsjoshi ) and if they are okay, I would suggest to start with the actual implementation in the following way:

I) We (you and me together) will start implementing a richer state system, which may be less transient and hold more entries, but still won't be global

II) If we all agreed on that state, we will probably have a look at where we want to add additional hooks for events and if we somehow can simplify the process of registering them.

III) After this is done, we will most likely have to revisit your framework and look, what's still missing from there and if it makes sense to actually include this here.

For all of this I would suggest to add this in parallel to the existing interface to avoid breaking changes and to be consistent with the current API wherever we can. What do you think @DrStoop ?

vfdev-5 commented 4 years ago

@DrStoop thanks for such a huge feature request! There are a lot of points raised, a lot of text/code to understand, but I think it's worth the effort!

I'm trying to figure out the two issues you stated about the current implementation: 1) Engine centered vs State centered 2) Ignite's Events vs variable-triggered events

Concerning "Engine centered vs State centered", well, I understand your point in some sense, but state with command run looks semantically weird... I totally agree that a "training" process is about state evolution and for me it is Engine who drives this intead of the object itself... Ultimately, maybe, we need to have something that could build upon multiple engines...

I found this link : https://drstoop.github.io/ignite_framework/index.html#resulting-ignite-limitations helpful to understand your points.

I experienced limitations when working with more complex use cases, e.g. multiple engines (up to 3 very common) with customized events and metrics

Could you provide more details or a code snippet for this please?

PS: If you are in the pytorch slack, we can discuss this more fluently in the #pytorch-ignite channel...

DrStoop commented 4 years ago

hi @vfdev-5,

didn't expect such a fast reply ;) thx! although i was totally committed to ignite & its concepts & features, there's still quite some difference which may take a bit to fully grasp.

but you're faster than me it seems, your first point, state.run, is already a good point i also stumbled over. regarding the full concept i left it like that for the moment. issues like that are easily changed, i added it to my list.

there are definitely a lot of details & names and features that for sure don't fit yet, but for the moment it's mostly about the main concept's usability. ah, plus, i just wanted to push the project (as incomplete as it is) to avoid total organizational blindness ;)

i'll catch up on this pytorch slack in a few hours & provide you the requested stuff :)

DrStoop commented 4 years ago

Hi @justusschock,

man, you guys are really fast! I was hoping to get a few hours of downtime for that much text and code ;-)

You hit the points I was also struggling with, especially with 1.): you're pointing out this global state I really didn't feel good about in the beginning! But in the end, I set those worries aside. Find some of my thoughts below; that's definitely the main discussion point of this concept, with pros & cons!

Your proceeding plan sounds good, but there is one thing I would avoid based on this PR, i.e. starting to change Ignite or integrating this PR in any way too early: there is one question in this PR that has to be answered before proceeding with it:

Integrate framework or stick to current architecture?

This suggested framework may have a hidden show stopper I have overlooked, or be a bad proposal for the current Ignite. But if we start coding in one direction without answering this question first, we may regret having wasted our time ;)

Personally, I implemented this framework based on 2 examples and learned quite a lot when I transferred the other existing Ignite examples. Even though I know the code, I wanted to go for some applications first to compare it with current Ignite in different use cases (I even lost track of Ignite's latest updates!).

I would recommend you try to code a training script or transfer a metric or another feature of current Ignite to this framework; that should already give a good insight. Especially ask me all those questions the missing documentation leaves open and raise complaints about what su@!s. Maybe there's a solution already implemented for that, or I'll learn a lesson.

One way or the other - the discussion is on! So find my comments below :)

Hi,

First of all: thank you very much for your work, this looks awesome!

I had a look at your code and what you did there is impressive. I have a few things to comment on:

1.) You are right that it has some restrictions using a transient state, and I often had to implement workarounds as well, but on the other hand, I really don't like the idea of a state that contains virtually everything. That way, your scope would be way too global and it will become hard to track which variable is used and set where (for the devs as well as for the users). So IMO we should opt for more entries in the state, but it should definitely not contain everything.

Right point! My thoughts about it, in bullets, were:

2.) Regarding the events: You're right, basically every value change could trigger some event. This would, however, require lots of implicit events, and for most cases you probably don't need them. In your training loop you usually just have a few necessary events (Start, End, Epoch Start/End, Iteration Start/End...) which are already predefined. I agree that one could probably add some more, but for 99% of the cases these are enough. And this also follows Python's core philosophy that explicit is better than implicit.

I agree on this point and that's why I have already implemented the solution for it:

3.) The engine is supposed to be the trainer and validator (with the respective arguments) but should not hold them. It is supposed to be a kind of general-purpose interface with a tested way to do the looping.

In general I really like your points (maybe not each part of the implementation, but that's okay I guess). So what I would ultimately propose:

We wait for the opinions of the other guys (cc @vfdev-5 @ykumards @anmolsjoshi ) and if they are okay, I would suggest to start with the actual implementation in the following way:

I) We (you and me together) will start implementing a richer state system, which may be less transient and hold more entries, but still won't be global

II) If we all agreed on that state, we will probably have a look at where we want to add additional hooks for events and if we somehow can simplify the process of registering them.

III) After this is done, we will most likely have to revisit your framework and look, what's still missing from there and if it makes sense to actually include this here.

For all of this I would suggest to add this in parallel to the existing interface to avoid breaking changes and to be consistent with the current API wherever we can.

As mentioned above, I am fine with any outcome, but I suggest a good discussion based on playing with the framework and comparing it with current Ignite, then deciding how much & what to integrate from the framework, then going for it!

What do you think @DrStoop ?

Please ask stuff or let me know about your issues. Does everything work? Bugs?

thx & greets!

DrStoop commented 4 years ago


hey, didn't know pytorch-slack is restricted to certain company emails. i'm not working for any of them. (...& none of them belong to me either ;-) )

justusschock commented 4 years ago

I really think the best way is to move forward and start implementing parts of your proposals. I'm sure we'll encounter some problems during that, but I think this is the only way to really get an overview here. Regarding the namespace: I think we should carefully think about what we're adding. So I would not like to add everything just because we can. But I agree that some stuff would be beneficial. But I think these are too many details to discuss them all without an implementation proposal.

And this is also why I suggested the implementation order above: we start with the easy things that would most likely be a good idea and move on to the things we need to discuss based on these implementation details.

DrStoop commented 4 years ago

Okay @justusschock, this is for sure a way to get through it. I'll try to help set up an implementation plan by sorting your points (and maybe adding a point here & there) into an order to implement, starting with the major changes:

  1. Should state be transient?
  2. state vs. Engine.state scope?
  3. What's the right scope for a state?
  4. "Stick Events together"
  5. 1-on-1 Feature-for-Task

1. Should state be transient?

Is there a real reason to let Engine.state disappear? It's instantiated with no dynamic values and doesn't depend on anything special, as far as I understand:

    # Snippet from ignite.engine.Engine
    def load_state_dict(self, state_dict: Mapping) -> None:
        """Setups engine from `state_dict`.
        ...
        """
        ...
        self.state = State(
            seed=state_dict["seed"],
            max_epochs=state_dict["max_epochs"],
            epoch_length=state_dict["epoch_length"],
            metrics={},
        )

Possible implementation: This may be something that could be changed quickly, right? I cannot find any pros for letting state disappear, or am I missing something? Just initialize self.state in Engine.__init__() and get rid of all the if engine.state is None: checks everywhere.
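A rough sketch of what that could look like (not the actual Ignite code; the simplified Engine below is only to illustrate the idea):

```python
from typing import Callable

from ignite.engine import State


class Engine:  # simplified stand-in for ignite.engine.Engine, just for illustration
    def __init__(self, process_function: Callable):
        self._process_function = process_function
        # state exists from construction on (instead of being created lazily in run()
        # or load_state_dict()), so no `if engine.state is None:` checks are needed
        self.state = State(seed=None, max_epochs=None, epoch_length=None, metrics={})
```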

2. state vs. Engine.state scope?

There are 2 aspects involved:

Whatever state scope we discuss in the next point, here we split it by 2 or 3 (or...), depending on the number of engines. The current Ignite Engine.state scope actually also includes all attributes of Engine; the only difference is whether they disappear or not:

Possible implementations regarding above:

3. What's the right scope for a state?

Possible implementations regarding above:

4. "Stick Events together"

Possible implementations regarding above:

Puh!!! It's hard...ideas?

5. 1-on-1 Feature-for-Task

Entangling multiple features in one class causes code redundancies with all its bad consequences:

Possible implementations regarding above:

Split all features so they only provide exactly one Ignite-task:

Conclusion

So above are just some possible improvements extracted from the framework to implement in the current Ignite architecture. When you think about implementing some of them, this will affect more or less all classes and definitely will not be straightforward! On the other hand, the proposed framework solves all these problems already (and many more) and transferring classes is really straightforward, often simplifying the feature code... So above are the first improvement proposals, let's think about it :-)

vfdev-5 commented 4 years ago

hey, didn't know pytorch-slack is restricted to certain company emails. i'm not working for any of them. (...& none of them belong to me either ;-) )

@DrStoop please give me your email and I'll try to get you an invitation

vfdev-5 commented 4 years ago

Let's try to discuss about each point separately:

1. Should state be transient?

Why is state None at the beginning and reset at each run? Related questions: https://github.com/pytorch/ignite/issues/297 and https://github.com/pytorch/ignite/pull/640

This is "historiacal" and sort of relies on a pattern we would like to adopt. Currently, we have the following logic:

- At the first call, new state is defined by `max_epochs`, `epoch_length`, `seed` if provided.
- If state is already defined such that there are iterations to run until `max_epochs` and no input arguments provided, state is kept and used in the function.
- If state is defined and engine is "done" (no iterations to run until `max_epochs`), a new state is defined.
- If state is defined, engine is NOT "done", then input arguments if provided override defined state.

This covers the cases:

As of today we added the logic to set up the engine's state using load_state_dict, so maybe we can reconsider Engine.state = None at __init__. However, we should not break the covered use cases.
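Rough pseudocode of that resolution logic (just a sketch of the rules above, not the actual implementation; `_internal_run` is a made-up name):

```python
def run(self, data, max_epochs=None, epoch_length=None, seed=None):
    args_provided = any(v is not None for v in (max_epochs, epoch_length, seed))

    if self.state is None:
        # first call: build a fresh state from the provided arguments
        self.state = State(seed=seed, max_epochs=max_epochs, epoch_length=epoch_length, metrics={})
    elif self.state.epoch >= self.state.max_epochs:
        # engine is "done" (no iterations left until max_epochs): define a new state
        self.state = State(seed=seed, max_epochs=max_epochs, epoch_length=epoch_length, metrics={})
    elif args_provided:
        # engine is NOT done and arguments were provided: they override the kept state
        self.state.max_epochs = max_epochs if max_epochs is not None else self.state.max_epochs
        self.state.epoch_length = epoch_length if epoch_length is not None else self.state.epoch_length
        self.state.seed = seed if seed is not None else self.state.seed
    # otherwise: keep the existing state as-is and resume

    return self._internal_run(data)
```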

vfdev-5 commented 4 years ago

@DrStoop ping on

I experienced limitations when working with more complex use cases, e.g. multiple engines (up to 3 very common) with customized events and metrics

Could you provide more details or a code snippet for this please?

justusschock commented 4 years ago

Okay, let me just say my thoughts on this:

1.) I agree that the state must not be transient if it does not hold things like the model and the optimizer. But we have to reset it properly.

2.) The point of splitting up the state is that in your run not all engines might always be related to one model. This is why we are aiming for most flexibility and thus the most local scope possible.

3.) We can have the scope you propose if we have some stuff (like the data loaders) wrapped with weak refs (minimal sketch below).

Regarding 4. and 5.) I would really postpone this until we decided and implemented the parts 1.)-3.) Since this will most likely change otherwise.
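As a minimal illustration of the weak-ref idea in point 3 (purely a sketch; `DummyLoader` and the bare `state` namespace are made up):

```python
import weakref
from types import SimpleNamespace


class DummyLoader:  # stand-in for a torch DataLoader (plain objects support weak references)
    def __iter__(self):
        return iter([("x0", "y0"), ("x1", "y1")])


train_loader = DummyLoader()
state = SimpleNamespace()                     # stand-in for a richer, less transient state
state.dataloader = weakref.ref(train_loader)  # the state holds only a weak reference,
                                              # so it does not keep the loader alive by itself

loader = state.dataloader()                   # dereference; returns None once the loader
if loader is not None:                        # has been garbage collected elsewhere
    for batch in loader:
        print(batch)
```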

DrStoop commented 4 years ago

Sure, my email: berr@gmx.de, real name if required: Maximilian Berr.... thx @vfdev-5 !

give me a sec on the multi engine snippets... coming up

vfdev-5 commented 4 years ago

@DrStoop I sent a request to send you an invitation. Probably, we need to wait until the morning in US/NYC zone...

DrStoop commented 4 years ago

@DrStoop ping on

I experienced limitations when working with more complex use cases, e.g. multiple engines (up to 3 very common) with customized events and metrics

Could you provide more details or a code snippet for this please?

@vfdev-5 regarding 3 engines:

A simple example for the metrics of 3 engines compared in one chart per metric type: mnist_with_tensorboard_logger_and_low_level_apis.py

You have 3 engines state.trainer (check state.engines.trainer), state.xvalidator & state.evaluator with different process functions and dataloaders (and, if you wanted, also different metrics, but here they're the same):

Snippet from example:

### ENGINES ###
#==============

# Trainer
# NOTE:
# - Below the maximum setting of default values is used.
# - The `Engine` is assigned to `state.engines.trainer`.
# - The argument `engine_run_started_ref=None` means that the calling state object is not set, so `Engine` will
#   NOT be called. (Default values are always placeholdered by ``''``)
Engine(name='trainer',
       process=state._update, dataloader=state.trainer_loader,
       engine_run_started_ref=state.state_run_started_ref)

# X-Validator
# NOTE:
# - To parametrize the engine starting callback use:
#   `engine_run_started_ref=state.trainer.get('n_samples_every', state.n_xval_step_samples, 'ref')`
Engine(name='xvalidator',
       process=state._infer,
       dataloader=state.xvalidator_loader,
       engine_run_started_ref=state.trainer.get('n_samples_every', state.n_xval_step_samples, 'ref'),
       n_samples=state.n_xval_samples)

# Evaluator
# NOTE: Here is demonstrated how arguments can also be skipped during initialization and provided afterwards,
#       namely `dataloader` and `engine_run_started_ref`
Engine(name='evaluator',
       process=state._infer,
       n_epochs=state.n_eval_epochs)
# Setting up `state.evaluator`'s dataloader and callback after initialization
state.trainer.engine_run_completed = ('start manually appended `state.evaluator.run` with `state.evaluator_loader`',
                                      state.evaluator.run, {'dataloader': state.evaluator_loader})

Now, the following is an API you could not implement in current Ignite without using another_engine and a_third_engine, because features normally only reference one engine, or am I wrong? For the comparative multi-graph charts in this script you'll find (snippet from the example):

# Accuracy charts of all engines, 3 metrics in 1 chart
# NOTE: Here the maximum default values are used, setting as little as necessary
# EXTRA NOTE: Leaving out the `chart_name` in charts with many `y_axes_refs` usually leads to very longish chart names
ScalarsChart(x_axis_ref=state.trainer.n_samples_ref,
             y_axes_refs=[state.get(trainer_accuracy_name).output_ref,
                          state.get(xvalidator_accuracy_name).output_ref,
                          state.get(evaluator_accuracy_name).output_ref],
             caller_refs=[state.get(trainer_accuracy_name).output_ref,
                          state.xvalidator.engine_run_completed_ref,
                          state.evaluator.engine_run_completed_ref])

# Loss charts of all engines, 3 metrics in 1 chart
# NOTE:
# - Here maximum manual individualization is used; some arguments are redundant in the sense that they
#   equal (or behave equally to) the default value, e.g. the `caller_refs` would result in the same chart.
# - The `walltime_ref` uses the state timer `state.maintenance.state_timer.state_run_started_time_in_sec`
ScalarsChart(x_axis_ref=state.trainer.n_samples_ref,
             y_axes_refs=[state.get(trainer_loss_name).output_ref,
                         state.get(xvalidator_loss_name).output_ref,
                         state.get(evaluator_loss_name).output_ref],
             caller_refs=[state.trainer.n_samples_every_100_ref,
                          state.xvalidator.engine_run_completed_ref,
                          state.evaluator.engine_run_completed_ref],
             y_names=['trainer_loss','xvalidator_loss', 'evaluator_loss'],
             chart_name='engines_loss_comparison',
             bso_ctr_name_suffix='manually_set_chart',
             summary_writer=state.tensorboard.summary_writer,
             summary_description='manually written description for summary adding',
             walltime_ref=state.state_run_started_time_in_sec_ref)

This results in the Tensorboard charts posted above (where I should mention that the upper charts evaluator_trainer_xvalidator/accuracy & loss do not yet plot the evaluator graph because I stopped the engines before it was calculated). Note: the high-level code line below replaces all the above ScalarsChart()s, and would do so for any number of engines and metrics (snippet from here):

    # Automatically identify and generate metric charts comparing the different engines
    EnginesMetricsComparisonCharts(x_axis_ref=state.trainer.n_samples_ref, n_identical_metric_name_suffixes=1)

Now, you may argue that state.xvalidator and state.evaluator have the same process function & could just be fed with different dataloaders. Yes, that's right, and you can also do that in the framework. Still you have 2 engines to deal with in an API for comparative metric charts. And normally you have different processes attached to state.evaluator and state.xvalidator, i.e. state.xvalidator's loss metric may trigger the early stopping and model checkpointing and have different metrics compared to the final state.evaluator that only calculates the end result with a loss metric.

Parallel/Distributed computing

When thinking about complex use cases, one idea for simple parallel/distributed computing would be to implement all required infrastructure into, let's say, a StateProcessPoolVariable that replaces the current (slightly modified) callbacks list with a multiprocessing.Pool (or a StateThreadPoolVariable for multiprocessing.pool.ThreadPool). You use this StateProcessPoolVariable for state.state_run_started and append all kinds of engines training in parallel. The user notices nothing of all that, she/he just initializes the different engines with Engine(..., engine_run_started_ref=state.state_run_started_ref, ...) and finds all engines listed in state.engines. The parallel/distributed state object state.engines.state_status.state_run_started will trigger all engines appended to its callbacks in a multiprocessing.Pool (or ThreadPool) - everything is parallelized. In this case you will need tons of features that can handle multiple engines in parallel. Can you implement that in current Ignite? In the framework this is not implemented yet either, but it's just a matter of doing it... you could also just switch parallel/distributed computing on/off with a default switch in state.configs.default_configs.parallel_processing = True.
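To make the idea a bit more concrete, here is a purely hypothetical sketch of such a pool-backed callback container (all names are invented; nothing of this exists in Ignite or the framework yet):

```python
from multiprocessing.pool import ThreadPool


class StateThreadPoolVariable:
    """Hypothetical callback container that fires all appended callbacks in a thread pool."""

    def __init__(self, n_workers: int = 4):
        self._callbacks = []              # list of (description, fn, kwargs) tuples
        self._pool = ThreadPool(n_workers)

    def append(self, description, fn, kwargs=None):
        self._callbacks.append((description, fn, kwargs or {}))

    def trigger(self):
        # e.g. several `engine.run(...)` calls appended as callbacks would now run
        # concurrently instead of sequentially
        results = [self._pool.apply_async(fn, kwds=kwargs) for _, fn, kwargs in self._callbacks]
        for result in results:
            result.get()                  # wait for completion and propagate exceptions
```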

The nice thing about the metaclass architecture is that you can theoretically rewrite everything before starting the engines without messing with the actual architecture. You can isolate tools like that from the rest in state plugins.

vfdev-5 commented 4 years ago

Actually, I just realized that our more advanced examples also have 3 engines. For example:

* https://github.com/pytorch/ignite/blob/master/examples/references/segmentation/pascal_voc2012/code/scripts/common_training.py#L87

In our case, similarly to yours, the 3 engines are trainer, evaluator on validation and evaluator on the training subset. The two evaluators cannot simply be merged into one, as the validation evaluator has more handlers: save best model on validation score, early stopping etc...

following is an API you could not implement in current Ignite without using another_engine and a_third_engine, because features normally only reference one engine, or am I wrong?

This feature is a shortcut for global_step_transform, which can be configured as the user would like and with any number of engines...

Parallel/Distributed computing

Well, sorry, it is a bit difficult to follow your explanations without context. In my case, DDP works OK for 3 engines as in the above example using torch.distributed.launcher, such that the whole code is running in a process and the only synchronization on Ignite's side should be done in metrics (reduction across the processes of a metric's ingredients).

I agree that if we try to code process splitting inside a single python script spawning engines, metrics could be an issue as stated in this issue: https://github.com/pytorch/ignite/issues/623
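For context, a minimal sketch of the kind of metric synchronization meant here - summing a metric's "ingredients" across DDP processes before computing the final value (the helper name is made up; assumes torch.distributed is already initialized):

```python
import torch
import torch.distributed as dist


def all_reduce_accuracy(num_correct: int, num_examples: int) -> float:
    # hypothetical helper: sum the per-process counts over all DDP processes,
    # then compute the metric from the reduced ingredients
    t = torch.tensor([num_correct, num_examples], dtype=torch.float64)  # use a CUDA tensor with the NCCL backend
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return (t[0] / t[1]).item()
```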

DrStoop commented 4 years ago

Okay, let me just say my thoughts on this:

1.) I agree that the state must not be transient if it does not hold things like the model and the optimizer. But we have to reset it properly.

Correct, there are many ways to easily reset. With resetting in mind I implemented StateObject.__delete__ to reset the current value to its initial_value. (Also because deleting a class descriptor from its class just does not make sense at all.)

In general you can implement it in 2 ways:

from ignite_framework.transitions import BaseTransition
from ignite_framework.feature_dev_tools.state_objects import StateVariable


class NewFeatureWithResetEachStateRun(BaseTransition):

    a_hyperparam = StateVariable(initial_value=0.42)

    def __init__(self, name):
        super().__init__(name=name)

        def reset(param_ref):
            del param_ref.caller_name

        state.engines.state_status.state_run_completed = ('reset `a_hyperparam`', reset, [self.a_hyperparam_ref])


if __name__ == '__main__':

    NewFeatureWithResetEachStateRun(name='demo_resetter')

    # Change hyperparameter value (Note: shortcut for `state.transitions.demo_resetter.a_hyperparam = 0.234...`)
    state.demo_resetter.a_hyperparam = 0.234426523

    print('Before state run completed: ' + str(state.demo_resetter.a_hyperparam))

    # End state run (Note: shortcut for `state.engines.state_status.state_run_completed = True`)
    state.state_run_completed = True

    print('After state run completed: ' + str(state.demo_resetter.a_hyperparam))
The debugger overview then looks like this:

![reset_feature_in_debugger](https://user-images.githubusercontent.com/19177740/75461096-ef196800-598a-11ea-8fc6-924263bae5c1.png)

The console output:
```python
Before state run completed: 0.234426523
After state run completed: 0.42
```

2.) The point of splitting up the state is that in your run not all engines might always be related to one model. This is why we are aiming for most flexibility and thus the most local scope possible.

Which engine uses which model is defined by the process function, which again is unique for each engine in both current Ignite & the framework. So this is unrelated to the question of splitting up the state.
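As a minimal illustration with plain Ignite (the dummy step functions are made up; the point is only that each engine gets its own process function):

```python
from ignite.engine import Engine

def train_step(engine, batch):
    # would use model + optimizer here; this ties the trainer to a particular model
    return {"loss": 0.0}

def eval_step(engine, batch):
    # would run the same (or another) model in eval mode here
    return {"y_pred": None, "y": None}

trainer = Engine(train_step)    # which model an engine touches is decided by its process function,
evaluator = Engine(eval_step)   # not by how the state is organized
```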

In fact, if anything, Ignite could be termed too "chunky" regarding its referencing method. This is why:

Calling a state object's value with the default name caller_name:

nicely_sliced_state_object_reference.caller_name
Out[3]: False

Full functionality of the original state object is transferred to the reference object

nicely_sliced_state_object_reference.caller_name = True
nicely_sliced_state_object_reference.caller_name
Out[5]: True
nicely_sliced_state_object_reference.caller_name_callbacks
Out[6]: [('start timer state.state_timer.state_run_started_in_sec/min/h', <bound method StateTimer.start_timer of <ignite_framework.states.state_plugins.state_timer.StateTimer object at 0x7f4921b726a0>>, [], {'timer_name': 'state_run_started'}), ('syncstate.configs_status.config_run_started', <function ... # and so on

3.) We can have this scope as you propose, if we will have some stuff (like the data loaders) wrapped with weak refs.

Good point!

Regarding 4. and 5.) I would really postpone this until we decided and implemented the parts 1.)-3.) Since this will most likely change otherwise.

DrStoop commented 4 years ago

Actually, I just realized that our more advanced examples also have 3 engines. For example,

* https://github.com/pytorch/ignite/blob/master/examples/references/segmentation/pascal_voc2012/code/scripts/common_training.py#L87

In our case, similarly to yours, the 3 engines are trainer, evaluator on validation and evaluator on the training subset. The two evaluators cannot simply be merged into one, as the validation evaluator has more handlers: save best model on validation score, early stopping etc...

following is an API you could not implement in current Ignite without using another_engine and a_third_engine, because features normally only reference one engine, or am I wrong?

This feature is a shortcut for global_step_transform which can be configured as user would like and with any number of engines...

okay, is this what global_step_transform is doing?

Comparison with the Ignite framework:

Now compare the current Ignite:

# current Ignite
tb_logger = TensorboardLogger(log_dir="experiments/tb_logs")

def global_step_transform(*args, **kwargs):
    return trainer.state.iteration

tb_logger.attach(evaluator,
                 log_handler=OutputHandler(tag="validation",
                                           metrics=["nll", "accuracy"],
                                           global_step_transform=global_step_transform),
                 event_name=Events.EPOCH_COMPLETED)

And here the Ignite framework:

# Ignite framework
ScalarChart(x_axis_ref=state.trainer.n_iteration_completed_ref,
            y_axis_ref=state.evaluator_accuracy.output_ref,
            caller_refs=state.evaluator.engine_run_completed,
            walltime_ref=state.state_run_started_time_in_h_ref)

Difference noticeable? In the framework I added an effective runtime of the state as walltime, which is not implemented in current Ignite.

Parallel/Distributed computing

Well, sorry, it is a bit difficult to follow your explanations without context. In my case, DDP works OK for 3 engines as in the above example using torch.distributed.launcher, such that the whole code is running in a process and the only synchronization on Ignite's side should be done in metrics (reduction across the processes of a metric's ingredients).

ah, I'll check... totally agree on not implementing anything already existing in PyTorch!

I agree that if we try to code proces splitting inside a single python script spawning engines, metrics could be an issue as stated in this issue : #623