pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai

Training State Centered Framework vs. Engine Centered Architecture #810

Open DrStoop opened 4 years ago

DrStoop commented 4 years ago

Hi!

I was thinking for some time about the current architecture of Ignite and how to improve the workflow, both during application development (writing training scripts) and during feature development. When writing training scripts I could not code the individual training I wanted, and when trying to code the feature for my training script I ended up writing infrastructure/architecture code instead of implementing the feature. There were some kinds of restrictions I at first couldn't identify...

Nevertheless, after a while I found 2 twists, actually nothing big... so here I want to come up with an under-the-hood framework to integrate into Ignite that solved all my problems and many open issues! Nice, right?

With this framework integrated into Ignite you get an extremely nice overview during debugging, turn Ignite into a rapid feature-dev tool, can handle far more complex (individual) use cases while achieving a higher degree of automation at the same time, and get quite some new features and many more possibilities for syntactic sugar.

But now comes the... BUT: as I tried to fix it, I unfortunately had to realize it won't work without major revisions. So I went a long way to really provide proof and facts - something you can play with to make up your opinion - before daring to suggest a major revision...

So, if you're interested in an up-and-running "what-if-when"-Ignite version with the 2 twists below untwisted, please have a look at the repository and the documentation and leave me feedback - I'd really like to know your opinion.

In case enough of you like it and could imagine integrating the framework into Ignite, I could open a pull request with the code on an experimental branch and we see how it goes from there. (Note: I just pushed it to another repository because, as far as I know, you cannot open a pull request for a new branch - which this definitely needs.)

Everything else you need to know you will find in the Ignite Framework repo and the docs. For bugs & questions, let me know, thx!

So, set up your first coffee & enjoy playing!

Teaser from the documentation

Two issues

I am a fan of Ignite and that's why I'm trying to contribute, but I discovered 2 shortcomings in the architecture and the implementation that caused me quite some restrictions and had me coding infrastructure instead of programming new features (what I actually wanted to do). The issues are:

Improvements from an underlying framework

You will experience the improvements given by the framework when working on all 3 levels: application implementation, feature development and framework development. The separation of these working areas is already the first improvement. Try out the benefits in detail & hands-on for the first two levels in the Quickunderstanding Application and Quickunderstanding Feature Dev.

Before you go through the theoretically described enhancements, these few no-comment teasers of the training state in the debugger will give you nice insights into what's ahead. They show the Ignite example mnist_with_tensorboard.py transferred to the framework architecture, just before the engines are started:

teaser_state_in_debugger

And here the engine state.engines.trainer unfolded:

teaser_engine_in_debugger

Or setting up all the below Tensorboard charts with these two simple commands:

# Automatically identify and generate metric charts comparing the different engines
EnginesMetricsComparisonCharts(x_axis_ref=state.trainer.n_samples_ref, n_identical_metric_name_suffixes=1)
# Automatically generate for each engine a summary of all metric charts
EnginesMetricsCharts(x_axes_refs=state.trainer.n_samples_ref, n_identical_metric_name_suffixes=1)

By the way, if you had set up 10x more metrics and some more engines, these two commands would not change and would still provide all comparative and single metric charts of all engines.

teaser_tensorboard

Soooo, if you're interested, then grab a coffee and press >>>PLAY<<<!

justusschock commented 4 years ago

Hi,

First of all: thank you very much for your work, this looks awesome!

I had a look at your code and what you did there is impressive. I have a few things to comment on:

1.) You are right that it has some restrictions using a transient state, and I often had to implement workarounds as well, but on the other hand, I really don't like the idea of a state that contains virtually everything. That way, your scope would be way too global and it will become hard to track which variable is used and set where (for the devs as well as for the users). So IMO we should opt for more entries in the state, but it should definitely not contain everything.

2.) Regarding the events: You're right, basically every value change could trigger some event. This would, however, require lots of implicit events, and for most cases you probably don't need them. In your training loop you usually just have a few necessary events (Start, End, Epoch Start/End, Iteration Start/End...) which are already predefined (a minimal sketch follows below). I agree that one could probably add some more, but for 99% of the cases these are enough. And this also follows Python's core philosophy that explicit is better than implicit.

3.) The engine is supposed to be the trainer and validator (with the respective arguments) but should not hold them. It is supposed to be a kind of general-purpose interface with a tested way to do the looping.
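For reference, a minimal sketch of the predefined-events API point 2 refers to, using the standard ignite Engine (the dummy train_step and the toy data are made up for illustration):

```python
from ignite.engine import Engine, Events

def train_step(engine, batch):
    return 0.0  # dummy update step standing in for a real training iteration

trainer = Engine(train_step)

@trainer.on(Events.EPOCH_COMPLETED)  # one of the few predefined events mentioned above
def log_epoch(engine):
    print(f"epoch {engine.state.epoch} finished at iteration {engine.state.iteration}")

trainer.run([0, 1, 2], max_epochs=2)
```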

In general I really like your points (maybe not each part of the implementation, but that's okay I guess). So what I would ultimately propose:

We wait for the opinions of the other guys (cc @vfdev-5 @ykumards @anmolsjoshi ) and if they are okay, I would suggest to start with the actual implementation in the following way:

I) We (you and me together) will start implementing a richer state system, which may be less transient and hold more entries, but still won't be global

II) If we all agreed on that state, we will probably have a look at where we want to add additional hooks for events and if we somehow can simplify the process of registering them.

III) After this is done, we will most likely have to revisit your framework and look, what's still missing from there and if it makes sense to actually include this here.

For all of this I would suggest to add this in parallel to the existing interface to avoid breaking changes and to be consistent with the current API wherever we can. What do you think @DrStoop ?

vfdev-5 commented 4 years ago

@DrStoop thanks for such a huge feature request! There are a lot of points raised, a lot of text/code to understand, but I think it's worth the effort!

I'm trying to figure out the two issues you stated about the current implementation: 1) Engine centered vs State centered 2) Ignite's Events vs variable-triggered events

Concerning "Engine centered vs State centered", well, I understand your point in some sense, but state with command run looks semantically weird... I totally agree that a "training" process is about state evolution and for me it is Engine who drives this intead of the object itself... Ultimately, maybe, we need to have something that could build upon multiple engines...

I found this link : https://drstoop.github.io/ignite_framework/index.html#resulting-ignite-limitations helpful to understand your points.

I experienced limitations when working with more complex use cases, e.g. multiple engines (up to 3 very common) with customized events and metrics

Could you provide more details or a code snippet for this please?

PS: If you are in the pytorch slack, we can discuss this more fluently in the #pytorch-ignite channel...

DrStoop commented 4 years ago

hi @vfdev-5,

didn't expect such a fast reply ;) thx! although i was totally committed to ignite & its concepts & features, there's still quite some difference which may take a bit to fully grasp.

but you're faster than me it seems, your first point, state.run, is already a good point i also stumbled over. regarding the full concept i left it like that for the moment. issues like that are easily changed, i added it to my list.

there are definitely a lot of details & names and features that for sure don't fit yet, but for the moment it's mostly about the main concept's usability. ah, plus, i just wanted to push the project (as incomplete as it is) to avoid total organizational blindness ;)

i'll catch up on this pytorch slack in a few hours & provide you the requested stuff :)

DrStoop commented 4 years ago

Hi @justusschock,

man, you guys are really fast! I was hoping to get a few hours of downtime for that much text and code ;-)

You hit the points I was also struggling with, especially with 1.): you're pointing out this global state I really didn't feel good about in the beginning! But in the end, I set those worries aside. Find some of my thoughts below; that's definitely the main discussion point of this concept, with pros & cons!

Your proceeding plan sounds good, but there is one thing I would avoid based on this PR, i.e. starting to change Ignite or integrating this PR in any way too early: there is one question in this PR that has to be answered before proceeding with it:

Integrate framework or stick to current architecture?

This suggested framework may have a hidden show stopper I have overlooked, or be a bad proposal for the current Ignite. But if we start coding in one direction without answering this question first, we may regret having wasted our time ;)

Personally, I implemented this framework based on 2 examples and learned quite a lot when I transferred the other existing Ignite examples. Even though I know the code, I wanted to go for some applications first to compare it with current Ignite in different use cases (I even lost track of Ignite's latest updates!).

I would recommend you try to code a training script or transfer a metric or another feature of current Ignite to this framework; that should already give a good insight. Especially ask me all those questions the missing documentation leaves open and raise complaints about what su@!s. Maybe there's a solution already implemented for that, or I'll learn a lesson.

One way or the other - the discussion is on! So find my comments below :)

Hi,

First of all: thank you very much for your work, this looks awesome!

I had a look at your code and what you did there is impressive. I have a few things to comment on:

1.) You are right that it has some restrictions using a transient state, and I often had to implement workarounds as well, but on the other hand, I really don't like the idea of a state that contains virtually everything. That way, your scope would be way too global and it will become hard to track which variable is used and set where (for the devs as well as for the users). So IMO we should opt for more entries in the state, but it should definitely not contain everything.

Right point! My thoughts about it, in bullets, were:

2.) Regarding the events: You're right, basically every value change could trigger some event. This would, however, require lots of implicit events, and for most cases you probably don't need them. In your training loop you usually just have a few necessary events (Start, End, Epoch Start/End, Iteration Start/End...) which are already predefined. I agree that one could probably add some more, but for 99% of the cases these are enough. And this also follows Python's core philosophy that explicit is better than implicit.

I agree on this point and that's why I have already implemented the solution for it:

3.) The engine is supposed to be the trainer and validator (with the respective arguments) but should not hold them. It is supposed to be a kind of general-purpose interface with a tested way to do the looping.

In general I really like your points (maybe not each part of the implementation, but that's okay I guess). So what I would ultimately propose:

We wait for the opinions of the other guys (cc @vfdev-5 @ykumards @anmolsjoshi ) and if they are okay, I would suggest to start with the actual implementation in the following way:

I) We (you and me together) will start implementing a richer state system, which may be less transient and hold more entries, but still won't be global

II) If we all agreed on that state, we will probably have a look at where we want to add additional hooks for events and if we somehow can simplify the process of registering them.

III) After this is done, we will most likely have to revisit your framework and look, what's still missing from there and if it makes sense to actually include this here.

For all of this I would suggest to add this in parallel to the existing interface to avoid breaking changes and to be consistent with the current API wherever we can.

As mentioned above, I am fine with any outcome, but I suggest a good discussion based on playing with the framework and comparing it with current Ignite, then deciding how much & what to integrate from the framework, then going for it!

What do you think @DrStoop ?

Please ask stuff or let me know about your issues. Does everything work? Bugs?

thx & greets!

DrStoop commented 4 years ago


hey, didn't know pytorch-slack is restricted to certain company emails. i'm not working for any of them. (...& none of them belong to me either ;-) )

justusschock commented 4 years ago

I really think the best way is to move forward and start implementing parts of your proposals. I'm sure we'll encounter some problems during that, but I think this is the only way to really get an overview here. Regarding the namespace: I think we should carefully think about what we're adding. So I would not like to add everything just because we can. But I agree that some stuff would be beneficial. But I think these are too many details to discuss them all without an implementation proposal.

And this is also why I suggested the implementation order above: we start with the easy things that would most likely be a good idea and move on to the things we need to discuss based on these implementation details.

DrStoop commented 4 years ago

Okay @justusschock, this is for sure a way to get through it. I'll try to help set up an implementation plan by sorting your points (and maybe adding a point here & there) into an order to implement, starting with the major changes:

  1. Should state be transient?
  2. state vs. Engine.state scope?
  3. What's the right scope for a state?
  4. "Stick Events together"
  5. 1-on-1 Feature-for-Task

1. Should state be transient?

Is there a real reason to let Engine.state disappear? It's instantiated with no dynamic values and doesn't depend on anything special, as far as I understand:

    # Snippet from ignite.engine.Engine
    def load_state_dict(self, state_dict: Mapping) -> None:
        """Setups engine from `state_dict`.
        ...
        """
        ...
        self.state = State(
            seed=state_dict["seed"],
            max_epochs=state_dict["max_epochs"],
            epoch_length=state_dict["epoch_length"],
            metrics={},
        )

Possible implementation: This may be something that could be changed quickly, right? I cannot find any pros for letting state disappear, or am I missing something? Just initialize self.state in Engine.__init__() and get rid of all the if engine.state is None: checks everywhere.
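A rough sketch of what that could look like (not the actual Ignite code; the simplified Engine below is only to illustrate the idea):

```python
from typing import Callable

from ignite.engine import State


class Engine:  # simplified stand-in for ignite.engine.Engine, just for illustration
    def __init__(self, process_function: Callable):
        self._process_function = process_function
        # state exists from construction on (instead of being created lazily in run()
        # or load_state_dict()), so no `if engine.state is None:` checks are needed
        self.state = State(seed=None, max_epochs=None, epoch_length=None, metrics={})
```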

2. state vs. Engine.state scope?

There are 2 aspects involved:

Whatever state scope we discuss in the next point, here we split it by 2 or 3 (or...), depending on the number of engines. The current Ignite Engine.state scope actually also includes all attributes of Engine; the only difference is whether they disappear or not:

Possible implementations regarding above:

3. What's the right scope for a state?

Possible implementations regarding above:

4. "Stick Events together"

Possible implementations regarding above:

Puh!!! It's hard...ideas?

5. 1-on-1 Feature-for-Task

Entangling multiple features in one class causes code redundancies with all its bad consequences:

Possible implementations regarding above:

Split all features so they only provide exactly one Ignite-task:

Conclusion

So above are just some possible improvements extracted from the framework to implement in the current Ignite architecture. When you think about implementing some of them, this will affect more or less all classes and definitely will not be straightforward! On the other hand, the proposed framework solves all these problems already (and many more) and transferring classes is really straightforward, often simplifying the feature code... So above are the first improvement proposals, let's think about it :-)

vfdev-5 commented 4 years ago

hey, didn't know pytorch-slack is restricted to certain company emails. i'm not working for any of them. (...& none of them belong to me either ;-) )

@DrStoop please give me your email and I'll try to get you an invitation

vfdev-5 commented 4 years ago

Let's try to discuss about each point separately:

1. Should state be transient?

Why is state None at the beginning and reset at each run? Related questions: https://github.com/pytorch/ignite/issues/297 and https://github.com/pytorch/ignite/pull/640

This is "historiacal" and sort of relies on a pattern we would like to adopt. Currently, we have the following logic:

- At the first call, new state is defined by `max_epochs`, `epoch_length`, `seed` if provided.
- If state is already defined such that there are iterations to run until `max_epochs` and no input arguments provided, state is kept and used in the function.
- If state is defined and engine is "done" (no iterations to run until `max_epochs`), a new state is defined.
- If state is defined, engine is NOT "done", then input arguments if provided override defined state.

This covers the cases:

As of today we added the logic to set up the engine's state using load_state_dict, so maybe we can reconsider Engine.state = None at __init__. However, we should not break the covered use cases.
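Rough pseudocode of that resolution logic (just a sketch of the rules above, not the actual implementation; `_internal_run` is a made-up name):

```python
def run(self, data, max_epochs=None, epoch_length=None, seed=None):
    args_provided = any(v is not None for v in (max_epochs, epoch_length, seed))

    if self.state is None:
        # first call: build a fresh state from the provided arguments
        self.state = State(seed=seed, max_epochs=max_epochs, epoch_length=epoch_length, metrics={})
    elif self.state.epoch >= self.state.max_epochs:
        # engine is "done" (no iterations left until max_epochs): define a new state
        self.state = State(seed=seed, max_epochs=max_epochs, epoch_length=epoch_length, metrics={})
    elif args_provided:
        # engine is NOT done and arguments were provided: they override the kept state
        self.state.max_epochs = max_epochs if max_epochs is not None else self.state.max_epochs
        self.state.epoch_length = epoch_length if epoch_length is not None else self.state.epoch_length
        self.state.seed = seed if seed is not None else self.state.seed
    # otherwise: keep the existing state as-is and resume

    return self._internal_run(data)
```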

vfdev-5 commented 4 years ago

@DrStoop ping on

I experienced limitations when working with more complex use cases, e.g. multiple engines (up to 3 very common) with customized events and metrics

Could you provide more details or a code snippet for this please?

justusschock commented 4 years ago

Okay, let me just say my thoughts on this:

1.) I agree that the state must not be transient if it does not hold things like the model and the optimizer. But we have to reset it properly.

2.) The point of splitting up the state is that in your run not all engines might always be related to one model. This is why we are aiming for most flexibility and thus the most local scope possible.

3.) We can have the scope you propose if we have some stuff (like the data loaders) wrapped with weak refs (minimal sketch below).

Regarding 4. and 5.) I would really postpone this until we decided and implemented the parts 1.)-3.) Since this will most likely change otherwise.
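As a minimal illustration of the weak-ref idea in point 3 (purely a sketch; `DummyLoader` and the bare `state` namespace are made up):

```python
import weakref
from types import SimpleNamespace


class DummyLoader:  # stand-in for a torch DataLoader (plain objects support weak references)
    def __iter__(self):
        return iter([("x0", "y0"), ("x1", "y1")])


train_loader = DummyLoader()
state = SimpleNamespace()                     # stand-in for a richer, less transient state
state.dataloader = weakref.ref(train_loader)  # the state holds only a weak reference,
                                              # so it does not keep the loader alive by itself

loader = state.dataloader()                   # dereference; returns None once the loader
if loader is not None:                        # has been garbage collected elsewhere
    for batch in loader:
        print(batch)
```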

DrStoop commented 4 years ago

Sure, my email: berr@gmx.de, real name if required: Maximilian Berr.... thx @vfdev-5 !

give me a sec on the multi engine snippets... coming up

vfdev-5 commented 4 years ago

@DrStoop I sent a request to send you an invitation. Probably, we need to wait until the morning in US/NYC zone...

DrStoop commented 4 years ago

@DrStoop ping on

I experienced limitations when working with more complex use cases, e.g. multiple engines (up to 3 very common) with customized events and metrics

Could you provide more details or a code snippet for this please?

@vfdev-5 regarding 3 engines:

A simple example for the metrics of 3 engines compared in one chart per metric type: mnist_with_tensorboard_logger_and_low_level_apis.py

You have 3 engines state.trainer (check state.engines.trainer), state.xvalidator & state.evaluator with different process functions and dataloaders (and, if you wanted, also different metrics, but here they're the same):

Snippet from example:

### ENGINES ###
#==============

# Trainer
# NOTE:
# - Below the maximum setting of default values is used.
# - The `Engine` is assigned to `state.engines.trainer`.
# - The argument `engine_run_started_ref=None` means that the calling state object is not set, so `Engine` will
#   NOT be called. (Default values are always placeholdered by ``''``)
Engine(name='trainer',
       process=state._update, dataloader=state.trainer_loader,
       engine_run_started_ref=state.state_run_started_ref)

# X-Validator
# NOTE:
# - To parametrize the engine starting callback use:
#   `engine_run_started_ref=state.trainer.get('n_samples_every', state.n_xval_step_samples, 'ref')`
Engine(name='xvalidator',
       process=state._infer,
       dataloader=state.xvalidator_loader,
       engine_run_started_ref=state.trainer.get('n_samples_every', state.n_xval_step_samples, 'ref'),
       n_samples=state.n_xval_samples)

# Evaluator
# NOTE: Here is demonstrated how arguments can also be skipped during initialization and provided afterwards,
#       namely `dataloader` and `engine_run_started_ref`
Engine(name='evaluator',
       process=state._infer,
       n_epochs=state.n_eval_epochs)
# Setting up `state.evaluator`'s dataloader and callback after initialization
state.trainer.engine_run_completed = ('start manually appended `state.evaluator.run` with `state.evaluator_loader`',
                                      state.evaluator.run, {'dataloader': state.evaluator_loader})

Now, the following is an API you could not implement in current Ignite without using another_engine and a_third_engine, because features normally only reference one engine, or am I wrong? For the comparative multi-graph charts in this script you'll find (snippet from the example):

# Accuracy charts of all engines, 3 metrics in 1 chart
# NOTE: Here the maximum default values are used, setting as little as necessary
# EXTRA NOTE: Leaving out the `chart_name` in charts with many `y_axes_refs` usually leads to very longish chart names
ScalarsChart(x_axis_ref=state.trainer.n_samples_ref,
             y_axes_refs=[state.get(trainer_accuracy_name).output_ref,
                          state.get(xvalidator_accuracy_name).output_ref,
                          state.get(evaluator_accuracy_name).output_ref],
             caller_refs=[state.get(trainer_accuracy_name).output_ref,
                          state.xvalidator.engine_run_completed_ref,
                          state.evaluator.engine_run_completed_ref])

# Loss charts of all engines, 3 metrics in 1 chart
# NOTE:
# - Here maximum manual individualization is used; some arguments are redundant in the sense that they
#   equal (or behave equally to) the default value, e.g. the `caller_refs` would result in the same chart.
# - The `walltime_ref` uses the state timer `state.maintenance.state_timer.state_run_started_time_in_sec`
ScalarsChart(x_axis_ref=state.trainer.n_samples_ref,
             y_axes_refs=[state.get(trainer_loss_name).output_ref,
                         state.get(xvalidator_loss_name).output_ref,
                         state.get(evaluator_loss_name).output_ref],
             caller_refs=[state.trainer.n_samples_every_100_ref,
                          state.xvalidator.engine_run_completed_ref,
                          state.evaluator.engine_run_completed_ref],
             y_names=['trainer_loss','xvalidator_loss', 'evaluator_loss'],
             chart_name='engines_loss_comparison',
             bso_ctr_name_suffix='manually_set_chart',
             summary_writer=state.tensorboard.summary_writer,
             summary_description='manually written description for summary adding',
             walltime_ref=state.state_run_started_time_in_sec_ref)

This results in the Tensorboard charts posted above (where I should mention that the upper charts evaluator_trainer_xvalidator/accuracy & loss do not yet plot the evaluator graph because I stopped the engines before it was calculated). Note: the high-level code line below replaces all the above ScalarsChart()s, and would do so for any number of engines and metrics (snippet from here):

    # Automatically identify and generate metric charts comparing the different engines
    EnginesMetricsComparisonCharts(x_axis_ref=state.trainer.n_samples_ref, n_identical_metric_name_suffixes=1)

Now, you may argue that state.xvalidator and state.evaluator have the same process function & could just be fed with different dataloaders. Yes, that's right, and you can also do that in the framework. Still you have 2 engines to deal with in an API for comparative metric charts. And normally you have different processes attached to state.evaluator and state.xvalidator, i.e. state.xvalidator's loss metric may trigger the early stopping and model checkpointing and have different metrics compared to the final state.evaluator that only calculates the end result with a loss metric.

Parallel/Distributed computing

When thinking about complex use cases, one idea for simple parallel/distributed computing would be to implement all required infrastructure into, let's say, a StateProcessPoolVariable that replaces the current (slightly modified) callbacks list with a multiprocessing.Pool (or a StateThreadPoolVariable for multiprocessing.pool.ThreadPool). You use this StateProcessPoolVariable for state.state_run_started and append all kinds of engines training in parallel. The user notices nothing of all that, she/he just initializes the different engines with Engine(..., engine_run_started_ref=state.state_run_started_ref, ...) and finds all engines listed in state.engines. The parallel/distributed state object state.engines.state_status.state_run_started will trigger all engines appended to its callbacks in a multiprocessing.Pool (or ThreadPool) - everything is parallelized. In this case you will need tons of features that can handle multiple engines in parallel. Can you implement that in current Ignite? In the framework this is not implemented yet either, but it's just a matter of doing it... you could also just switch parallel/distributed computing on/off with a default switch in state.configs.default_configs.parallel_processing = True.
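To make the idea a bit more concrete, here is a purely hypothetical sketch of such a pool-backed callback container (all names are invented; nothing of this exists in Ignite or the framework yet):

```python
from multiprocessing.pool import ThreadPool


class StateThreadPoolVariable:
    """Hypothetical callback container that fires all appended callbacks in a thread pool."""

    def __init__(self, n_workers: int = 4):
        self._callbacks = []              # list of (description, fn, kwargs) tuples
        self._pool = ThreadPool(n_workers)

    def append(self, description, fn, kwargs=None):
        self._callbacks.append((description, fn, kwargs or {}))

    def trigger(self):
        # e.g. several `engine.run(...)` calls appended as callbacks would now run
        # concurrently instead of sequentially
        results = [self._pool.apply_async(fn, kwds=kwargs) for _, fn, kwargs in self._callbacks]
        for result in results:
            result.get()                  # wait for completion and propagate exceptions
```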

The nice thing about the metaclass architecture is that you can theoretically rewrite everything before starting the engines without messing with the actual architecture. You can isolate tools like that from the rest in state plugins.

vfdev-5 commented 4 years ago

Actually, I just realized that our more advanced examples also have 3 engines. For example:

* https://github.com/pytorch/ignite/blob/master/examples/references/segmentation/pascal_voc2012/code/scripts/common_training.py#L87

In our case, similarly to yours, the 3 engines are trainer, evaluator on validation and evaluator on the training subset. The two evaluators cannot simply be merged into one, as the validation evaluator has more handlers: save best model on validation score, early stopping etc...

following is an API you could not implement in current Ignite without using another_engine and a_third_engine, because features normally only reference one engine, or am I wrong?

This feature is a shortcut for global_step_transform, which can be configured as the user would like and with any number of engines...

Parallel/Distributed computing

Well, sorry, it is a bit difficult to follow your explanations without context. In my case, DDP works OK for 3 engines as in the above example using torch.distributed.launcher, such that the whole code is running in a process and the only synchronization on Ignite's side should be done in metrics (reduction across the processes of a metric's ingredients).

I agree that if we try to code process splitting inside a single python script spawning engines, metrics could be an issue as stated in this issue: https://github.com/pytorch/ignite/issues/623
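For context, a minimal sketch of the kind of metric synchronization meant here - summing a metric's "ingredients" across DDP processes before computing the final value (the helper name is made up; assumes torch.distributed is already initialized):

```python
import torch
import torch.distributed as dist


def all_reduce_accuracy(num_correct: int, num_examples: int) -> float:
    # hypothetical helper: sum the per-process counts over all DDP processes,
    # then compute the metric from the reduced ingredients
    t = torch.tensor([num_correct, num_examples], dtype=torch.float64)  # use a CUDA tensor with the NCCL backend
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return (t[0] / t[1]).item()
```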

DrStoop commented 4 years ago

Okay, let me just say my thoughts on this:

1.) I agree that the state must not be transient if it does not hold things like the model and the optimizer. But we have to reset it properly.

Correct, there are many ways to easily reset. With resetting in mind I implemented StateObject.__delete__ to reset the current value to its initial_value. (Also because deleting a class descriptor from its class just does not make sense at all.)

In general you can implement it in 2 ways:

from ignite_framework.transitions import BaseTransition
from ignite_framework.feature_dev_tools.state_objects import StateVariable


class NewFeatureWithResetEachStateRun(BaseTransition):

    a_hyperparam = StateVariable(initial_value=0.42)

    def __init__(self, name):
        super().__init__(name=name)

        def reset(param_ref):
            del param_ref.caller_name

        state.engines.state_status.state_run_completed = ('reset `a_hyperparam`', reset, [self.a_hyperparam_ref])


if __name__ == '__main__':

    NewFeatureWithResetEachStateRun(name='demo_resetter')

    # Change hyperparameter value (Note: shortcut for `state.transitions.demo_resetter.a_hyperparam = 0.234...`)
    state.demo_resetter.a_hyperparam = 0.234426523

    print('Before state run completed: ' + str(state.demo_resetter.a_hyperparam))

    # End state run (Note: shortcut for `state.engines.state_status.state_run_completed = True`)
    state.state_run_completed = True

    print('After state run completed: ' + str(state.demo_resetter.a_hyperparam))
The debugger overview then looks like this:

![reset_feature_in_debugger](https://user-images.githubusercontent.com/19177740/75461096-ef196800-598a-11ea-8fc6-924263bae5c1.png)

The console output:
```python
Before state run completed: 0.234426523
After state run completed: 0.42
```

2.) The point of splitting up the state is that in your run not all engines might always be related to one model. This is why we are aiming for most flexibility and thus the most local scope possible.

Which engine uses which model is defined by the process function, which again is unique for each engine in both current Ignite & the framework. So this is unrelated to the question of splitting up the state.
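As a minimal illustration with plain Ignite (the dummy step functions are made up; the point is only that each engine gets its own process function):

```python
from ignite.engine import Engine

def train_step(engine, batch):
    # would use model + optimizer here; this ties the trainer to a particular model
    return {"loss": 0.0}

def eval_step(engine, batch):
    # would run the same (or another) model in eval mode here
    return {"y_pred": None, "y": None}

trainer = Engine(train_step)    # which model an engine touches is decided by its process function,
evaluator = Engine(eval_step)   # not by how the state is organized
```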

In fact, if anything, Ignite could be termed too "chunky" regarding its referencing method. This is why:

Calling a state object's value with the default name caller_name:

nicely_sliced_state_object_reference.caller_name
Out[3]: False

Full functionality of the original state object is transferred to the reference object

nicely_sliced_state_object_reference.caller_name = True
nicely_sliced_state_object_reference.caller_name
Out[5]: True
nicely_sliced_state_object_reference.caller_name_callbacks
Out[6]: [('start timer state.state_timer.state_run_started_in_sec/min/h', <bound method StateTimer.start_timer of <ignite_framework.states.state_plugins.state_timer.StateTimer object at 0x7f4921b726a0>>, [], {'timer_name': 'state_run_started'}), ('syncstate.configs_status.config_run_started', <function ... # and so on

3.) We can have this scope as you propose, if we will have some stuff (like the data loaders) wrapped with weak refs.

Good point!

Regarding 4. and 5.) I would really postpone this until we decided and implemented the parts 1.)-3.) Since this will most likely change otherwise.

DrStoop commented 4 years ago

Actually, I just realized that our more advanced examples also have 3 engines. For example,

* https://github.com/pytorch/ignite/blob/master/examples/references/segmentation/pascal_voc2012/code/scripts/common_training.py#L87

In our case, similarly to yours, the 3 engines are trainer, evaluator on validation and evaluator on the training subset. The two evaluators cannot simply be merged into one, as the validation evaluator has more handlers: save best model on validation score, early stopping etc...

following is an API you could not implement in current Ignite without using another_engine and a_third_engine, because features normally only reference one engine, or am I wrong?

This feature is a shortcut for global_step_transform which can be configured as user would like and with any number of engines...

okay, is this what global_step_transform is doing?

Comparison with the Ignite framework:

Now compare the current Ignite:

# current Ignite
tb_logger = TensorboardLogger(log_dir="experiments/tb_logs")

def global_step_transform(*args, **kwargs):
    return trainer.state.iteration

tb_logger.attach(evaluator,
                 log_handler=OutputHandler(tag="validation",
                                           metrics=["nll", "accuracy"],
                                           global_step_transform=global_step_transform),
                 event_name=Events.EPOCH_COMPLETED)

And here the Ignite framework:

# Ignite framework
ScalarChart(x_axis_ref=state.trainer.n_iteration_completed_ref,
            y_axis_ref=state.evaluator_accuracy.output_ref,
            caller_refs=state.evaluator.engine_run_completed,
            walltime_ref=state.state_run_started_time_in_h_ref)

Difference noticeable? In the framework I added an effective runtime of the state as walltime, which is not implemented in current Ignite.

Parallel/Distributed computing

Well, sorry, it is a bit difficult to follow your explanations without context. In my case, DDP works OK for 3 engines as in the above example using torch.distributed.launcher, such that the whole code is running in a process and the only synchronization on Ignite's side should be done in metrics (reduction across the processes of a metric's ingredients).

ah, I'll check... totally agree on not implementing anything already existing in PyTorch!

I agree that if we try to code proces splitting inside a single python script spawning engines, metrics could be an issue as stated in this issue : #623