Open DrStoop opened 4 years ago
Hi,
First of all: thank you very much for your work, this looks awesome!
I had a look at your code and what you did there is impressive. I have a few things to comment on:
1.) You are right, that it has some restrictions using transient state and I often had to implement some workarounds as well, but on the other hand, I really don't like the idea of a state that contains virtually everything. That way, your scope would be way too global and it will become hard to track which variable is used and set where (for the devs as well as for the users). So IMO we should opt for more entries in the state, but it should definitely not contain everything.
2.) Regarding the events: You're right, basically every value change could trigger some event. This would, however, require lots of implicit events, and for most cases you probably don't need them. In your training loop, you usually just have a few necessary events (Start, End, Epoch Start/End, Iteration Start/End...) which are already predefined. I agree, that one could probably add some more, but for 99% of the cases these are enough. And also this follows Python's core philosophy that explicit is better than implicit.
3.) The engine is supposed to be the trainer and the validator (with the respective arguments) but should not hold them. It is supposed to be a kind of general-purpose interface that provides a tested way to do the looping.
In general I really like your points (maybe not each part of the implementation, but that's okay I guess). So what I would ultimately propose:
We wait for the opinions of the other guys (cc @vfdev-5 @ykumards @anmolsjoshi ) and if they are okay, I would suggest to start with the actual implementation in the following way:
I) We (you and me together) will start implementing a richer state system, which may be less transient and hold more entries, but still won't be global
II) If we all agreed on that state, we will probably have a look at where we want to add additional hooks for events and if we somehow can simplify the process of registering them.
III) After this is done, we will most likely have to revisit your framework and look, what's still missing from there and if it makes sense to actually include this here.
For all of this I would suggest to add this in parallel to the existing interface to avoid breaking changes and to be consistent with the current API wherever we can. What do you think @DrStoop ?
@DrStoop thanks for such a huge feature request! There are a lot of points raised, a lot of text/code to understand, but I think it is worth the effort!
I'm trying to figure out two issues you stated about the current implementation: 1) Engine centered vs State centered 2) Ignite's Events vs Variable triggered events
Concerning "Engine centered vs State centered", well, I understand your point in some sense, but a state with a command `run` looks semantically weird...
I totally agree that a "training" process is about state evolution, and for me it is the Engine who drives this instead of the object itself... Ultimately, maybe, we need to have something that could build upon multiple engines...
I found this link : https://drstoop.github.io/ignite_framework/index.html#resulting-ignite-limitations helpful to understand your points.
I experienced limitations when working with more complex use cases, e.g. multiple engines (up to 3 very common) with customized events and metrics
Could you provide more details or a code snippet for this please?
PS: If you are in the pytorch slack, we can discuss this in the #pytorch-ignite channel more fluently...
hi @vfdev-5,
didn't expect such a fast reply;) thx! although i was totally committed to ignite & its concepts & features, there's still quite some difference which may take a bit to fully grasp.
but you're faster than me it seems, your first point, state.run, is already a good point i also stumbled over. regarding the full concept i left it like that for the moment. issues like that are easily changed, i added it to my list.
there are definitely a lot of details & names and features that for sure don't fit yet, but for the moment it's mostly about the main concept usability. ah, plus, i just wanted to push the project (as incomplete as it is) to avoid total organizational blindness ;)
i‘ll catch up on this pyt-slack in a few hours & provide you more requested stuff:)
Hi @justusschock,
man, you guys are really fast! I was hoping to get some hours of downtime for that much text and code ;-)
You hit the points I was also struggling with, especially 1.): you're pointing out this global state, which I really didn't feel good about in the beginning! But in the end I put those worries aside. Find some of my thoughts below; that's definitely the main discussion point of this concept, with pros & cons!
Your plan for proceeding sounds good, but there is one thing I would avoid, i.e. starting to change Ignite or integrate this PR in any way too early. There is one question in this PR that has to be answered before proceeding with it:
Integrate framework or stick to current architecture?
This suggested framework may have a hidden show stopper I have overlooked, or be a bad proposal for the current Ignite. But if we start coding in one direction without answering this question first, we may regret having wasted our time ;)
Personally, I implemented this framework based on 2 examples and learned quite a lot when I transferred the other existing Ignite examples. Even though I know the code, I wanted to go for some applications first to compare it with current Ignite in different use cases (I even lost track of Ignite's latest updates!).
I would recommend you try to code a training script or transfer a metric or another feature of current Ignite to this framework; that should already give a good insight. Especially, ask me all those questions the missing documentation leaves open and raise complaints about what su@!s. Maybe there's a solution already implemented for that, or I'll learn a lesson.
This way or that way - the discussion is on either way! So find my comments below :)
Hi,
First of all: thank you very much for your work, this looks awesome!
I had a look at your code and what you did there is impressive. I have a few things to comment on:
1.) You are right, that it has some restrictions using transient state and I often had to implement some workarounds as well, but on the other hand, I really don't like the idea of a state that contains virtually everything. That way, your scope would be way too global and it will become hard to track which variable is used and set where (for the devs as well as for the users). So IMO we should opt for more entries in the state, but it should definitely not contain everything.
Right point! My thoughts about it in bullets were:

- `state` containers, automatic/manual callback descriptions, & I always have a small smile on my face during debugging, because it's so obvious where to find what :)
- `Engine()`, you look in `state.engines`
- `state.maintenance.dataflow.get_graphs_in_state()` (short: `state.get_graphs_in_state()`). Next would be visualizing the time sequence of the training loop, and so on...

2.) Regarding the events: You're right, basically every value change could trigger some event. This would, however, require lots of implicit events, and for most cases you probably don't need them. In your training loop, you usually just have a few necessary events (Start, End, Epoch Start/End, Iteration Start/End...) which are already predefined. I agree, that one could probably add some more, but for 99% of the cases these are enough. And also this follows Python's core philosophy that explicit is better than implicit.
I agree on this point and that's why I have already implemented the solution for it:

- `StateObject`: has no callback feature and also no type checking. It only integrates in the shortcut & referencing system. The callback functionality is based on the subclass `FrameCallbacksStateObject` (together with `FrameCallbackoverloadsStateObject`), which is part of the framework implementation and not documented yet (ups!). You even have the non-data-descriptor `StateConstant` (which may be inheriting too many subclasses due to a (simple) bug I just realized). So the granularity is immense, cherry-picking is here the task.
- (`StateVariable`, `StateParameter`, `StateDevice`, ...) may be too many in this version, but can easily be reduced (exchange `output = StateVariable()` with `output = StateObject()` - e.g. in `BaseMetric` => already 10% less, and so on, done!).
- Check `ScalarsChart` to get a feeling for it. You have the args `y_axes_refs` and additionally for each `y_axis_ref` you can (optionally) provide a `caller_ref`, i.e. a trigger event. If no `caller_ref` is provided, the value change of each `y_axis_ref` is used independently from the others as log-trigger... plus: log as many different engines as you want... plus: you have `x_axis_ref` which is totally independent from any of the engines you are just logging (plus `walltime_ref`, same game!)... just one example. We'll see what makes sense...

3.) The engine is supposed to be the trainer and the validator (with the respective arguments) but should not hold them. It is supposed to be a kind of general-purpose interface that provides a tested way to do the looping.
In general I really like your points (maybe not each part of the implementation, but that's okay I guess). So what I would ultimately propose:
We wait for the opinions of the other guys (cc @vfdev-5 @ykumards @anmolsjoshi ) and if they are okay, I would suggest to start with the actual implementation in the following way:
I) We (you and me together) will start implementing a richer state system, which may be less transient and hold more entries, but still won't be global
II) If we all agreed on that state, we will probably have a look at where we want to add additional hooks for events and if we somehow can simplify the process of registering them.
III) After this is done, we will most likely have to revisit your framework and look, what's still missing from there and if it makes sense to actually include this here.
For all of this I would suggest to add this in parallel to the existing interface to avoid breaking changes and to be consistent with the current API wherever we can.
As mentioned above, I am fine with any outcome, but I suggest a good discussion based on playing with the framework and comparing it with current Ignite, then decide how much & what to integrate from framework, then go for it!
What do you think @DrStoop ?
Please ask stuff or let me know about your issues. Does everything work? Bugs?
thx & greets!
hey, didn't know pytorch-slack is restricted to certain company emails. i'm not working for any of them. (...& none of them belong to me either;-) )
I really think the best way is to move forward and start implementing parts of your proposals. I'm sure we'll encounter some problems during that, but I think this is the only way to really get an overview here. Regarding the namespace: I think we should carefully think about what we're adding. So I would not like to add everything just because we can. But I agree that some stuff would be beneficial. But I think these are too many details to discuss them all without an implementation proposal.
And this is also why I suggested the implementation order above: we start with the easy things that would most likely be a good idea and move on to the things we need to discuss based on these implementation details.
Okay @justusschock, this is for sure a way to get through it. I'll try to assist setting up an implementation plan by sorting your points (and adding maybe a point here & there) in a way to implement, starting with the major changes:
1. Should `state` be transient?
2. `state` vs. `Engine.state` scope?
3. What belongs in `state`?
4. "Register `Events` together"
5. Entangling multiple features in one class

1. Should `state` be transient?

Is there a real reason to let `Engine.state` disappear? It's instantiated with no dynamic values and doesn't depend on anything special, as far as I understand:
```python
# Snippet from ignite.engine.Engine
def load_state_dict(self, state_dict: Mapping) -> None:
    """Setups engine from `state_dict`.
    ...
    """
    ...
    self.state = State(
        seed=state_dict["seed"],
        max_epochs=state_dict["max_epochs"],
        epoch_length=state_dict["epoch_length"],
        metrics={},
    )
```
Possible implementation: This may be something that could be changed quickly, right? I cannot find any pros for letting `state` disappear, or am I missing something? Just initialize `self.state` in `Engine.__init__()` and get rid of all the `if engine.state is None:` checks everywhere.
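To make that concrete, here is a minimal sketch only (not the actual Ignite implementation; `EagerStateEngine` is just an illustrative name): the state is created eagerly in `__init__`, so the `if engine.state is None:` checks become unnecessary.

```python
from ignite.engine import Engine, State


class EagerStateEngine(Engine):
    """Sketch: an Engine whose state exists from construction on."""

    def __init__(self, process_function):
        super().__init__(process_function)
        # create an (empty) state right away instead of leaving it as None;
        # run() would then update this state in place instead of re-creating it
        self.state = State(max_epochs=None, epoch_length=None, metrics={})
```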
2. `state` vs. `Engine.state` scope?

There are 2 aspects involved:

- What belongs in the `state` scope? This is discussed in point 3 (not here in 2.).
- How the `state` scope is split by the number of engines (that's what this point is about).

Whatever `state` scope we discuss in the next point, here we split it by 2 or 3 (or...), depending on the number of engines. The current Ignite `Engine.state` scope actually also includes all attributes of `Engine`; the only difference is whether they disappear or not:

- `Engine`, e.g. `state.n_iteration`, `state.n_epoch` and so on. Some guys would maybe search for them directly in `Engine`.
- `metrics` split apart is to tell the user which metric is attached to which engine.
- `Engine.state.metrics` (which is actually `metrics` and `output_handlers`, because they are entangled for some reason) in every `Engine`. So when you want to check your implemented metrics, you need to turn on all `Engine`s (which normally does not happen) and then go through all `Engine.metrics`... in reality you "script-scroll".
- `Engine`s are holding `loggers`, which is also a kind of redundant `state` split.
- `another_engine`: `engine` and `another_engine`, this really sounds to me like glue to stick together something that belongs together.
- (like `EnginesMetricsComparisonCharts` in the framework)?

Possible implementations regarding above:

- Should `Engine` variables like `n_iteration`, `n_epoch` be better assigned directly to `Engine`?
- Is splitting the `state` scope by 2 or 3 engines useful with its redundant `metrics`/`output_handlers`, `logger` (and whatever comes in point 3), or should all metrics and output handlers and loggers be categorized together? The second choice is a major change in current Ignite.
- Register custom `Events` while created, so the user has more overview? Still they won't know for which engine they are allowed, right? I had already implemented something like this for the current Ignite, but didn't PR it, because I started working on the more complete solution.

3. What belongs in `state`?

`Engine.state` + `Engine` scope (as said above, that's the current effective `state` scope) includes:

- `engine`: `output`, `n_iterations`, `n_epoch`, `iteration`, `batch` and so on
- `metrics` & `output`
- `dataloader`
- `observation`, `param_history`, `running_reward`, `timestep`, and so on

The framework `state` holds a scope that already does not differ too much from the current Ignite `state`; the framework `state` scope looks larger mainly because of the substructure that sorts and categorizes the objects, which is missing in the current Ignite `state`, resulting in more & more "scattered stuff".

Possible implementations regarding above:
4. "Register `Events` together"

- `Engine.allowed_events`. So first, can you see somewhere all `Events` currently existing? Only the default `Events`, right? Then you check in each engine if it's "allowed" and register it if not?
- `Engine._event_handlers[event_name]`. When looking at `Engine._event_handlers[event_name]`, do you still remember where you implemented that event and for what?
- In `state` you go to your desired "event"/state object, check if there's a `<state_object>_callback` variable, maybe have a quick look at the other callback descriptions of already attached processes listed there & then write `state.desired_state_object = process`. The `process` pops up e.g. in the `state.engines.trainer.n_iteration_completed_callbacks` list with an auto-generated description, done!

Possible implementations regarding above:

Puh!!! It's hard... ideas?
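For reference, a rough sketch of the explicit path in current Ignite that this point is about, based on my understanding of the public `register_events`/`fire_event` API (the event name `"lr_updated"` is made up for the illustration):

```python
from ignite.engine import Engine, Events

trainer = Engine(lambda engine, batch: batch)

# 1) the custom event has to be created and registered on this particular engine
trainer.register_events("lr_updated")

# 2) handlers are attached to it just like to the predefined Events
@trainer.on("lr_updated")
def log_lr_update(engine):
    print("lr updated at iteration", engine.state.iteration)

# 3) and it has to be fired explicitly wherever the corresponding value changes
@trainer.on(Events.ITERATION_COMPLETED)
def update_lr(engine):
    # ... change the learning rate here ...
    engine.fire_event("lr_updated")

trainer.run([0, 1, 2], max_epochs=1)
```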
5. Entangling multiple features in one class causes code redundancies with all its bad consequences:

- the `engine`: inside this Ignite feature it is passed again to another Ignite feature. This messes up the feature hierarchy level, as metrics and output handlers are equal features. It creates an unnecessary substructure in the architecture.
- e.g. `DataLoader()` or `SummaryWriter()`.
- `Pipeline`s and `<Feature>Composition`s are used to coordinate multiple low-level features to provide the user 1 API providing multiple tasks.

Possible implementations regarding above:

Split all features so they only provide exactly one Ignite-task:
So above are just some possible improvements extracted from the framework to implement in the current Ignite architecture. When you think about implementing some of them, this will affect more or less all classes and definitely will not be straightforward! On the other hand, the proposed framework solves all these problems already (and many more) and transferring classes is really straightforward, often simplifying the feature code... So above are the first improvement proposals, let's think about it :-)
hey, didn't know pytorch-slack is restricted to certain company emails. i'm not working for any of them. (...& none of them belong to me either;-) )
@DrStoop please give me your email and I'll try to get you an invitation
Let's try to discuss each point separately:
Why is state None at the beginning and reset at each run? Related questions: https://github.com/pytorch/ignite/issues/297 and https://github.com/pytorch/ignite/pull/640
This is "historical" and sort of relies on a pattern we would like to adopt. Currently, we have the following logic:
- At the first call, new state is defined by `max_epochs`, `epoch_length`, `seed` if provided.
- If state is already defined such that there are iterations to run until `max_epochs` and no input arguments provided, state is kept and used in the function.
- If state is defined and engine is "done" (no iterations to run until `max_epochs`), a new state is defined.
- If state is defined, engine is NOT "done", then input arguments if provided override defined state.
This covers the cases:
reset
As of today we added the logic to set up the engine's state using `load_state_dict`, so maybe we can reconsider `Engine.state = None` at `__init__`. However, we should not break the covered use-cases.
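If it helps, here is the branching described above written out as a simplified illustration (this is not the actual `Engine.run` code, just the decision logic in plain Python):

```python
from ignite.engine import State

def resolve_state(engine, max_epochs=None, epoch_length=None, seed=None):
    if engine.state is None:
        # first call: a new state is defined from the provided arguments
        return State(seed=seed, max_epochs=max_epochs, epoch_length=epoch_length, metrics={})
    if engine.state.epoch >= engine.state.max_epochs:
        # engine is "done": a new state is defined
        return State(seed=seed, max_epochs=max_epochs, epoch_length=epoch_length, metrics={})
    # engine is NOT "done": keep the state, but let explicitly provided
    # arguments override the stored values
    if max_epochs is not None:
        engine.state.max_epochs = max_epochs
    if epoch_length is not None:
        engine.state.epoch_length = epoch_length
    if seed is not None:
        engine.state.seed = seed
    return engine.state
```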
@DrStoop ping on
I experienced limitations when working with more complex use cases, e.g. multiple engines (up to 3 very common) with customized events and metrics
Could you provide more details or a code snippet for this please?
Okay, let me just say my thoughts on this:
1.) I agree, that the state must not be transient, if it does not hold things like the model and the optimizer. But we have to reset it properly.
2.) The point of splitting up the state is that in your run not all engines might always be related to one model. This is why we are aiming for most flexibility and thus the most local scope possible.
3.) We can have this scope as you propose, if we will have some stuff (like the data loaders) wrapped with weak refs (a small sketch follows below this list).
Regarding 4. and 5.) I would really postpone this until we have decided on and implemented parts 1.)-3.), since this will most likely change otherwise.
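For 3.), a small sketch of what I mean with weak refs, e.g. a plain `weakref.ref` around the data loader (names are only illustrative):

```python
import weakref

import torch
from torch.utils.data import DataLoader, TensorDataset


class SketchState:
    """Sketch: the state only holds a weak reference to heavy objects."""

    def __init__(self, dataloader):
        self._dataloader_ref = weakref.ref(dataloader)

    @property
    def dataloader(self):
        loader = self._dataloader_ref()  # becomes None once the loader is garbage collected
        if loader is None:
            raise RuntimeError("dataloader is gone; the state only kept a weak reference")
        return loader


train_loader = DataLoader(TensorDataset(torch.randn(8, 2)), batch_size=4)
state = SketchState(train_loader)
print(len(state.dataloader))  # -> 2 batches
```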
Sure, my email: berr@gmx.de, real name if required: Maximilian Berr.... thx @vfdev-5 !
give me a sec on the multi engine snippets... coming up
@DrStoop I sent a request to send you an invitation. Probably, we need to wait until the morning in US/NYC zone...
@DrStoop ping on
I experienced limitations when working with more complex use cases, e.g. multiple engines (up to 3 very common) with customized events and metrics
Could you provide more details or a code snippet for this please?
@vfdev-5 regarding 3 engines:
You have 3 engines, `state.trainer` (check `state.engines.trainer`), `state.xvalidator` & `state.evaluator`, with different process functions and dataloaders (and, if you wanted, also different metrics, but here they're the same):

- `state.trainer`: `process=update`, `dataloader=state.trainer_loader` for training the model
- `state.xvalidator`: `process=infer`, `dataloader=state.xvalidator_loader` with only 200 inference samples, e.g. for early stopping (not implemented in this example)
- `state.evaluator`: `process=infer`, `dataloader=state.evaluator_loader` testing the final training result on test data

Snippet from the example:
```python
### ENGINES ###
#==============
# Trainer
# NOTE:
# - Below the maximum setting of default values is used.
# - The `Engine` is assigned to `state.engines.trainer`.
# - The argument `engine_run_started_ref=None` means that the calling state object is not set, so `Engine` will
#   NOT be called. (Default values are always placeholdered by ``''``)
Engine(name='trainer',
       process=state._update, dataloader=state.trainer_loader,
       engine_run_started_ref=state.state_run_started_ref)

# X-Validator
# NOTE:
# - To parametrize the engine starting callback use:
#   `engine_run_started_ref=state.trainer.get('n_samples_every', state.n_xval_step_samples, 'ref')`
Engine(name='xvalidator',
       process=state._infer,
       dataloader=state.xvalidator_loader,
       engine_run_started_ref=state.trainer.get('n_samples_every', state.n_xval_step_samples, 'ref'),
       n_samples=state.n_xval_samples)

# Evaluator
# NOTE: Here is demonstrated how arguments can also be skipped during initialization and provided afterwards,
#       namely `dataloader` and `engine_run_started_ref`
Engine(name='evaluator',
       process=state._infer,
       n_epochs=state.n_eval_epochs)

# Setting up `state.evaluator`'s dataloader and callback after initialization
state.trainer.engine_run_completed = ('start manually appended `state.evaluator.run` with `state.evaluator_loader`',
                                      state.evaluator.run, {'dataloader': state.evaluator_loader})
```
Now, the following is an API you could not implement in current Ignite without using `another_engine` and `a_third_engine`, because features normally only reference one `engine`, or am I wrong?
For the comparative multi-graph charts in this script you'll find (snippet from the example):
```python
# Accuracy charts of all engines, 3 metrics in 1 chart
# NOTE: Here the maximum default values are used, setting as little as necessary
# EXTRA NOTE: Leaving out the `chart_name` in charts with many `y_axes_refs` usually leads to very longish chart names
ScalarsChart(x_axis_ref=state.trainer.n_samples_ref,
             y_axes_refs=[state.get(trainer_accuracy_name).output_ref,
                          state.get(xvalidator_accuracy_name).output_ref,
                          state.get(evaluator_accuracy_name).output_ref],
             caller_refs=[state.get(trainer_accuracy_name).output_ref,
                          state.xvalidator.engine_run_completed_ref,
                          state.evaluator.engine_run_completed_ref])

# Loss charts of all engines, 3 metrics in 1 chart
# NOTE:
# - Here maximum manual individualizations are called, some arguments are redundant in the way that they
#   equal (or behave equally to) the default value, e.g. the `caller_refs` would result in the same chart.
# - The `walltime_ref` uses the state timer `state.maintenance.state_timer.state_run_started_time_in_sec`
ScalarsChart(x_axis_ref=state.trainer.n_samples_ref,
             y_axes_refs=[state.get(trainer_loss_name).output_ref,
                          state.get(xvalidator_loss_name).output_ref,
                          state.get(evaluator_loss_name).output_ref],
             caller_refs=[state.trainer.n_samples_every_100_ref,
                          state.xvalidator.engine_run_completed_ref,
                          state.evaluator.engine_run_completed_ref],
             y_names=['trainer_loss', 'xvalidator_loss', 'evaluator_loss'],
             chart_name='engines_loss_comparison',
             bso_ctr_name_suffix='manually_set_chart',
             summary_writer=state.tensorboard.summary_writer,
             summary_description='manually written description for summary adding',
             walltime_ref=state.state_run_started_time_in_sec_ref)
```
This results in the tensorboard charts posted above (where I should mention that the upper charts `evaluator_trainer_xvalidator/accuracy` & `loss` do not yet plot the evaluator graph because I stopped the engines before it was calculated).

Note: The high-level code line below replaces all the `ScalarsChart()`s above, and would do so for any number of engines and metrics (snippet from here):
```python
# Automatically identify and generate metric charts comparing the different engines
EnginesMetricsComparisonCharts(x_axis_ref=state.trainer.n_samples_ref, n_identical_metric_name_suffixes=1)
```
Now, you may argue that `state.xvalidator` and `state.evaluator` have the same `process` function & could be fed with different dataloaders. Yes, that's right, and you can also do that in the framework. Still you have 2 engines to deal with in an API for comparative metric charts. And normally you have different processes attached to `state.evaluator` and `state.xvalidator`, i.e. `state.xvalidator`'s loss metric may trigger the early stopping and model checkpointing and have different metrics compared to the final `state.evaluator` that only calculates the end success with a loss metric.
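For comparison, this is roughly how that part looks in current Ignite (a hedged sketch based on my reading of `ignite.handlers`; it assumes the `trainer`/`xvalidator` engines, the `model` and an "nll" metric from the example above):

```python
from ignite.engine import Events
from ignite.handlers import EarlyStopping, ModelCheckpoint

def score_function(engine):
    # higher is better for EarlyStopping/ModelCheckpoint, so negate the loss
    return -engine.state.metrics["nll"]

# early stopping & checkpointing are driven by the validation engine only
early_stopper = EarlyStopping(patience=5, score_function=score_function, trainer=trainer)
xvalidator.add_event_handler(Events.COMPLETED, early_stopper)

checkpointer = ModelCheckpoint("checkpoints", "best", score_function=score_function, n_saved=2)
xvalidator.add_event_handler(Events.COMPLETED, checkpointer, {"model": model})
```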
When thinking about complex use cases, one idea for simple parallel/distributed computing would be to implement all required infrastructure into, let's say, a `StateProcessPoolVariable` that replaces the current (slightly modified) callbacks `list` with a `multiprocessing.Pool` (or a `StateThreadPoolVariable` for `multiprocessing.pool.ThreadPool`). You use this `StateProcessPoolVariable` for `state.state_run_started` and append all kinds of engines training in parallel. The user realises nothing about all that, she/he just initializes the different engines with `Engine(..., engine_run_started_ref=state.state_run_started_ref, ...)` and finds all engines listed in `state.engines`. The parallel/distributed state object `state.engines.state_status.state_run_started` will trigger all the engines appended to its callbacks in a `multiprocessing.Pool` (or `ThreadPool`) - everything is parallelized. In this case you will need tons of features that can handle multiple engines in parallel. Can you implement that in current Ignite?

...in the framework that is not implemented yet, but that is just doing... you could also just switch parallel/distributed computing on/off with a default switch in `state.configs.default_configs.parallel_processing = True`.

The nice thing about the metaclass architecture is that you can theoretically rewrite everything before starting the engines without messing with the actual architecture. You can isolate tools like that from the rest in state plugins.
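To make the idea a bit more tangible, a very rough, self-contained sketch of the mechanism only (nothing like this exists yet, neither in Ignite nor in the framework; the class name is made up):

```python
from multiprocessing.pool import ThreadPool


class ParallelCallbackList:
    """Sketch: a callback container that fires all appended callbacks in a thread pool."""

    def __init__(self, n_workers=3):
        self._callbacks = []
        self._pool = ThreadPool(n_workers)

    def append(self, description_and_callback):
        self._callbacks.append(description_and_callback)

    def fire(self):
        # run all registered callbacks (e.g. the engines' run methods) concurrently
        async_results = [self._pool.apply_async(callback) for _, callback in self._callbacks]
        return [result.get() for result in async_results]


state_run_started = ParallelCallbackList()
state_run_started.append(("start trainer", lambda: print("trainer.run()")))
state_run_started.append(("start xvalidator", lambda: print("xvalidator.run()")))
state_run_started.fire()
```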
Actually, just realized that our more advanced examples have also 3 engines. For example,
In our case, similarly to yours, the 3 engines are a trainer, an evaluator on validation and an evaluator on a training subset. The two evaluators cannot simply be merged into one, as the validation evaluator has more handlers: save best model on validation score, early stopping etc...
following is an API you could not implement in current Ignite without using another_engine and a_third_engine, because features normally only reference one engine, or am I wrong?
This feature is a shortcut for `global_step_transform`, which can be configured as the user would like and with any number of engines...
Well, sorry, it is a bit difficult to follow your explanations without a context. In my case, DDP works OK for 3 engines as in the above example using `torch.distributed.launcher`. Such that the whole code is running in a process and the only synchronization on Ignite's side should be done in metrics (reduction across the processes of the metric's ingredients).
I agree that if we try to code process splitting inside a single Python script spawning engines, metrics could be an issue as stated in this issue: https://github.com/pytorch/ignite/issues/623
Okay, let me just say my thoughts on this:
1.) I agree, that the state must not be transient, if it does not hold things like the model and the optimizer. But we have to reset it properly.
Correct, there are many ways to easily reset. With resetting in mind I implemented `StateObject.__delete__` to reset the current value to its `initial_value`. (Also because deleting a class descriptor from its class just does not make sense at all.)
In general you can implement it in 2 ways:

- Implement the reset specifically in the feature itself (e.g. in `BaseFeatureType`) and append it to the according state object callbacks (e.g. `state.trainer.engine_run_completed` or `state.state_run_completed`) during `__init__()`, e.g.:
```python
from ignite_framework.states import state
from ignite_framework.transitions import BaseTransition
from ignite_framework.feature_dev_tools.state_objects import StateVariable


class NewFeatureWithResetEachStateRun(BaseTransition):
    a_hyperparam = StateVariable(initial_value=0.42)

    def __init__(self, name):
        super().__init__(name=name)

        def reset(param_ref):
            # deleting resets the state object to its `initial_value`
            del param_ref.caller_name

        state.engines.state_status.state_run_completed = ('reset `a_hyperparam`', reset, [self.a_hyperparam_ref])


if __name__ == '__main__':
    NewFeatureWithResetEachStateRun(name='demo_resetter')
    # Change hyperparameter value (Note: shortcut for `state.transitions.demo_resetter.a_hyperparam = 0.234...`)
    state.demo_resetter.a_hyperparam = 0.234426523
    print('Before state run completed: ' + str(state.demo_resetter.a_hyperparam))
    # End state run (Note: shortcut for `state.engines.state_status.state_run_completed = True`)
    state.state_run_completed = True
    print('After state run completed: ' + str(state.demo_resetter.a_hyperparam))
```
The debugger overview results as:
![reset_feature_in_debugger](https://user-images.githubusercontent.com/19177740/75461096-ef196800-598a-11ea-8fc6-924263bae5c1.png)
The console result:
```python
Before state run completed: 0.234426523
After state run completed: 0.42
```
- Centrally (e.g. from `state.maintenance`): reset e.g. all parameter schedulers found in `state` when `state` completes its run, also easy.

2.) The point of splitting up the state is that in your run not all engines might always be related to one model. This is why we are aiming for most flexibility and thus the most local scope possible.
Which engine uses which model is defined by the `process` function, which again is unique for each engine in both current Ignite & the framework. So this is unrelated to the question of splitting up the state.

In fact, Ignite could be termed as too "chunky" in a way regarding its referencing method. This is why:

- For `state` we are discussing a splitting factor of 2 or 3 (`engine.state`s vs. one framework `state`). That's not a big difference.
- Current Ignite hands a feature the whole engine just to use e.g. `engine.out` in a metric. That's a huge chunk of unused objects!
- The framework has the `StateObjectsReference` class, which slices out a single state object from the complete `state` as a reference and provides it as the exactly needed argument. This is implemented in a pretty syntax:
```python
# Create reference object with simple suffix `_ref` at the end of the state object name
>>> nicely_sliced_state_object_reference = state.state_run_started_ref
# Get & set the value via `caller_name`
>>> nicely_sliced_state_object_reference.caller_name
Out[3]: False
>>> nicely_sliced_state_object_reference.caller_name = True
>>> nicely_sliced_state_object_reference.caller_name
Out[5]: True
>>> nicely_sliced_state_object_reference.caller_name_callbacks
Out[6]: [('start timer state.state_timer.state_run_started_in_sec/min/h',
          <bound method StateTimer.start_timer of <ignite_framework.states.state_plugins.state_timer.StateTimer object at 0x7f4921b726a0>>,
          [], {'timer_name': 'state_run_started'}),
         ('sync state.configs_status.config_run_started', <function ...  # and so on
```

3.) We can have this scope as you propose, if we will have some stuff (like the data loaders) wrapped with weak refs.
Good point!
Regarding 4. and 5.) I would really postpone this until we decided and implemented the parts 1.)-3.) Since this will most likely change otherwise.
Actually, just realized that our more advanced examples have also 3 engines. For example,
* https://github.com/pytorch/ignite/blob/master/examples/references/segmentation/pascal_voc2012/code/scripts/common_training.py#L87
In our case, similarly to yours, the 3 engines are a trainer, an evaluator on validation and an evaluator on a training subset. The two evaluators cannot simply be merged into one, as the validation evaluator has more handlers: save best model on validation score, early stopping etc...
following is an API you could not implement in current Ignite without using another_engine and a_third_engine, because features normally only reference one engine, or am I wrong?
This feature is a shortcut for `global_step_transform`, which can be configured as the user would like and with any number of engines...
okay, is this what `global_step_transform` is doing?

- `trainer` is used as a global object
- `global_step_transform` is used as a reference object that slices out `trainer.state.iteration` and passes it into the `OutputHandler`?

Comparison with the Ignite framework: `global_step_reference = state.trainer.n_iteration_completed_ref` ... works with arbitrary `state.<state_object_name>_ref`.

Now compare the current Ignite:
```python
# current Ignite
tb_logger = TensorboardLogger(log_dir="experiments/tb_logs")

def global_step_transform(*args, **kwargs):
    return trainer.state.iteration

tb_logger.attach(evaluator,
                 log_handler=OutputHandler(tag="validation",
                                           metric_names=["nll", "accuracy"],
                                           global_step_transform=global_step_transform),
                 event_name=Events.EPOCH_COMPLETED)
```
And here the Ignite framework:
```python
# Ignite framework
ScalarChart(x_axis_ref=state.trainer.n_iteration_completed_ref,
            y_axis_ref=state.evaluator_accuracy.output_ref,
            caller_refs=state.evaluator.engine_run_completed,
            walltime_ref=state.state_run_started_time_in_h_ref)
```
Difference noticeable? In the framework I added the effective runtime of the state as `walltime`; that's not implemented in current Ignite.
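Side note: if the installed Ignite version already ships `global_step_from_engine` (I believe recent versions do, under `ignite.contrib.handlers`), the `global_step_transform` closure from the "current Ignite" snippet above can be replaced by that helper; a hedged sketch reusing the objects from that snippet:

```python
# assumes tb_logger, evaluator and trainer from the "current Ignite" snippet above
from ignite.contrib.handlers import global_step_from_engine
from ignite.contrib.handlers.tensorboard_logger import OutputHandler
from ignite.engine import Events

tb_logger.attach(evaluator,
                 log_handler=OutputHandler(tag="validation",
                                           metric_names=["nll", "accuracy"],
                                           global_step_transform=global_step_from_engine(trainer)),
                 event_name=Events.EPOCH_COMPLETED)
```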
Parallel/Distributed computing
Well, sorry, it is a bit difficult to follow your explanations without a context. In my case, DDP works OK for 3 engines as in the above example using `torch.distributed.launcher`. Such that the whole code is running in a process and the only synchronization on Ignite's side should be done in metrics (reduction across the processes of the metric's ingredients).

ah, I'll check... totally agree on not implementing anything already existing in PyTorch!

I agree that if we try to code process splitting inside a single Python script spawning engines, metrics could be an issue as stated in this issue: #623
Hi!
I was thinking for some time about the current architecture of Ignite and how to improve the workflow both during application development (writing training scripts) and during feature development. When writing training scripts I could not code the individual training I wanted, and when trying to code the feature for my training script I ended up writing infrastructure/architecture code instead of implementing the feature. There were some kinds of restrictions I first couldn't identify...
Nevertheless, after a while I found 2 twists, actually nothing big... so here I want to come up with an under-the-hood framework to integrate into Ignite that solved all my problems and many open issues! Nice, right?
With this framework integrated into Ignite you achieve an extremely nice overview during debugging, enhance Ignite to a rapid feature-dev tool, can handle far more complex (individual) use cases while achieving a higher degree of automation at the same time, and have quite some new features and many more possibilities for more syntactic sugar.
But now comes the... BUT: as I tried to fix it, unfortunately I had to realize it won't work without major revisions. For that I went a long way to really provide proof and facts - something you can play with to make up your opinion - before daring to suggest a major revision...
So, if you're interested in an up-&-running "what-if-when" Ignite version with the 2 twists below untwisted, please have a look at the repository and the documentation and leave me feedback - I'd really like to know your opinion.
In case enough of you like it and could imagine integrating the framework into Ignite, I could open a pull request with the code on an experimental branch and we see how it goes from there. (Note: I just pushed it to another repository because as far as I know you cannot open a pull request for a new branch - which this definitely needs.)
Everything else you need to know you will find in the Ignite Framework repo and the docu. For bugs & questions, let me know, thx!
So, set up your first coffee & enjoy playing!
Teaser from the documentation
Two issues
I am a fan of Ignite and that's why I'm trying to contribute, but I discovered 2 shortcomings in the architecture and the implementation that caused me quite some restrictions and had me coding infrastructure instead of programming new features (which is what I actually wanted to do). The issues are:
1. `Engine` is the architectural center with the training `state` as attribute. The training `state` attribute is a transient object that is only instantiated when the `Engine` is in `run` mode and vanishes afterwards. Also the `state` holds only a selective fraction of all variables and parameters that make up the real training state. So `Engine` is a kind of static object and `state` is transient. This does not represent the reality of the training process. In reality the training starts with an initial state holding all variables and parameters, including e.g. model variables, hyperparameters etc., which then are modified while the state goes through different transitions. The main transitions of the `state` are `Engine`s (normally more than one). So the state should be the architectural center holding ALL variables, parameters, values, transitions etc., and the Engine is (just) the main transition of the state. This small twist causes quite some complications for features and APIs, which are listed below.

2. `Event` is an `Enum` that has to be explicitly `fire_event`ed, and implicitly `_fire_event`ed, so the `event_handlers` handle further callbacks. The `Event` is always fired after some other training value has changed, e.g. after the model output was updated, `ITERATION_COMPLETED` is fired. Also, if you want to fire a non-standard event, you first have to create it, register it at each `Engine` that is supposed to use it, and then the firing has to be implemented... But in reality an Event is nothing more than a value change of a training state variable that triggers callbacks. So all these pieces above can be put together by implementing a state variable as a descriptor.
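As a minimal sketch of that last idea in plain Python (nothing framework-specific assumed): a descriptor whose value change is the event and directly triggers the registered callbacks:

```python
class StateVariable:
    """Sketch: a data descriptor that fires callbacks whenever its value is set."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__.get(self.name)

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value
        # the value change itself is the "event": no explicit fire_event() needed
        for description, callback in instance.__dict__.get(self.name + "_callbacks", []):
            callback(value)


class TrainerState:
    n_iteration_completed = StateVariable()

    def __init__(self):
        self.n_iteration_completed_callbacks = []


state = TrainerState()
state.n_iteration_completed_callbacks.append(("log the iteration count", print))
state.n_iteration_completed = 1  # prints "1" via the registered callback
```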
Improvements from an underlying framework

You will experience the improvements given by the framework when working on all 3 levels: application implementation, feature development and framework development. The separation of these working areas is already the first improvement. Try out the benefits in detail & hands-on for the first two levels in the Quickunderstanding Application and Quickunderstanding Feature Dev.
Before you go through the theoretically described enhancements, these few no-comment teasers of the training `state` in the debugger will give you nice insights into what's ahead. It shows the Ignite example mnist_with_tensorboard.py transferred to the framework architecture just before the engines are started:

And here the engine `state.engines.trainer` unfolded:

Or setting up all the below Tensorboard charts with these two simple commands:
By the way, if you had set up 10x more metrics and some more engines, these two commands would not change and would still provide all comparative and single-metric charts of all engines.
Soooo, if you're interested, then grab a coffee and press >>>PLAY<<<!