mila-iqia / blocks

A Theano framework for building and training neural networks

Event Based Main Loop #87

Closed rizar closed 9 years ago

rizar commented 9 years ago

Jan's recent post in #85, revealing his vision of how monitoring should work, made me think further about what kind of main loop we want to end up with. If we take the old Groundhog and PyLearn2 main loops and push almost everything out of them, as Jan suggests, we end up with a tiny skeleton managing a set of callbacks. In this way we are moving towards an event-based framework instead of one with a fixed interaction scenario, even though that sounds a bit scary at first sight.

Another argument to show that we are very close to introducing events: in Montreal we discussed the interface between the datasets and the rest of the library. We converged to an object, which Jan calls "a view" and I call "a data stream", that creates an epoch iterator, which in turn creates a batch iterator. Whereas this covers many common cases like MNIST (where an epoch is a pass over the dataset) and nearly endless datasets like the one we used in machine translation (where an epoch can be defined as, e.g., 100 batches), it does not scale: if I am training a skip-gram model and I want to do something when I have processed all n-grams from a sentence, I have to declare a sentence an epoch, which might not be a good decision in other respects. If I want arbitrarily many time scales (word, sentence, paragraph, document, book), it becomes very challenging unless a notion of an event is defined. In fact, Jan already argued we need it, but he proposed to mix events with the data itself, which Bart and I did not embrace.

So much for the introduction; here comes the point: let's look at the main loop as a scheduler that triggers certain handlers when certain events happen. Examples of events include a tick of the main loop (a batch has been fetched and a descent step done), the end of an epoch, and a keyboard interrupt.

When we have multiple events happening simultaneously (e.g. a tick and an epoch), the order of their processing should somehow be determined. We could have a priority associated with every event, e.g. 0 for a tick, 1 for a keyboard interrupt, -1 for an epoch. Events with a higher priority are handled before events with a lower priority.

Also, we will typically have a few handlers for every event, to be executed in order. Let's recall Jan's example: we want to save after a validation. To do that we assign a chain of two handlers to the EpochEndEvent: validation followed by saving.
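To make this concrete, here is a rough sketch (every name below is made up, nothing here is committed code) of such a scheduler: events carry priorities, each event owns an ordered chain of handlers, and handlers receive the main loop so that they can push further events.

import heapq
import itertools


class EventMainLoop(object):
    def __init__(self):
        self.handlers = {}           # event name -> ordered chain of handlers
        self._queue = []             # entries are (-priority, tie_breaker, event)
        self._counter = itertools.count()

    def add_handler(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def push(self, event, priority=0):
        # Higher-priority events are processed first; ties keep push order.
        heapq.heappush(self._queue, (-priority, next(self._counter), event))

    def run_pending(self):
        while self._queue:
            _, _, event = heapq.heappop(self._queue)
            for handler in self.handlers.get(event, []):
                handler(self)        # handlers can push new events via the loop


def validate(main_loop):
    print('validating')


def save(main_loop):
    print('saving')                  # runs after validation: handlers form a chain


loop = EventMainLoop()
loop.add_handler('epoch_end', validate)
loop.add_handler('epoch_end', save)
loop.push('epoch_end', priority=-1)
loop.push('keyboard_interrupt', priority=1)  # handled before the tick
loop.push('tick', priority=0)
loop.run_pending()                           # interrupt, then tick, then epoch end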

There might be a need for interaction between handlers. E.g. if we want to finish an epoch and terminate after a keyboard interrupt is received, two actions have to be done for two different events: after the interrupt we should set a flag, which should then be checked at the end of each epoch. The backend monitor proposed by Jan can accommodate this flag. In fact this monitor is much more than a logging mechanism: it is a persistent time-indexed array through which the components of the system can interact with each other, and the user can access the whole interaction history.

It is possible that handlers will generate events as well. E.g. the EpochEndEvent will be triggered by a handler of the TickEvent, the main one, responsible for fetching data and doing a descent step. To allow that, every single component of the system must have access to the main loop object. It looks a bit weird when even iterators have a link to the main guy, but perhaps this is the price to pay.

This text should obviously be followed by lots of examples, but so far I will just share these raw thoughts because I do not know when I will have time to write some code. The main point is: the generality we aim for requires complete decentralization, and for this I think we have to think in terms of events, not in terms of a fixed algorithm with a few hooks.

bartvm commented 9 years ago

Sorry, completely forgot to respond to this earlier.

Maybe I need to see the code, but at first sight I don't like the idea of an event-based framework. Normally an event-based framework needs to respond to events/signals that are triggered externally, stochastically, etc. In our case there might be many things going on, but they always happen in a deterministic order, are completely controlled by the framework, and don't need to be managed asynchronously. As such, it feels like we are adding extensive, non-obvious design patterns to handle corner cases.

It is clear that GroundHog's and Pylearn2's logging functionality is too limited, but people have used it successfully for a long time without any persistent complaints that I am aware of. I was warned by yourself, @rizar, that we shouldn't try to make things too general and too perfect. It seems that we've gone from the very rudimentary monitoring in GroundHog (batch only) and Pylearn2 (in between epochs only) to a complete event-based framework with callbacks, managing multiple time scales using "aggregation schemes", handling cold and warm shutdowns, etc. Perhaps some of these things shouldn't be built in; e.g. it's not the end of the world if someone who wants to monitor statistics on 20 different time scales is expected to write a custom iterator which logs the data necessary for him to reconstruct the graphs he needs. If we have stateful "data streams" anyway, can't all these different time scales just be part of their state, which is logged together with all the other data for the user to go through?

rizar commented 9 years ago

It is true that we have no stochasticity or external influence to be handled in our framework. But as you said, we have a lot of deterministic stuff happening, each time with a different frequency and/or in a different order. I do not know a rigid main loop scheme that would suffice for all the cases I care about, so I propose to replace it with a constructor of such schemes. For the most frequent cases, e.g. when we iterate over a finite dataset with validation and saving, a wrapper may be designed.

I am afraid that you are wrong that logging from old frameworks did not create complaints. Ask @janchorowski what he thinks about it.

Perhaps some of these things shouldn't be built in; e.g. it's not the end of the world if someone who wants to monitor statistics on 20 different time scales is expected to write a custom iterator which logs the data necessary for him to reconstruct the graphs he needs. If we have stateful "data streams" anyway, can't all these different time scales just be part of their state, which is logged together with all the other data for the user to go through?

Not sure I understand what you mean here. But I want to clarify: it is not only monitoring that I would like to do when certain data-dependent events are triggered; I might also want to save a model or do whatever else.

Regarding extra generality: while the warning is fair, rewriting PyLearn2 and reproducing its deficiencies would also be undesirable. It was hardcoded there that a training procedure includes a very particular type of validation, serialization-based saving, epoch handling, and that's it. It is precisely because of this assumption that it is so hard to reuse in any non-standard situation. But blocks is a different sort of framework: we provide highly reusable tools without imposing a philosophy. We work at a lower level, and the lowest possible level for writing a main loop is to view it as just a neat way to arrange a bunch of callbacks.

bartvm commented 9 years ago

I'm aware of @janchorowski's complaints, but if it's just 1 or 2 users (and I don't know of more), it's likely that it's a corner case. I think it's important we support those corner cases, but I think it's okay if we expect a little bit more work from those users instead of making the code hard to read for the simple cases.

If it's a matter of measuring over multiple time scales, I don't see what's wrong with just using the data stream to denote these time frames. It is stateful after all, so we can just log that state as part of everything else. If you wanted fine-grained statistics over words, sentences, paragraphs and documents, you would simply store e.g. the likelihood of each word together with the sentence, paragraph and document IDs as given by the data stream. You can then just use a bit of Pandas magic to calculate the average likelihood of each sentence, paragraph, etc.
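To illustrate (a sketch only, with made-up column names), assuming every word-level record was logged together with its sentence, paragraph and document IDs:

import pandas as pd

# One row per word, as it could be logged during training.
records = pd.DataFrame([
    {'log_likelihood': -2.1, 'document': 0, 'paragraph': 0, 'sentence': 0},
    {'log_likelihood': -1.4, 'document': 0, 'paragraph': 0, 'sentence': 0},
    {'log_likelihood': -3.0, 'document': 0, 'paragraph': 0, 'sentence': 1},
    {'log_likelihood': -0.7, 'document': 1, 'paragraph': 0, 'sentence': 0},
])

# Average word log-likelihood per sentence and per document.
per_sentence = records.groupby(['document', 'paragraph', 'sentence'])['log_likelihood'].mean()
per_document = records.groupby('document')['log_likelihood'].mean()
print(per_sentence)
print(per_document)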

rizar commented 9 years ago

I think it is just pointless to complain about something hardcoded in the very heart of a system whose development is effectively frozen, so people did not. In addition, most people still work with finite datasets of images, for which the shortcomings of PyLearn2 stay hidden.

Your last paragraph was not clear to me, so I suggest we switch to the language of examples. Suppose I am training an NLP model on Wikipedia, want to save every time a sentence is processed, and want to run a validation procedure every time a document is processed. How do I do that?

I am leading you to the point that having a single event, "epoch", hard-coded by means of the "iterator of iterators" paradigm, is a rather ugly and very domain-bound decision that we might have to pay for.

janchorowski commented 9 years ago

Hi,

sorry to both @rizar and @bartvm: I overlooked the thread and noticed it only yesterday when github sent me the update emails. I tend to agree with both of you (even though you disagree). First of all, as @rizar points out,

we provide highly-reusable tools without imposing a philosophy.

which aligns well with @bartvm's suggestion to code for the common case, but move as much as possible into supporting functions, which will allow easy coding of a custom main loop. In my intuition a properly coded common main loop shouldn't be longer than 20 lines of pure code (no docstrings).

To behave better in the new year, I'll stop complaining and say what I actually like about PL2 and GH:

In PL2 I like the idea of moving most decisions out of the training loop through TrainExtension and TerminationCriterion. I (and many other people) miss the ability to monitor things on a smaller-than-epoch time scale. What I don't like is that while we have classes for Dataset, Channel, Monitor, Model, etc., they all depend on each other in non-trivial ways. While the design looks modular, every single piece of it is so deeply tied to all other pieces that it is difficult to pull something out and replace it.

In GH I like the way the model says what it wants monitored through the properties field and that it is recorded for each mini-batch. I don't like how the main loop tries to be everything for everyone: doing validation, monitoring (on a limited number of data sets), running hooks, and finally assuming that all training algorithms will be some form of gradient descent on small mini-batches.

I'm coding some proof of concept right now, and should push it later today or tomorrow to have some idea of how it may look.

rizar commented 9 years ago

@janchorowski , you may also take a look at my pull request #88.

bartvm commented 9 years ago

Given that this is a pretty uncommon use case (saving every sentence, validating every document), I would argue the user can be expected to write his own main loop, which would look like:

current_sentence = current_document = None
while True:
    minibatch = next(data_stream)
    if data_stream.sentence != current_sentence:
        serialize(computation_graph)
        current_sentence = data_stream.sentence
    if data_stream.document != current_document:
        monitor.validate(validation_set)
        current_document = data_stream.document
    training_algorithm.train(minibatch)

That is, if the user wants to perform actions based on the state of the iterator beyond just being exhausted or not (which signals the end of an epoch), they are expected to request the state of the data stream and program their own logic to act on it.

rizar commented 9 years ago

Well, I see, you both offer writing a new main loop to solve a new task. While I agree that it will do the job, I argue that with such an approach we miss code-sharing between main loops. I cannot take the standard main loop and introduce a couple of modifications in it; I have to copy it and modify the copy.

The event-based framework I sketch in my pull request #88 is a toolbox to construct new main loops and modify existing ones. I do not think the user should work at such a low level every time, but when he suddenly needs a non-standard modification, he should not have to copy the whole thing to make it.

rizar commented 9 years ago

In addition, having institutionalized callbacks (a.k.a. events) frees us from this rather involved "iterator of iterators" scheme. Data becomes a pure flat stream, which simplifies writing intermediate stages of data preparation. An epoch becomes just another event.

bartvm commented 9 years ago

The code-sharing problem would be solved by using a TrainExtension similar to Pylearn2, as @janchorowski suggested. I quite like this feature of Pylearn2, and many complicated use cases can be implemented with it while keeping the common cases simple. These extensions could be called at different places in the main training loop to perform these checks. So saving and monitoring would be something like:

class CustomExtension(TrainExtension):
    cur_sentence = 0
    cur_document = 0
    def before_train(self, data_stream, computation_graph, ...):  # Or just pass the Train object?
        if data_stream.sentence != self.cur_sentence:
            serialize(computation_graph)
            self.cur_sentence = data_stream.sentence
        # etc.

Train(..., extensions=[CustomExtension()]).main_loop()

rizar commented 9 years ago

Nice, I also like training extensions. But:

  1. You have to anticipate at development time all the possible places where one may want to insert something, like before_train. When I was hacking with PyLearn2 I suffered because I could not run a method of my extension in the exact place where I wanted it. So you need before_training_starts, before_data_fetched, after_iteration, after_training and so on: the extension's interface is going to be huge.
  2. Whatever you do not put in an extension will be impossible to switch off. Like this pickle-based serialization in PyLearn2: we have to live with it forever. So in fact everything is going to end up in an extension, and you end up again with an event-based framework (with a finite set of possible events though).
  3. What about the order in which extensions are called? It can be quite limiting to have it be the same for all handler methods.

bartvm commented 9 years ago

I don't think (1) is a big shortcoming. Whenever the need arises you add a single line to the core code (e.g. some helper function call_extensions('before_data_fetched')) and then you write an extension

class SimpleExtension(TrainingExtension):
    def before_data_fetched(self, *args):
        pass

If we think that it is a sensible place to interject code, I have no issue with submitting this extra line of code to the core library.

I don't fully understand (2), because it seems a truism: everything that we decide is core code and implement as a single class can't be "switched off". By the principle of modularity the user can choose not to use that class, but we can't make every line of code optional. I personally don't have a problem with keeping the core small and implementing a lot of things in extensions. It might just be a matter of terminology, but that doesn't sound like an event-based framework to me, because the communication still goes from the main loop to the different modules and not the other way around (i.e. the main loop calls the shots, it's not just responding to events fired all over the place).

If (3) turns out to be a problem, we could fix it by e.g. writing a decorator that gives a priority to each handler and (stably) sorting by these priorities (each handle having a default priority of 0).

class SomeExtension(TrainExtension):
    @priority(-999)  # Sets self.priority['before_data_fetch'] = -999
    def before_data_fetch(self, *args):
        pass

    @priority(999)  # Needs to be done first
    def before_train(self, *args):
        pass


def call_extensions(handle):
    for extension in sorted(extensions,
                            key=lambda extension: extension.priority.get(handle, 0)):
        getattr(extension, handle)(data_stream, computation_graph, ...)

rizar commented 9 years ago

About (2): this is not just a truism, because in the way I saw it and implemented it, every single line of code could be switched off (except the event loop itself, of course). I suggest you take a look at my pull request at some point, even though I do not plan to merge it for now.

So let's sum up what we are converging to: a very skinny main loop with slots for callbacks everywhere. The only thing it does itself is fetching a batch and passing it to the training algorithm; all the rest - logging (both in the new sense introduced by Jan and in the good old one), saving, monitoring, validation - goes to training extensions and thus becomes optional and replaceable. Data is provided by a data stream, which is a sequence of iterators over so-called epochs. Whenever some additional information is needed (e.g. whether the data stream can be pickled right away, or that the current sentence or document has changed), the user is expected to query the data stream explicitly. Please confirm that this is the same thing you have in mind.

I could live with it (at least temporarily, because we need to wrap a first version of blocks up). This main loop would be suitable only for semi-online optimization algorithms with data grouped into mini-batches, but this is all we do. Supporting SVMs and kmeans and whatever, like PyLearn2 claims to do, can be postponed.

bartvm commented 9 years ago

That sounds accurate. I had a look at your PR and I believe the two ideas are actually not too different. What you call "handlers" is similar to "extensions", just the way in which they are called is different: in your case a priority queue of events, in my suggestion a fixed sequence of callbacks throughout the main loop.

I'll focus first on getting the datasets and monitoring up and running. I should finally have some time the next few days to effectively merge my PR with @janchorowski's one in order to give us functional monitoring, and to get a first version of datasets working. I agree on needing to get a first version out; people are interested in using Blocks for ICML submissions, which are only a month away!

rizar commented 9 years ago

Agreed. The difference between handlers and extensions is that the former are just callbacks, whereas the latter are sets of callbacks grouped according to their purpose. Also, in my pull request all the logic is supposed to be distributed between handlers, whereas in the approach we converged to above some code is located directly in the main loop class. I am not sure we need extensions to group callbacks, but we can have them for a while to ease the transition from PyLearn2. I do not think we need a separate TerminationCriterion though: this can be another extension that just sets a stop flag.

If you focus on finishing datasets and monitoring, I can sketch the skinny main loop, although my time for this work is and will remain very intermittent. What do you think about splitting the workload this way?

Also, please tell me what you think about the TrainingLog interface I have (mainly inspired by what Jan called a "backend"). I like the idea of having the components interact with each other by making records in the shared log. The current row can also be printed in a human-readable form by an extension. Under the hood the whole thing can be implemented as a Pandas data frame that is expanded once in a while, or as a database table.
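Roughly, the interface I have in mind is something like this toy sketch (not the actual code from the pull request, and all names here are provisional):

from collections import defaultdict


class TrainingLog(object):
    """Toy time-indexed log: rows keyed by iteration, records keyed by name."""

    def __init__(self):
        self._rows = defaultdict(dict)
        self.current_time = 0

    @property
    def current_row(self):
        return self._rows[self.current_time]

    def __getitem__(self, time):
        return self._rows[time]

    def advance(self):
        self.current_time += 1


# One extension writes a record, another reads it from the same row.
log = TrainingLog()
log.current_row['validation_cost'] = 0.42
if log.current_row.get('validation_cost', float('inf')) < 0.5:
    log.current_row['saved_to'] = 'best_model.pkl'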

bartvm commented 9 years ago

Sounds good. I started work on merging my and @janchorowski's stuff already. Will hopefully have something functioning in the upcoming days.

Your TrainingLog interface looked good. There are some things I would argue about (like whether returning None instead of just raising an error is a good default setting), but in principle I like the idea of having an abstract interface with multiple backends. However, this is different from @janchorowski's approach, right? He suggested having both a key (channel name) and a context (e.g. which dataset the channel was monitored on). Pylearn2's approach would simply be to give them keys like cost_valid, cost_train. I'm not necessarily opposed to that idea if it keeps things nice and simple (the user can always do a simple startswith('cost_') if needed).

janchorowski commented 9 years ago

Hi,

it took me a while to understand @rizar's #88, and after sleeping over it I am starting to like how the set of possible events is not limited. My last worry about it is that something à la PL2's extensions is much easier to understand and nearly as flexible (you can always add some logic to the most frequently called callback). So #88 would impose a steeper learning curve on users.

My unfinished take (started a few days back, so slightly before this discussion) is at: https://github.com/janchorowski/blocks/tree/future_model Feel free to copy from it, trash it or ask for modifications. I changed the naming of aggregation functions.

After thinking how I would implement various extensions (namely: monitoring, early stopping, saving of best model, assertions, enforcing model constraints) I realized that:

  1. The callbacks have to be able to declare what information they need:

    • for the per-batch callback, we want to compute values during the fprop call.
    • for all other callbacks, we typically want to know statistics on auxiliary datasets (e.g. the performance on validation data to decide on stopping).

    For speed reasons the computation of those values should be performed once. Thus the callbacks need to somehow tell what they need.

    The logic to indicate what the callback wants to know has to be simple - I resorted to just declaring the theano variables one callback needs, then storing them in a double dictionary: context -> variable -> value. In this way we distinguish between e.g. cost on validation and test data, and don't rely on stuff that may not be unique (i.e. variable names). Name mangling as in PL2 is also possible.

  2. The callbacks have to be able to add updates to the fprop part and censor the parameter updates. In that way common tasks, such as parameter norm constraints, can be moved to a single extension. This is missing in PL2.
  3. The callbacks have to execute in order - thus in #88 the handlers would need to have their own priorities. Then we know that the standard callback that computes values of things runs first, the logging one runs last, etc.
  4. Similarly to #88, I think that the callbacks should communicate through a global data structure, such as a dict or log. I now think that a dict of values computed during this event is simple enough and sufficient - all historical information can be stored in the callbacks themselves. A log queryable by time is both more complex and still limited - you cannot ask for, e.g., the value of the validation cost at the last epoch, because you have to know when the last epoch was. So you either have to memorize it (and you might just as well memorize the validation cost itself), or, if we permit such complex queries, we may just as well assume an SQL schema and rely on the fact that sqlite is standard in Python.

rizar commented 9 years ago

1)

for the per-batch callback, we want to compute values during the fprop call.

I do not understand what you mean here. If you are speaking about the way to implement cheap per-batch monitoring, then I think the respective training extensions should have direct access to the training algorithm. Then they can add updates modifying shared variables of interest to the training algorithm and later push the values of these variables into the log.

algorithm = TrainingAlgorithm(...)
monitoring_extension = MonitoringExtension({'gradient_norm' : gradient_norm}, algorithm)
main_loop = MainLoop(data_stream, algorithm, extensions=[monitoring_extension, ...])

for all other callbacks, we typically want to know statistics on auxiliary datasets (e.g. the performance on validation data to decide on stopping).

For this the user can and should directly access the training log (see #88).

2) Adding updates should be done by accessing the algorithm directly. I do not like the idea of adding parameter norm constraints in training extensions: I thought we regularize at a stage before training.

4) I do not understand you. Can you say what's the problem with the log from #88?

bartvm commented 9 years ago

Just to confirm: So we want training extensions (and hence the main loop) to explicitly rely on the log. This does introduce a dependency between parts of the library, but since the log is basically just a glorified dictionary, it sounds okay to me.

In general, perhaps @rizar's suggestion of giving the training extensions direct access to e.g. the training algorithm might be easier (and is what I had in mind as well). So instead of a complicated system in which training extensions can specify what information they need, they just get access to the training algorithm, the main loop object, the log, the computation graph, etc. Any variable that the training extension needs to modify should then be exposed by a method or an attribute on one of those classes.

@janchorowski, regarding (3) you might want to have a look at point 3 in @rizar's comment (https://github.com/bartvm/blocks/issues/87#issuecomment-68874220) and my reaction (https://github.com/bartvm/blocks/issues/87#issuecomment-68878790)

rizar commented 9 years ago

Just to confirm: So we want training extensions (and hence the main loop) to explicitly rely on the log. This does introduce a dependency between parts of the library, but since the log is basically just a glorified dictionary, it sounds okay to me.

Yes, that's what I want. In PyLearn2 it was not common to read anything from monitoring channels and thus information was duplicated.

janchorowski commented 9 years ago

Let me answer in reverse order:

@janchorowski, regarding (3) you might want to have a look at point 3 in @rizar's comment (#87 (comment)) and my reaction (#87 (comment))

I agree with you. I think it is simpler to set priorities by extension, rather than by callback.

I do not understand you. Can you say what's the problem with the log from #88?

It is both too complex, i.e. it seems to be more than a dict, and too limited, i.e. to access any non-trivial historical information you need to keep information outside of the log. So a simple dict seems better.

The log in #88 gives access to old values by time. This assumes that extensions will somehow know the relevant times. E.g. to query the cost at the last epoch you need to store in the extension the time of the last epoch. So you may just as well store in the extension the cost at the last epoch, and never access historical information from the log. This makes the log simpler. If, on the other hand, you want to be able to answer queries such as "cost at last epoch" automatically, then perhaps storing the log in an SQL table will give the most flexibility for the added complication.

I do not understand what you mean here. If you are speaking about the way to implement cheap per-batch monitoring, then I think the respective training extensions should have direct access to the training algorithm. Then they can add updates modifying shared variables of interest to the training algorithm and later push the values of these variables into the log.

I don't like the idea of giving the extensions direct access to the main loop and the training algorithm, because then we have to specify their interfaces, or we end up in a situation like in PL2 where we will have exactly one training algorithm because some extension relied on it having a particular field, a single main loop etc.

If the extension says what it wants, and we essentially limit its impact on the training algorithm to "I want these things computed during the fprop" and "I want to see how you change the model", we limit the constraints on the main loop and the training algorithm.

This is not to say that I don't want algorithm-specific extensions, e.g. to set the learning rate. All I want are very generic extensions, that make no assumptions on the specifics of the training algorithm.

janchorowski commented 9 years ago

Yes, that's what I want. In PyLearn2 it was not common to read anything from monitoring channels and thus information was duplicated.

It was quite common. There are, e.g., MonitorBasedSaveBest or a few descendants of the MonitorBased termination criteria.

It shows that having such a log is a good idea, indeed.

rizar commented 9 years ago

The log in #88 gives access to old values by time. This assumes that extensions will somehow know the relevant times. E.g. to query the cost at the last epoch you need to store in the extension the time of the last epoch. So you may just as well store in the extension the cost at the last epoch, and never access historical information from the log. This makes the log simpler. If, on the other hand, you want to be able to answer queries such as "cost at last epoch" automatically, then perhaps storing the log in an SQL table will give the most flexibility for the added complication.

The historical information must be kept in the log, as this is the only storage of monitoring information. The question you pose is whether it should be accessed. The access pattern from your example is not hard to support: we just need to keep a list of epochs' last iterations. But in more complex cases we have to fall back on your scheme of remembering the necessary information in the extension.
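For instance (continuing the toy log sketch from my earlier comment, with hypothetical names), the whole mechanism could be as small as:

epoch_ends = []    # the epoch-end handler appends the current iteration here


def cost_at_last_epoch(log, epoch_ends, name='validation_cost'):
    # log[time] is assumed to return the row (a dict of records) for that iteration.
    if not epoch_ends:
        return None
    return log[epoch_ends[-1]].get(name)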

janchorowski commented 9 years ago

The access pattern from your example is not hard to support: we just need to keep a list of epochs' last iterations.

But this adds a special case. You either give the extensions only the current values and make them keep everything historical they want (we need to save their state somehow, but we'll have to do it anyway), or make the log the only thing that is saved, and make querying powerful.

rizar commented 9 years ago

I don't like the idea of giving the extensions direct access to the main loop and the training algorithm, because then we have to specify their interfaces, or we end up in a situation like in PL2 where we will have exactly one training algorithm because some extension relied on it having a particular field, a single main loop etc.

We just need to be careful with their interfaces. I don't like the idea of extensions having complicated languages of requests to be interpreted by the main loop.

rizar commented 9 years ago

But this adds a special case. You either give the extensions only the current values and make them keep everything historical they want (we need to save their state somehow, but we'll have to do it anyway), or make the log the only thing that is saved, and make querying powerful.

I think that remembering when the last epoch ended covers a lot of situations in which the log is needed and should be supported by default. I do not think that supporting an involved query language is necessary.

janchorowski commented 9 years ago

We just need to be careful with their interfaces. I don't like the idea of extensions having complicated languages of requests to be interpreted by the main loop.

The language does not need to be complicated - it boils down to three things: what updates it wants processed along with fprop, what changes it wants to make to parameter updates, and what values it needs computed. By default everything is computed on all possible data available (i.e. for per-batch callbacks on the current batch, for other callbacks on all auxiliary datasets given). I thought of more granularity in the specifications, but I am afraid of their complexity.
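As a sketch (the names here are invented; my branch does it somewhat differently), the whole "language" could fit into a base class like this:

class TrainingExtension(object):
    """What an extension declares, and the callbacks it receives (sketch)."""

    # Updates to be run alongside the fprop call on each minibatch.
    fprop_updates = []
    # Callables that take the parameter updates and return censored ones.
    update_censors = []
    # Theano variables whose values should be computed for the callbacks.
    required_variables = []

    def on_batch(self, values):
        """Called per minibatch with {variable: value} for required_variables."""

    def on_epoch(self, values):
        """Called at epoch boundaries with {context: {variable: value}}."""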

I was thinking of leaving only the updates, but in the conversation about monitoring @bartvm suggested it would be better if the needed values were asked for more explicitly.

Of course, you can move these to the training algorithm (e.g. add_updates, add_censor) and add an add_expression to the main loop (or better: a special extension that always runs first?) that computes the requested values. This approach is OK with me too.

rizar commented 9 years ago

A question for you guys: where should the number of iterations done be stored, in the main loop or in the log? In PyLearn2 epochs_seen was an attribute of the monitor. I hated this idea but now I hesitate, because it is very tempting to make the main loop stateless and push all the mutable information into the log. At least until the function pickling is solved by Theano developers it can be very helpful to have the state of the training process completely separate from the main loop.

janchorowski commented 9 years ago

With a more powerful log like you propose it probably should be in the log. A stateless main loop makes for easy saving and is tempting.

But there is more state than just the number of iterations and we have to store it somewhere too. We can have something like GH's state - a dict (or similar) that stores the current information about the training state (the main loop, the dataset, the trainer, and stateful extensions). This mainly saves disk space, as the old values need not be kept. Thus you could store heavy objects in it - like running sums for momentum/adadelta.

rizar commented 9 years ago

what updates it wants processed along with fprop, what changes it wants to make to parameter updates, and what values it needs computed

I understand why we need the first, I do not want to have the second, and I do not understand how the third is different from the first.

To clarify about monitoring: the monitoring extension can simply use an add_updates interface of the training algorithm to make it compute the shared variables of interest.

To clarify about censorship: that should be done by something like PyLearn2's LearningRule, i.e. a plugin to the training algorithm.

rizar commented 9 years ago

But there is more state than just the number of iterations and we have to store it somewhere too. We can have something like GH's state - a dict (or similar) that stores the current information about the training state (the main loop, the dataset, the trainer, and stateful extensions). This mainly saves disk space, as the old values need not be kept. Thus you could store heavy objects in it - like running sums for momentum/adadelta.

I was not proposing to collect all training state in a single object, that is too much. The data stream, the training algorithm, the extensions: they all potentially possess their own states. But it seems nice to deprive the main loop of a state of its own.

janchorowski commented 9 years ago

Having an extension tell what values it needs is more than just specifying updates, because it works both for minibatch callbacks (when the values are computed, internally via the updates mechanism, on the minibatch) and for other callbacks (when the values are aggregated on other data sets). The context is used to differentiate between the various values.

Also, having extensions tell what they want is cleaner, and Bart convinced me of this earlier (https://github.com/bartvm/blocks/pull/85#issuecomment-67972775).

Example: an extension wants to know the "cost"

I agree, that sometimes the extension doesn't use the information it requested (e.g. it may only care about the "validation_set cost"). Still, I quite liked that in PL2 all monitors were computed on all datasets and what we do here is very similar, so I think more granularity is not necessary.

So maybe a good compromise is this:

  1. Extensions tell what they want (i.e. the third), mainly because this is more explicit/clean. These values are computed by a special extension that speaks to the training algorithm.
  2. The trainer has an interface to add updates/censors. Then the special extension tasked with computing requirements for other ones speaks directly with the trainer.
janchorowski commented 9 years ago

But it seems nice to deprive the main loop of a state of its own.

Definitely!

bartvm commented 9 years ago

Personally I don't think the monitor should use an add_updates method of the training algorithm. A training algorithm IMHO is nothing more than a class which takes a cost and returns a set of updates and monitoring channels. It shouldn't be in charge of adding in updates provided by the monitoring channels; this introduces the kind of inter-library dependencies we were trying to avoid!

I agree with @rizar on not wanting a complicated querying system for training extensions. Especially if we want to allow for multiple logging backends and monitoring non-numerical values (I'm definitely not going to deal with the horror of storing serialized Python objects in databases.)

rizar commented 9 years ago

Personally I don't think the monitor should use an add_updates method of the training algorithm. A training algorithm IMHO is nothing more than a class which takes a cost and returns a set of updates and monitoring channels. It shouldn't be in charge of adding in updates provided by the monitoring channels; this introduces the kind of inter-library dependencies we were trying to avoid!

We have quite different understandings of what a training algorithm is. For me it is just a black box that takes a bunch of data and does something useful. If this is done by means of running a Theano computation, such a training algorithm can provide methods to add hooks to this computation (an add_updates function). An extension can use this hook to compute faster what it wants (e.g. the gradient norm). I want to keep the actual training logic hidden in the training algorithm and not moved up into the main loop.
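For instance (a sketch with invented names, not actual blocks code), a gradient descent algorithm could expose the hook like this, and a monitoring extension could use it to track the gradient norm through a shared variable:

import numpy
import theano
import theano.tensor as tt


class GradientDescent(object):
    """Black-box trainer that owns its Theano function (sketch)."""

    def __init__(self, cost, params, inputs, learning_rate=0.1):
        grads = tt.grad(cost, params)
        self.gradient_norm_expr = tt.sqrt(sum((g ** 2).sum() for g in grads))
        self._updates = [(p, p - learning_rate * g) for p, g in zip(params, grads)]
        self._inputs = inputs
        self._function = None

    def add_updates(self, updates):
        # Hook for extensions: extra updates run in the same Theano call,
        # computed on the old values of the parameters.
        assert self._function is None, 'add updates before the first batch'
        self._updates.extend(updates)

    def process_batch(self, *batch):
        if self._function is None:
            self._function = theano.function(self._inputs, [], updates=self._updates)
        self._function(*batch)


# A monitoring extension asks for the gradient norm and reads it off afterwards.
x = tt.dvector('x')
w = theano.shared(numpy.ones(3), name='w')
algorithm = GradientDescent(((x - w) ** 2).sum(), [w], [x])
gradient_norm = theano.shared(0.0, name='gradient_norm')
algorithm.add_updates([(gradient_norm, algorithm.gradient_norm_expr)])
algorithm.process_batch(numpy.zeros(3))
print(gradient_norm.get_value())  # the value an extension would push into the log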

rizar commented 9 years ago

The dependency scheme I am leaning to is the following: a main loop has a log, a model, a data stream and a trainer. The data stream and the model are fully isolated, in fact model will only be used by extensions. The trainer and the extensions are allowed to access the log. The extensions are also allowed to ask the trainer for a sort of favor. That's it.

janchorowski commented 9 years ago

I agree with @rizar on not wanting a complicated querying system for training extensions. Especially if we want to allow for multiple logging backends and monitoring non-numerical values (I'm definitely not going to deal with the horror of storing serialized Python objects in databases.)

Sorry, but I am confused - are you speaking about querying the log, or about querying what the extension needs? If you speak about querying what the extension needs, then what I did in my branch is quite simple: the extension gives a single list of theano variables it wants computed before any of its callbacks are run. These values can be anything.

Now the monitor is just a training extension - it asks that some variables be computed, it puts them into the log. It doesn't log everything that was computed for other extensions. Since extensions always request theano variables, if many extensions want the same thing, theano will compute it just once.

bartvm commented 9 years ago

I agree with that. What I understood from what you said, "the monitoring extension can simply use an add_updates interface of the training algorithm", is that the monitoring extension would e.g. ask the training algorithm to add the updates needed for e.g. the aggregation of monitoring channels to its updates. That I would disagree with.

We talked before about the "star-shaped dependency graph", which I think means that it's the main loop who asks the monitor what updates it needs performed, and it will ask the same thing to the training algorithm. It then combines these updates, compiles and executes the Theano function, and feeds the results back to the components that need it (e.g. the log by calling add_record).

bartvm commented 9 years ago

Note, that comment was in response to @rizar. In my reply to @janchorowski I was referring to querying the log; my bad. I agree that it would be nice for the monitor to just be an extension.

bartvm commented 9 years ago

@rizar I think I agree with your dependency scheme, although I am not sure about the "asking the training algorithm for a favour"-part. What kind of favour? I agree that it might be a good idea to let extensions modify the behaviour of the training algorithm directly, but they shouldn't be asking the training algorithm to add updates which have nothing to do with the training itself (see my previous https://github.com/bartvm/blocks/issues/87#issuecomment-69007775).

janchorowski commented 9 years ago

I think means that it's the main loop who asks the monitor what updates it needs performed, and it will ask the same thing to the training algorithm. It then combines these updates, compiles and executes the Theano function, and feeds the results back to the components that need it (e.g. the log by calling add_record).

This doesn't work with how training algos are done in GroundHog - they run a theano function to compute the gradients, then change the parameters. It is for debugging (if you get NaNs or similar, you hit a breakpoint in the python code in between and you can run diagnostics).

So I guess the trainer should take a list of updates to run along the gradient computation call and a list of censors to modify the updates to the parameters. This can happen in a single theano call (as in PL2) or in two (as in GH).

rizar commented 9 years ago

@bartvm, we did talk about a star-shaped dependency pattern, but we should be reasonable in trying to reach this goal. I think that making the main loop combine updates, compile and execute the Theano function is too high a cost. This is a big loss in terms of generality: what if my training algorithm involves some fragments of non-Theano code? I think we can compromise the independence a bit and permit extensions to access the training algorithm.

bartvm commented 9 years ago

Okay then. Do you think that we can standardize this interface over all training algorithms? An add_updates method might not make much sense if the training algorithm in question performs some sort of complicated procedure involving calls to multiple Theano functions. Which of these functions should then perform the updates? The other alternative is to make the interface different for different algorithms, and training extensions will be compatible with some training algorithms (but not necessarily all).

rizar commented 9 years ago

Yes, I agree that different training algorithms should have different interfaces and not all training algorithms will be compatible with all extensions. Specifically in the case of adding updates I think the convention should be the following: an algorithm can provide an add_updates method if there are Theano computations involved. When deciding to which function to add the updates the training algorithm must ensure that they are computed on the old values of parameters and that they are computed once, otherwise it's free to choose from considerations of saving computation.

rizar commented 9 years ago

@janchorowski, regarding the requests: I must say I am lost and I will just describe how I see it done. Most often there will be only two extensions writing to the log: one that commits values computed by the training algorithm and one that runs validation on auxiliary datasets. Those produce data, and all the rest can consume it: do early-stopping, adjust learning rate or whatever. I do not see any need to request anything from the main loop in the picture above.

And once again about censorship: I believe it should be done by training algorithm plugins (i.e. learning rules), not by main loop extensions.

janchorowski commented 9 years ago

OK, I feel we have reached a conclusion.

I agree with @rizar, that training algorithms should expose an add_updates method with the proposed semantics.

The disagreement with @bartvm seems to stem from different views of what a trainer does. I tend to side with @rizar on this. Then the trainer has to accept doing some non-training-related computations for the extensions. And I would not assume that the trainer does just one theano function call.

While not a huge concern, I prefer that extensions say what they want computed before their callbacks are run, rather than having something that writes to a log and later they retrieve it from the log.

How I see it implemented: before starting the main loop, you scan all extensions for their requirements. Then you instantiate a special extension to be run before all others. This can be done in an auxiliary routine, which gives the final, ordered list of extensions to the main loop. This special extension:

  1. speaks to the trainer to add updates that compute requested theano variables for each minibatch
  2. knows about auxiliary datasets. For any non-minibatch related callbacks computes and aggregates all requested theano variables on all provided datasets. It then stores the values in a dict of dicts, first by dataset (context), then by theano variable.

All extensions then get this dict of dicts. They access what they want from it. They never need to know about other extensions. The values need not be serializable, small, etc.
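A sketch of how I picture the special extension (names invented; plain averaging stands in for the aggregation schemes):

import theano


class ComputeRequestedValues(object):
    """Runs before all other extensions and evaluates what they asked for."""

    def __init__(self, inputs, extensions, datasets):
        # Deduplicate requests so each Theano variable is computed only once.
        self.requests = []
        for extension in extensions:
            for variable in extension.required_variables:
                if variable not in self.requests:
                    self.requests.append(variable)
        self.function = theano.function(inputs, self.requests)
        self.datasets = datasets  # e.g. {'valid': valid_batches, 'test': test_batches}

    def compute(self):
        # Returns context -> variable -> value, averaged over the batches.
        values = {}
        for context, batches in self.datasets.items():
            totals, count = None, 0
            for batch in batches:
                outputs = self.function(*batch)
                totals = outputs if totals is None else [t + o for t, o in zip(totals, outputs)]
                count += 1
            values[context] = dict(zip(self.requests, [t / count for t in totals]))
        return values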

Examples:

  1. Monitoring extension - asks for the values of the monitors. These are all simple numerical values, so it can store them in the log, a database, etc.
  2. Early stopper - asks for the expression for the cost/misclassifications, then uses it to raise a stop flag.
  3. Assertion-checking extension - asks for expressions, then checks whether they are true/nonzero. Raises if not. Doesn't store anything in the log.

What we gain: 1. atomicity of extensions - I don't need to guess if the values I need will be computed by the monitor. 2. Separation between monitoring and other extensions. 3. The log is not polluted with non-monitoring stuff.

The dependencies are as follows:

  1. Typical extension knows only about the expressions it wants.
  2. The special extension that computes values knows about the trainer (to add updates), the datasets and the aggregators.
  3. The early stopping extensions know about the cost expression and the main loop (needs to flag it to stop).
  4. The monitor knows about the log and about the theano expression it needs.
  5. The trainer knows that it has to add additional updates.
  6. The main loop knows that it has to run extension callbacks and the trainer.
bartvm commented 9 years ago

Having a single "special extension" that talks to the training algorithm assumes again that there is a common interface possible, which I'm not sure is the best road to go down. Also, with "special extension" you really mean the "monitor extension", right? Ideally the monitoring extension should be optional like all others.

Could you clarify how you imagine an extension should "declare" what it requires? More advanced extensions might need more complex logic to determine which variables to use. It might be easier if they just have access to the model/CG in that case.

janchorowski commented 9 years ago

No, the special extension is not the monitor. The monitor requests some values and stores them to a log, prints them etc. The special extension computes the value of requested theano variables before other extensions are run. This is a very common use case. You are correct, the special extension depends on the training algorithm, or rather on the ability of the training algorithm to run some auxiliary updates during fprop on minibatches.

I guess this is the case for all stochastic descent algorithms. For other algorithms, like L-BFGS, even the notion of a per-minibatch callback makes little sense - they always process the whole data. So in this case there are no updates to be run and no per-batch callbacks to be run. Only per-epoch callbacks make sense.

You can have no monitoring, but still do early stopping. The special extension is there just to relieve the main loop/trainer from computing stuff. And to be consistent - it computes things on minibatches and on full datasets. Please see the ComputeExpressoins class in https://github.com/janchorowski/blocks/blob/future_model/blocks/training/__init__.py

Could you clarify how you imagine an extension should "declare" what it requires?

The extension provides a list of theano expressions. Yes, it is very/too simple. It covers all of the use cases I came up with and found in PL2 - validation, monitoring, assertions, early stopping, etc. It will sometimes do too much work, because we really want to compute expressions on data, and here we compute them on all data available. I don't think this will be a huge problem - PL2 happily lives with that (all channels are computed on all datasets).

If an extension wants to do more - it's OK. But factoring out the evaluation of specific values is so common it makes sense to have it in one place.

bartvm commented 9 years ago

Closed via #92