mila-iqia / blocks

A Theano framework for building and training neural networks

Check whether all bricks were initialized #347

Open bartvm opened 9 years ago

bartvm commented 9 years ago

Following the discussion at #337, it would be a good idea to add another check to see whether weights were initialized (besides #244). Namely, in the main loop we can simply walk the graph, retrieve the bricks, and check whether all of them were initialized. If that's not the case, we can raise a warning (just as is currently done when e.g. parameters don't match between the model and the algorithm).
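As a rough sketch of what that check could look like (plain-Python stand-ins for bricks and annotated variables, not the real Blocks API):

```python
import warnings

class Brick(object):
    """Stand-in for a Blocks brick with an `initialized` flag."""
    def __init__(self, name, initialized=False):
        self.name = name
        self.initialized = initialized

class Variable(object):
    """Stand-in for an annotated Theano variable."""
    def __init__(self, annotations=()):
        self.annotations = list(annotations)

def check_initialization(variables):
    """Warn about, and return, bricks that were never initialized."""
    bricks = {annotation for variable in variables
              for annotation in variable.annotations
              if isinstance(annotation, Brick)}
    uninitialized = [brick for brick in bricks if not brick.initialized]
    for brick in uninitialized:
        warnings.warn("Brick %s was not initialized" % brick.name)
    return uninitialized

# Usage: one brick was initialized, the other was forgotten.
linear = Brick('linear', initialized=True)
softmax = Brick('softmax')
graph = [Variable([linear]), Variable([softmax])]
missing = check_initialization(graph)
print([brick.name for brick in missing])  # -> ['softmax']
```

The main loop would run this walk once before training starts and emit one warning per uninitialized brick.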

rizar commented 9 years ago

That's really easy to do with a model, but you want it to be as optional as possible.

bartvm commented 9 years ago

Mm, the thing is that although I see the usefulness of Model, for things like this you shouldn't need to pass a Model instance, because all of this works perfectly well by just walking the computation graph (using e.g. ComputationGraph) without making assumptions (such as there being only a single cost).

(I'm also not a fan of how Model is really just a ComputationGraph with a few assumptions built in and some extra features, yet the two are used entirely differently, but I haven't thought of a better alternative.)

rizar commented 9 years ago

But how do you want to get this computation graph? algorithm.cost?

bartvm commented 9 years ago

Well, that's why I said I didn't have a better solution :) because I don't want to rely on algorithm having a particular interface either. But my thinking generally favors not passing a Model class, and instead somehow passing an annotated Theano graph, e.g. something like cg=ComputationGraph(cost) instead of model=Model(cost), and moving get_parameters from Model to something like ComputationGraph.get_parameters(brick_naming=True).

One of the reasons is that I've had several people be really confused about the Model class, asking me whether they needed to subclass it, etc. When I tell them it really just provides a way to rename parameters and to tell extensions what the cost is, and that you could just as well pass None in many cases, they're even more confused. All in all, I think the Model abstraction is basically just ComputationGraph plus the assumption that there are bricks and a single objective, so I wonder whether we couldn't just use ComputationGraph, an OBJECTIVE variable role, and a get_parameters(brick_naming=True) method. If so, that's one very confusing abstraction less, and simplification is always a good thing. Then again, as I said, I haven't thought this through too much (assignment due in a few hours!).
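To illustrate what get_parameters(brick_naming=True) might mean, here is a self-contained toy (the stand-in Brick class and the exact naming scheme are illustrative, not the real Blocks implementation):

```python
class Brick(object):
    """Stand-in brick: a name, child bricks, and parameter names."""
    def __init__(self, name, children=(), parameters=()):
        self.name = name
        self.children = list(children)
        self.parameters = list(parameters)

def get_parameters(top_bricks, brick_naming=True):
    """Map (optionally hierarchical) names to parameters."""
    result = {}

    def visit(brick, prefix):
        # Build a path like /mlp/linear_0 from the brick hierarchy.
        path = prefix + '/' + brick.name
        for parameter in brick.parameters:
            key = path + '.' + parameter if brick_naming else parameter
            result[key] = parameter
        for child in brick.children:
            visit(child, path)

    for top in top_bricks:
        visit(top, '')
    return result

mlp = Brick('mlp', children=[Brick('linear_0', parameters=['W', 'b'])])
print(sorted(get_parameters([mlp])))
# -> ['/mlp/linear_0.W', '/mlp/linear_0.b']
```

With brick_naming=False the same call would return the bare parameter names, which shows why the brick-aware naming is useful: it disambiguates identically named parameters in different bricks.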

rizar commented 9 years ago

Okay, it is not nice that our understandings of what Model is differ. Let me try to explain once more why I came to think that we need such a thing.

Terminology

When I say "model" (lower-cased), I mean everything that the user created in order to solve their task: a complicated structure made from annotated Theano variables, application calls, applications, and bricks. I reserve the title-cased "Model" for the concrete component I propose to add.

Motivation

Extensions need access to the model (note, lower-cased). The sanity check we are discussing here is, in fact, an extension, and this extension needs at least a list of parameters. A better implementation will also need access to the bricks, because a report that a parameter is not initialized is much more helpful when it also provides the path from a top brick to that parameter.

Another example: consider an extension that prints interactively in a browser window the tree of your bricks and parameters and uses a color scheme to show average magnitudes of the parameters. It needs access to bricks and parameters as well.

One can even imagine an extension whose on_error callback identifies the gradient with respect to which annotated variable became NaN first. Here you need access to the computation graph with its application calls and bricks.

To sum up, Blocks, with its rich introspection techniques and modularity, can visualize deep learning and make it transparent. For this, extensions need convenient access to everything the user defined, and the question is who should provide it.

to be continued...

rizar commented 9 years ago

Why not Computation Graph

I guess both of us, and our new users as well, thought: "why not replace the model argument with a cg argument and give extensions access to the computation graph?" That is, MainLoop(cg=ComputationGraph(cost), algorithm=algorithm, log=log, ...). I have several reasons why I do not like it.

  1. "Do one thing and do it well". Currently ComputationGraph knows nothing about bricks, and it makes it much simpler. It manages a computation graph with some abstract annotations. I would like to keep it this way.

    This is a widespread design: adding functionality in a few steps. Consider the way Ubuntu manages packages: first comes dpkg and then apt. For the former, dependencies are just annotations; it can only install an already-downloaded package. The latter takes care of calling dpkg the right way, handles dependencies automatically, and works with a package repository. Similarly, the computation graph only stores annotations; something on top of it should actually interpret them.

  2. No chance to interfere. If ComputationGraph supported a get_bricks method, it would have to always return all the bricks it extracted by graph traversal. It would be very weird to give ComputationGraph an optional bricks argument that overrides this logic. The same goes for application calls and parameters, although for the latter we have roles to deal with that.

Why not Set of Functions

One could distribute all this functionality across a set of functions like get_bricks, get_application_calls, etc. Apart from the fact that I do not find it a beautiful solution, issue (2) from the list above applies all the more.

to be continued ...

rizar commented 9 years ago

Model

For the reasons outlined above, I think we need another entity that will answer all the questions extensions might have about the model. I find it natural to call it Model. In addition, a user who wants to play with a saved model typically has needs similar to those of the extensions, so the second purpose of Model is to be unpickled and explored.

In the 99% frequent case, Model is "computation graph + selector + set of keyword arguments to override whatever the first two return". Perhaps I was wrong when I tried to factor out a generic AbstractModel interface: it is too early for that, and it is not clear if Blocks will ever be used to host a non-Theano algorithm. But apart from that, I think Model is a really nice thing to have.
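That formula can be made concrete with a tiny stand-in (hypothetical names, not the real Blocks Model, which operates on a Theano graph): every accessor falls back to what was extracted from the graph unless the user supplied an override.

```python
# Toy sketch of "computation graph + selector + keyword overrides".
class Model(object):
    def __init__(self, graph_parameters, graph_bricks, **overrides):
        # Defaults, as a graph traversal / selector would extract them.
        self._defaults = {'parameters': graph_parameters,
                          'bricks': graph_bricks}
        self._overrides = overrides

    def get(self, what):
        # A keyword argument overrides whatever the traversal found.
        return self._overrides.get(what, self._defaults[what])

model = Model(['W', 'b'], ['mlp'], bricks=['mlp', 'convnet'])
print(model.get('parameters'))  # -> ['W', 'b']
print(model.get('bricks'))      # -> ['mlp', 'convnet']
```

The point of the sketch is only the lookup order: user-supplied overrides win, graph-derived defaults otherwise.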

So I would propose a few things

bartvm commented 9 years ago

When I say "model" (lower-cased), I mean everything that the user created in order to solve their task: a complicated structure made from annotated Theano variables, application calls, applications, and bricks.

Nearly the exact same description applies to the computation graph: a collection of Theano variables, application calls, applications, and bricks (all of which are annotations read by the computation graph). The only difference is that the computation graph doesn't distinguish between the kinds of annotations, while Model does.

"Do one thing and do it well"

The original quote is: "Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features." This is exactly what the Model class isn't doing; it's adding new features to the ComputationGraph (which it inherits from), but instead of specializing (i.e. subclassing), it mixes it in with a bunch of new features.

ComputationGraph knows nothing about bricks, and it makes it much simpler

Agreed. However, this doesn't stop us from introducing an extended, brick-aware ComputationGraph subclass instead of a new abstraction.

No chance to interfere [...]

I don't understand this paragraph. I don't understand why ComputationGraph would need a bricks argument, nor why Model would. Aren't both Model and ComputationGraph constructed from walking a graph defined by Theano output variables?

Why not Set of Functions

Completely agree.

In the 99% frequent-case Model is "computation graph + selector + set of keyword arguments to override whatever the first two return". [...] it is not clear if Blocks will ever be used to host a non-Theano algorithm.

Blocks is a Theano toolkit. Trying to abstract Theano away, pretending that you could wrap anything else, is beyond the scope of the framework IMHO, and is actually one of the things that made Pylearn2 so much more complex.

Accepting that the model is an annotated Theano graph (which is how you described it in the first place) shows that introducing the Model abstraction is awkward. Most of the logic is already in ComputationGraph. As I said, we are really just looking for a brick-aware computation graph, which is different from introducing a new abstraction altogether.

making Model mandatory to make things simple and to make nice sanity check extensions

I have no real opposition to making a Model-like object compulsory. But, more than ever, that means it should introduce no restrictions, e.g. anything that works for a ComputationGraph must work for a Model--or BrickComputationGraph, which is how I would then like to think of it.

Let's ignore the term Model for now, which I find confusing because it suggests a level of abstraction that I would like to avoid, and I think it clutters the discussion. What I think we should introduce is a BrickComputationGraph subclass which implements an awareness of Brick, Application, and ApplicationCall (which is effectively what Model has done, while pretending not to be a ComputationGraph).

In practice, it changes very little. It has almost exactly the same functionality, but without introducing a new abstraction, and it is more in the spirit of the "do one thing well" philosophy (instead of inheriting from both ComputationGraph and AbstractModel, mixing two feature sets). If a user doesn't use extensions that require brick-awareness, they can just use a ComputationGraph. This means that ComputationGraph and BrickComputationGraph should be interchangeable in many cases--which, in fact, Model and ComputationGraph are right now as well; both return you the parameters, both could return you the objective variables, etc. The incentive for using BrickComputationGraph is simply that some extensions might need the brick-awareness to do their job, e.g. like Dump does right now to get parameters named using the brick hierarchy.

I don't think the resulting workflow would be a bad one. You've been suggesting things like [brick.initialize() for brick in model.get_top_bricks()], which I think are reasonable ideas. However, I find it awkward if this kind of logic is mixed in with ComputationGraph logic. You'd be dealing with two completely different abstractions one line after the other, while the distinction often seems arbitrary to users. Do I use model.get_parameters() or cg.params? Do I use VariableFilter(roles=[OBJECTIVE])(cg.variables) (which probably needs a shortcut like cg.objectives) or do I use model.get_objectives()? Making one a specialization of the other removes this arbitrariness.

rizar commented 9 years ago

No chance to interfere [...]

I don't understand this paragraph.

What I mean is that I want to be able to override the default results of bricks and parameter extraction. A real example: suppose I am doing caption generation experiments with a pretrained convnet as a feature extractor. At training time I use the convnet features simply as input, and for this reason there is no reference to the convolutional bricks in the computation graph. But at test time I will need them to work with new images. It would be sad to find at this point, that these bricks were not saved. Also these bricks would not be accessible to extensions during training.

So I would like to do like this:

model = Model(cost)
model.bricks += convolutional_bricks

and have all my extensions use these new bricks I pushed into the model.

I also tried to argue that such a feature would look awkward in a hypothetical BrickComputationGraph. In addition, this name is too long and puts all the emphasis on the computation graph, whereas for me the forest of bricks is an equally important component.

I understand your motivation to keep things simple and to keep the number of abstractions small. So here goes my proposal: let's get rid of AbstractModel but keep Model, rebranding it like this:

class Model(object):
    """Combines a computation graph and optionally a brick hierarchy.

    bla bla bla
    """
    pass

If you have a name in mind that would be less dangerous than Model and less verbose than BrickComputationGraph, I am open to suggestions.

bartvm commented 9 years ago

I'm not a fan of the name BrickComputationGraph either; I just coined it to highlight how I think Model should be interpreted: an optional specialization of the computation graph that extends functionality through an awareness of bricks. I am fine with the name Model, but defined as a full child class of ComputationGraph as I proposed, whose added features are optional.

The example you give seems to just be the ability to provide extensions with information beyond what is normally provided by the main loop. However, I don't see why this would benefit from a Model abstraction that allows us to override attributes to the point where all semantics are lost. The bricks attribute in your example now represents the bricks of the model, plus a random collection of bricks that only makes sense to extensions that know specifically how to deal with them. The rest of the extensions will just be utterly confused, because they could (quite rightly) assume that the bricks given to them by the "model" are part of the model being optimized.

Your convnet is pre-trained, so most likely it's already saved elsewhere. If not, why not save it separately? Or save it as part of an extension? Or just load it manually?

y = tensor.lvector('image_classes')
image_input = tensor.tensor4('images')

# Construct model
convnet = ...
y_hat = convnet.apply(image_input)
cost = CategoricalCrossEntropy().apply(y, y_hat)

# Train convnet model here, can checkpoint/dump here

cg = ComputationGraph(cost)
features, = VariableFilter(roles=[OUTPUT], bricks=[convnet.layers[4]])(cg.variables)

# Create features for caption generation here and save them to disk

# Save model for easy access if you need to, includes bricks through annotations
pickle.dump([image_input, features], open('convnet.pkl', 'wb'))

Then train your caption generation model:

feature_input = tensor.matrix('features')
cost = my_model.apply(feature_input)

# If you want an extension to have access to the pre-trained convnet directly
# why not just load it and pass it directly?
pretrained_convnet = pickle.load(open('convnet.pkl', 'rb'))
# Or if you want to load it from the main loop instead
pretrained_convnet = pickle.load(open('convnet_checkpoint.pkl', 'rb')).model  # or .brick_computation_graph
extensions = [CustomExtension(pretrained_convnet)]
# Now CustomExtension can do whatever it wants with the pretrained convnet
# Finetuning

image_input, features = pickle.load(open('convnet.pkl', 'rb'))

# Alternatively, use the checkpointed "model"
# (which is a computation graph with extra options such as top_bricks)
pretrained_convnet = pickle.load(open('convnet_checkpoint.pkl', 'rb')).model
image_input, = pretrained_convnet.inputs
features, = VariableFilter(roles=[OUTPUT], bricks=[pretrained_convnet.top_bricks[0].layers[4]])(pretrained_convnet.variables)

# Load caption generation model and define new final cost to finetune

To summarize, here is my current understanding of the things you are proposing a Model abstraction would solve:

rizar commented 9 years ago

The example you give seems to just be the ability to provide extensions with information beyond what is normally provided by the main loop.

The main loop provides no information to the extensions at all. The question of this discussion is how to give extensions such access. And my argument is that I want full control over this access. I do not want to hardcode the assumption that extensions should work only with the parameters and bricks extracted by Theano graph traversal. This comes at a very low cost, but makes me calm about the future.

bartvm commented 9 years ago

Like I said, I think that is a valid use case, but there are millions of things an extension could need access to. The question then is: should there be a central object that all of these extensions use, or should the extensions that need some particular value just take that object as an argument?

My two objections to your proposal of doing it through a shared object are (a) the impossibility of providing a robust interface to an object that can provide anything to any extension; and (b) the fact that this functionality shouldn't be mashed in with an object that extends the features of the computation graph by interpreting the brick hierarchy.

I'm not convinced that there are cases where extensions need a particular piece of information and can't simply be provided with it through their arguments. This is what we do all the time, after all: the algorithm receives the cost and parameters as arguments, the monitoring extensions get passed explicitly which variables and data stream they should use, and if you have an extension that uses an entire convnet, then more than ever it should be explicitly passed as an argument instead of being made part of some global all-knowing object that needs to know how to cater to the needs of every extension out there.

rizar commented 9 years ago

I understand your concern that we cannot build a single interface that would cater to every extension one can imagine. When an extension needs something peculiar, the recommended way to provide that something is through its own argument. Or a person can have their own Blocks fork where all extensions need this peculiar entity and a PeculiarModel, a subclass of Model, provides this entity to all the extensions.

But neither do I think that basic things such as the bricks, the application calls, or the cost variable (if present) should have to be given explicitly every time. There should be reasonable defaults.

So my answer to (a) is: let's select a number of key requests an extension might make and support only those. For instance cost (optionally, only if there is a single minimized Theano cost), parameters, bricks. We can move the current naming logic from Model to blocks/dump.py, as there are no other consumers so far.

My answer to (b): technically you could have both a BrickComputationGraph, which interprets the annotations of a ComputationGraph as a brick hierarchy, and a Model, which would do almost nothing, serving only as a place for substitutions, but I guess that is overkill. All that we have in Blocks is the annotated Theano graph + bricks, which I think justifies the name Model for the component that encapsulates both of these things.

bartvm commented 9 years ago

Regarding (a), I completely agree that there should be some default things that can be requested, but I am arguing that a BrickComputationGraph-type object could provide all of these. It could use the bricks to provide unique names for parameters, and returning the cost is just a matter of returning variables tagged with OBJECTIVE. Both of these are well within the mandate of BrickComputationGraph. Extensions that expect there to be one cost would just do objective, = brick_computation_graph.objectives and would break down in the scenario where there are multiple costs, without hardcoding the single-cost requirement in BrickComputationGraph.
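The unpacking pattern can be shown end to end with plain stand-ins (toy class, not Blocks code): the one-cost assumption lives in the extension's tuple unpacking, not in the graph class.

```python
# Toy stand-in: a graph object that simply exposes its objectives;
# no single-cost restriction is built into the class itself.
class BrickComputationGraph(object):
    def __init__(self, objectives):
        self.objectives = list(objectives)

single = BrickComputationGraph(['cost'])
objective, = single.objectives  # fine: exactly one objective
print(objective)                # -> cost

multi = BrickComputationGraph(['cost_a', 'cost_b'])
try:
    # The extension's own unpacking fails, not the graph class.
    objective, = multi.objectives
except ValueError as error:
    print('extension rejects multiple costs:', error)
```

An extension that can handle several objectives simply iterates over the list instead of unpacking it.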

For (b), it is still unclear to me what a Model should provide that a BrickComputationGraph should not, in your distinction. I don't think there is a need for the single-cost assumption (as I said above), and I am wondering what else remains that a BrickComputationGraph couldn't provide. (I am also unclear about what you mean by "place for substitutions".)

rizar commented 9 years ago

(a) I agree that it could provide these things. The only thing I miss is the ability to tell it to ignore its findings and use the values I provide instead of them. This is exactly what I call the place for substitutions, which also answers your (b).

Another example of why such substitutions might be necessary: there is no single "true" way of selecting parameters. cg.parameters and cg.shared_variables seem to cover all cases, but which one should we use as the default? And there is more than one extension that would like to know what the parameters are: Dump and DefaultMonitoring at least. I think they should request them from the model. That's why we need a keyword parameters argument for the model that tells it what the actual parameters are. Such an argument is slightly out of scope for BrickComputationGraph's semantics.

bartvm commented 9 years ago

Okay, that's a use case I understand. It still seems like overkill though to start an entirely new class that lies outside of the ComputationGraph hierarchy. Especially because I would like ComputationGraph and Model to be interchangeable in many cases, and I don't want to duplicate features. To me, it seems that what we really want is a mutable BrickComputationGraph, i.e. one whose attributes can not only be read but also set. That is, a hierarchy that would go like

ComputationGraph > BrickComputationGraph > MutableBrickComputationGraph (= Model)

So there is ComputationGraph.get_parameters(), there is e.g. ComputationGraph.get_parameters(unique_names=False), and there could be Model.get_parameters(unique_names) which also comes with Model.set_parameters().

As an extension that doesn't care, I will just call model.get_parameters(), and even if the user passed a ComputationGraph instance, everything works just fine. It's only when extensions start using the advanced arguments and methods provided by the subclasses that an error would occur e.g. Dump would crash when passed a ComputationGraph, but would be happy with a BrickComputationGraph. If the user wants to override the parameters, objectives, etc. he/she would need to use Model.

Extensions could easily check whether the model given to them is a BrickComputationGraph (and not ComputationGraph), and raise an error saying something like "This extension relies on the advanced features provided by brick annotations. It cannot be used on raw Theano graphs." Meanwhile, most extensions will happily work on a ComputationGraph, as long as they can call get_parameters.
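A minimal sketch of that isinstance check (toy classes; the error message mirrors the discussion, but this is not the actual Blocks code):

```python
# Toy hierarchy: a brick-aware graph is a specialization of the plain one.
class ComputationGraph(object):
    def get_parameters(self):
        return ['W', 'b']

class BrickComputationGraph(ComputationGraph):
    def get_top_bricks(self):
        return ['mlp']

class DumpExtension(object):
    """Hypothetical extension that needs brick annotations."""
    def __init__(self, model):
        if not isinstance(model, BrickComputationGraph):
            raise TypeError(
                'This extension relies on the advanced features provided '
                'by brick annotations. It cannot be used on raw Theano '
                'graphs.')
        self.model = model

DumpExtension(BrickComputationGraph())  # accepted: bricks are available
try:
    DumpExtension(ComputationGraph())   # rejected: no brick awareness
except TypeError as error:
    print('rejected:', error)
```

Extensions that only need get_parameters never perform the check, so they keep working on a plain ComputationGraph.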

The major advantage in my eyes is that we (a) share a maximum amount of logic between Model and ComputationGraph, because the former is really just a mutable, slightly cleverer version of the latter; and (b) we get rid of another abstraction.

rizar commented 9 years ago

It still seems like overkill though to start an entirely new class that lies outside of the ComputationGraph hierarchy

That makes some sense, and Model is already a subclass of ComputationGraph. But below I have an argument against it.

there is e.g. ComputationGraph.get_parameters(unique_names=False)

I believe you meant BrickComputationGraph?

As an extension that doesn't care, I will just call model.get_parameters()

Do you want to add a get_parameters method to ComputationGraph? That would conflict with the current parameters property, let alone the fact that its interface consists almost entirely of properties.

(b) we get rid of another abstraction.

I do not see where we get rid of an abstraction, because MutableComputationGraph aka Model is already rather abstract.

Your plan is quite good, but there are a few points which I would like to criticize:

I would rather go for the following scheme: a BrickComputationGraph and a lightweight Model which aggregates a BrickComputationGraph object:

class Model(object):
    """Interface for main loop extensions to get the cost, parameters, bricks.

    Attributes
    ----------
    cg : BrickComputationGraph

    """
    def __init__(self, outputs):
        self.cg = BrickComputationGraph(outputs)

    def get_parameters(self):
        if not hasattr(self, '_params'):
            return self.cg.parameters
        return self._params

    def set_parameters(self, params):
        self._params = params

    ...

The extensions would be expected to ask model when possible, and model.cg for advanced things that Model does not cater to.

bartvm commented 9 years ago

Do you want to add a get_parameters method to ComputationGraph? That would conflict with the current parameters property, let alone the fact that its interface consists almost entirely of properties.

Not add, it would simply replace the parameters property. It's only 4 extra characters.

I do not see where we get rid of an abstraction, because MutableComputationGraph aka Model is already rather abstract.

We get rid of an abstraction in the sense that Model will be, both in code and in principle, a mutable, brick-aware ComputationGraph. This means that in 90% of cases Model and ComputationGraph are completely interchangeable. This is different from having an entirely separate Model class which is used differently (e.g. model.cg.parameters vs. model.get_parameters()).

if we force Model to be a descendant of CG, we might face questions "should I use model.parameters or model.get_parameters()".

As mentioned, I would just replace model.parameters with get_parameters. I prefer having Model as a full subclass of ComputationGraph instead of creating a new class; at the cost of turning a property into a method, this is worthwhile for me.

you wish so desperately to pass a pure ComputationGraph to the main loop that you are ready to push a higher-level method get_parameters down the hierarchy.

get_parameters isn't much higher-level than parameters for a ComputationGraph object; the only difference is the @property decorator and four extra characters!

In fact, the "is-a" relationship between Model and CG does not fully hold as soon as set_parameters enters the game.

I don't fully understand this. Model is simply a subclass of ComputationGraph that implements a set_parameters method. They still share an interface, and hence are interchangeable in 90% of cases, which is the important part. This means that a large part of the framework still works without needing a Model class, and hence without needing BrickComputationGraph. If you want to make Model compulsory, this is a massive plus. It makes the code a lot more portable, simply by not indirectly forcing people to have their computation graph interpreted as if it had bricks, even when it doesn't.

So, my proposal would be more along the lines of:

class ComputationGraph(object):
    def get_parameters(self):
        return VariableFilter(roles=[PARAMETER])(self.shared_variables)

class BrickComputationGraph(ComputationGraph):
    def get_parameters(self, as_dict=False):
        parameters = super(BrickComputationGraph, self).get_parameters()
        if as_dict:
            return OrderedDict(...)
        return parameters

class Model(BrickComputationGraph):
    def get_parameters(self, **kwargs):
        if hasattr(self, '_parameters'):
            return self._parameters
        return super(Model, self).get_parameters(**kwargs)

    def set_parameters(self, parameters):
        self._parameters = parameters
This means that, if we want to, we can remove the need for users to interact with ComputationGraph entirely. They could just use Model in all those situations; no need to juggle ComputationGraph and Model throughout. There is also only one interface: no need to wonder when to use model.cg.parameters and when to use model.get_parameters().

From a code perspective, a user who wants to use Blocks' main loop but doesn't want to use bricks can simply use ComputationGraph, and will have essentially no imports of or links to blocks.bricks to deal with. I know that your proposed Model would work on CGs without bricks, but code-wise the dependency is definitely there!

rizar commented 9 years ago

Okay, @bartvm, I understand what you propose, and I more or less like it, but for one main point: I do not agree that the relationship between Model and BrickComputationGraph can be described by the word "is", that is, that the former can be a descendant of the latter. You propose to achieve it by forcing ComputationGraph to support the Model interface (whose existence you do not acknowledge, but which is get_objective, get_parameters, get_bricks), forcing it to have a method get_parameters which will look awkward when the rest are properties.

For a user who wants to use Blocks but doesn't want to use bricks, the Model can just as well have a ComputationGraph as its attribute. The get_bricks method could look as follows:

class Model(object):

    def get_bricks(self):
        if hasattr(self, '_bricks'):
            return self._bricks
        if isinstance(self.cg, BrickComputationGraph):
            return self.cg.top_bricks
        return []  # or raise ValueError
rizar commented 9 years ago

There is something more to say here: the Model we are discussing now is primitive, because there is very little in common between all the various models that Blocks may host. However, when I am doing a concrete experiment, I would like to have more functionality in my Model subclass. For instance, I am now working on speech recognition, and I know what my top bricks are and how to use them. It is convenient for me to have a class PhonemeRecognizer that also knows what the top variables are and can create computation graphs with them. This way this logic is not scattered across the 10 scripts that I use for model analysis.

class PhonemeRecognizer(object):

    def __init__(self, top_bricks):
        # create all input variables
        ...

    def build_cost_graph(self):
        ...

    def build_generate_graph(self):
        ...

I would like to inherit this class from Model, which I think makes sense. This way all extensions can use the advanced methods my extended model provides; for instance, they can ask it to sample phoneme transcriptions for a few given utterances from the validation set. But if we go your way, I can't do that, because I would need to specify the cost at creation time, which is not always possible:

recognizer = PhonemeRecognizer(top_bricks)
basic_cost = recognizer.build_cost_graph()
regularized_cost = apply_dropout(basic_cost, ...)
recognizer.set_objective(regularized_cost) # the point at which computation graph is known

I am not saying that things like this should be part of Blocks, the core library. But when you make Model be a ComputationGraph (as it is now), inheriting from this class becomes problematic.

rizar commented 9 years ago

Just an update: I found it quite convenient to put the functionality that I wanted to put in Model descendants into the top brick instead. This brick can own the input variables, compile beam search, and do whatever one wants.

This means that I now care much less if you, @bartvm, don't want to have a new Model hierarchy. My extensions will simply do top_brick, = self.main_loop.get_top_bricks().