mila-iqia / blocks

A Theano framework for building and training neural networks

Bricks initialization: discussion and possible improvements #337

Closed fvisin closed 9 years ago

fvisin commented 9 years ago

I'd like to discuss two potential issues of the current bricks initialization:

  1. I find it somewhat redundant and confusing that a brick has to be initialized even when the initialization arguments are given in the constructor. I would prefer the following behaviour:
    • all the initialization arguments are passed to the constructor: the brick object quietly calls initialize() right before the first time it is "used". This could be achieved with an _initialized flag set in the object.
    • some initialization arguments are not passed to the constructor: the brick object should either be initialized explicitly later by calling initialize(args_dict), where args_dict is a dictionary of parameters and values, or Blocks will try to initialize it automatically by exploiting the nested bricks hierarchy.
  2. In my opinion, setting the arguments to 0 by default and initializing them only when initialize() is called is potentially very harmful, because it can lead to silent failures, i.e. to the model running without raising errors but with wrong arguments. It would be preferable to fail when the brick has not been initialized either explicitly or automatically (nested bricks scenario).
bartvm commented 9 years ago

I agree in principle with these issues, but the decision was definitely made with these things in mind. One way we hope to alleviate "silent" failing is through https://github.com/bartvm/blocks/issues/244: using NaN instead of zero should make it clearer that you forgot to initialize, because everything (including the cost) will have NaN as a value before training has even started.
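
A minimal sketch of why NaN makes a forgotten initialization visible, using plain Python floats as stand-ins for Theano shared variables. The `linear_cost` function is hypothetical and only serves this illustration; it is not Blocks code.

```python
import math


def linear_cost(weights, bias, inputs, target):
    """Squared error of a toy linear model (stand-in for a real graph)."""
    output = sum(w * x for w, x in zip(weights, inputs)) + bias
    return (output - target) ** 2


inputs, target = [1.0, 2.0], 3.0

# Zero-filled parameters: the cost is a plausible-looking finite number.
zero_cost = linear_cost([0.0, 0.0], 0.0, inputs, target)

# NaN-filled parameters: the NaN propagates all the way to the cost.
nan = float("nan")
nan_cost = linear_cost([nan, nan], nan, inputs, target)

print(zero_cost)             # 9.0
print(math.isnan(nan_cost))  # True
```

With zeros the model runs and reports a cost that looks legitimate, whereas with NaN the very first cost value already signals that something was never initialized.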

The main problem is this: You strictly speaking don't know whether you can initialize or not. You don't know what arguments are needed; right now, it's almost always just weights_init and biases_init, but in principle the initialization process could rely on a range of other settings.

So the only way to initialize "by default" would be to use a try and except block. That's messy: You could be sampling from the RNG, so initialization could become irreproducible.
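
To illustrate the reproducibility concern, here is a minimal sketch in which the standard library `random` module stands in for the framework's RNG. The `initialize` function and the scheme callables are hypothetical; the only point is that a failed speculative attempt has already consumed random numbers.

```python
import random


def initialize(rng, weights_init, biases_init):
    """Toy initialization: draws the weights first, then needs biases_init."""
    weights = [weights_init(rng) for _ in range(3)]
    if biases_init is None:
        raise AttributeError("biases_init has not been set")
    biases = [biases_init(rng) for _ in range(2)]
    return weights, biases


def uniform(rng):
    return rng.random()


def constant_zero(rng):
    return 0.0


# Run A: a speculative attempt fails, then the real initialization runs.
rng_a = random.Random(1)
try:
    initialize(rng_a, uniform, None)  # fails, but has already sampled
except AttributeError:
    pass
weights_a, _ = initialize(rng_a, uniform, constant_zero)

# Run B: same seed, no speculative attempt.
rng_b = random.Random(1)
weights_b, _ = initialize(rng_b, uniform, constant_zero)

print(weights_a == weights_b)  # False: the failed attempt advanced the RNG
```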

I'm not sure what you mean by "exploiting the nested bricks hierarchy".

A more comprehensive solution to this would be to implement what is discussed in https://github.com/bartvm/blocks/issues/31. If we actually define in the signature which values are needed for which part of the life cycle, it would become possible to check if all necessary arguments for allocation/initialization were provided.

bartvm commented 9 years ago

Another issue that comes to mind with your approach: It can be confusing that initialize needs to be called in some cases, and not in others. (Especially if people are programmatically passing arguments.) The question then becomes what a second call to initialize should result in: If you re-initialize, the user might be overriding the old initialization unknowingly (causing problems with reproducibility). If you don't re-initialize, it becomes impossible for the user to change the initialization settings and re-initialize.

This could potentially be solved by raising an error if initialization gets called for the second time, requiring the user to pass a keyword argument force=True if they really want to re-initialize.
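
A minimal sketch of that behaviour, assuming a toy class rather than a real brick. `ToyBrick` and `InitializationError` are hypothetical names used only for this illustration.

```python
class InitializationError(Exception):
    """Hypothetical error type, used only for this sketch."""


class ToyBrick(object):
    def __init__(self, weights_init):
        self.weights_init = weights_init
        self.initialized = False
        self.weights = None

    def initialize(self, force=False):
        # Refuse a second initialization unless it is explicitly forced.
        if self.initialized and not force:
            raise InitializationError(
                "already initialized; pass force=True to re-initialize")
        self.weights = self.weights_init()
        self.initialized = True


brick = ToyBrick(weights_init=lambda: [0.1, 0.2])
brick.initialize()              # first call: fine
try:
    brick.initialize()          # second call: refused
except InitializationError as error:
    print(error)
brick.weights_init = lambda: [0.5, 0.5]
brick.initialize(force=True)    # explicit re-initialization
print(brick.weights)            # [0.5, 0.5]
```

An automated script cannot silently overwrite parameters this way, yet changing the scheme and re-initializing stays possible within one session.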

@rizar I'm starting to wonder if maybe we should make #31 part of 0.1. On the one hand, it's not strictly necessary for the functioning of the framework; on the other hand, this initialization thing seems to be a pretty big pitfall. (I've forgotten to initialize a few times myself as well.) The only proper solution I see though is through fixing #31.

fvisin commented 9 years ago

With "exploiting the nested bricks hierarchy" I referred to the case in which the initialization parameters (e.g. the dimensions) are set in the child by the parent, so that you don't need to specify the input/output arguments twice.

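If I am not wrong, your problem is that to support lazy initialization and "automatic initialization" you have to allow the brick to be allocated even if it is only partially initialized (or not initialized at all). Each argument should be initialized at least once, either at construction time or somewhere between the construction of the brick and its "usage". My idea of the initialize function is as follows:

  1. If you pass a value in the constructor, the constructor will quietly call initialize with the provided value.
  2. If you haven't initialized a parameter at construction time, you have to call initialize later. In this case, if you try to initialize a variable that was already initialized, a warning can be raised.
  3. The only exception to this second point is for parameters that can be inferred by the hierarchy of bricks (basically in/out dimensions).

For what concerns the default "empty initialization value", None is way better than 0, even if I would prefer an ad-hoc value (e.g. a custom InitializationMissing object).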

fvisin commented 9 years ago

Sorry, in the first point of the initialize function I meant that the apply function will quietly call initialize.

bartvm commented 9 years ago

On Feb 23, 2015 4:50 PM, "Francesco" notifications@github.com wrote:

With "exploiting the nested bricks hierarchy" I referred to the case in which the initialization parameters (e.g. the dimensions) are set in the child by the parent, so that you don't need to specify the input/output arguments twice.

This is handled by the push methods already, and shouldn't require any modification.

If I am not wrong, your problem is that to support lazy initialization and "automatic initialization" you have to allow the brick to be allocated even if it is only partially initialized (or not initialized at all). Each argument should be initialized at least once, either at construction time or somewhere between the construction of the brick and its "usage". My idea of the initialize function is as follows:

If you pass a value in the constructor, the constructor will quietly call initialize with the provided value.

As I mentioned, I agree that this could work, but it would require knowing which values matter to initialization. For that, the mentioned issue needs to be addressed.

If you haven't initialized a parameter at construction time, you have to call initialize later. In this case, if you try to initialize a variable that was already initialized, a warning can be raised.

A warning won't cut it in an automated script. I prefer my proposal of raising an error.

The only exception to this second point is for parameters that can be inferred by the hierarchy of bricks (basically in/out dimensions).

For what concerns the default "empty initialization value", None is way better than 0, even if I would prefer an ad-hoc value (e.g. a custom InitializationMissing object).

It's a Theano shared array, so you can only set it to a number, infinity, or NaN.

bartvm commented 9 years ago

Gmail really messed that up... I'll answer again when I get home! Either way, the crux of the problem is that right now there is no way of knowing whether or not the user passed the necessary arguments. Without that, this plan doesn't work.

fvisin commented 9 years ago

No problem. :)

As I mentioned, I agree that this could work, but it would require knowing which values matter to initialization. For that, the mentioned issue needs to be addressed.

Oh, ok. Now I see your point. To get around it a lazy allocation might work. Allocation "links" inputs and outputs. You don't need to actually allocate them in the exact moment you call Brick.apply(). You can store them in the Brick object and allocate the shared variables right before using the object.

This would allow you to set the default values to a custom InitializationMissing object (or whatever you want) and check at usage time if everything you need has been initialized or not (at this point you should be able to tell, right?).

The brick lifecycle would go from:

configuration --> allocation --> initialization --> usage

to (round brackets mean optional):

configuration (+ partial or full initialization) --> (initialization) --> (initialization) --> ... --> Brick.apply --> (initialization) --> ... --> usage with implicit allocation

A warning won't cut it in an automated script. I prefer my proposal of raising an error.

A warning would make sense in the proposed lifecycle, since you could set the same parameter multiple times if you want to.

bartvm commented 9 years ago

I'm lost in several places I'm afraid. To make sure we're talking about the same thing, have you read the documentation on the current life-cycle?

Inputs and outputs of Brick.apply are linked through parameters. That is, you must allocate the parameters before calling Brick.apply. If I don't have W and b, how am I going to return y from linear.apply(x)? Simply put, Brick.apply creates part of the computation graph, and for this we need all the variables, including parameters, to exist. So the life-cycle must involve:

configuration -> allocation -> Brick.apply
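
The ordering above can be sketched with a toy class. This is a pure-Python stand-in under loud assumptions: `ToyLinear` is hypothetical, it computes values eagerly instead of building a symbolic Theano graph, and it is not how Blocks implements `Linear`.

```python
class ToyLinear(object):
    """Pure-Python stand-in for a brick; not the Blocks implementation."""

    def __init__(self, input_dim=None, output_dim=None):
        # Configuration: only plain attributes are set, no parameters exist.
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.allocated = False

    def allocate(self):
        if self.input_dim is None or self.output_dim is None:
            raise ValueError("input_dim and output_dim must be configured")
        nan = float("nan")
        # Allocation: parameters now exist, filled with a placeholder value.
        self.W = [[nan] * self.output_dim for _ in range(self.input_dim)]
        self.b = [nan] * self.output_dim
        self.allocated = True

    def apply(self, x):
        # Application needs W and b, so it allocates them if necessary.
        if not self.allocated:
            self.allocate()
        return [sum(x[i] * self.W[i][j] for i in range(self.input_dim))
                + self.b[j] for j in range(self.output_dim)]


linear = ToyLinear(input_dim=2)
try:
    linear.apply([1.0, 2.0])     # output_dim missing: allocation fails
except ValueError as error:
    print(error)
linear.output_dim = 3            # configuration can be completed later
y = linear.apply([1.0, 2.0])     # allocates, then produces the output
print(len(y), linear.allocated)  # 3 True
```

The output exists as soon as `apply` returns, which is why there is no later hook at which allocation could still happen.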

Also, keep in mind that there is no "usage" step at which we can run code. After a user calls Brick.apply, the brick could basically be thrown away. Usually a user will call Brick.initialize, but they don't need to. A user should be free to initialize the parameters as they please without calling Brick.initialize (maybe they want to do something much more complicated than our initialization schemes support).

I am still not sure what you mean with the custom InitializationMissing. Initialization is nothing but setting values on Theano shared variables, so we only have the options 0, Inf or NaN.

Allowing initialization to occur many times is dangerous, because a user might think they are setting the parameter values while they are actually overriding them. Currently this isn't an issue, because you know that you must make at least one call to initialize. If initialization happens transparently, then this does become an issue.

rizar commented 9 years ago

I will join this lively discussion:)

First, @bartvm is right: for technical reasons we can not support fake uninitialized parameters like InitializationMissing, because the uninitialized parameters must be shared variables of the proper type in order for Bricks to be as flexible as they currently are.

#31 was a long discussion, and I am not sure which of the ideas mentioned there you want to implement. I am not a big fan of the idea to have a list of parameters required for initialization: this belongs to the realm of automagical things that I do not associate with our "lean and mean" motto.

But the issues you are talking about do exist. The first two items of @fvisin's post address two issues that we currently have: redundancy and lack of safety.

At this point I am not sure that a simple solution that would suit all users exists. But we could support a few modes and let the user choose the one they prefer:

(1) Explicit mode.

Just like now, but with NaN's instead of zeros in uninitialized parameters. One must initialize all bricks which introduces certain redundancy, but this is a moderate price for flexibility. I personally like the following idiom:

model = Model(cost)
for brick in model.get_top_bricks():
    brick.initialize()

(2) "On apply" mode.

We had this idea at an early stage of development: initialization can be triggered when a brick is first applied. It can go with a flag indicating that initialization has already been done, to prevent overwriting.

(3) "On construct" mode.

Initialization is run right after the constructor. I guess most people coming from PyLearn2 would prefer it to be done this way. Perfect safety, zero redundancy, but not so much flexibility. We wanted to have it by switching Bricks.lazy to True, but it looks like this is only mentioned in the docs.

The mode can be chosen in blocks.config. I guess most people would use (3), so this can be made the default.

bartvm commented 9 years ago

I am talking about using what you referred to as NoneI and NoneA; using those would tell you exactly what parameters are needed for a particular stage in the life-cycle. I'm not 100% sold on the idea though; I tend to agree that logic like "initialize if (a) otherwise do (b)" is a bit too complex.

(3) I guess you mean False? As a default this doesn't work in many cases though. There are many bricks which even say in their docstring: Works with lazy only. Having a framework which has failing bricks by default doesn't seem like a good option.

(2) Would work by default I guess. The flag already exists (initialized). This was actually what we used to do in the beginning, but we stopped doing it, I guess because we didn't like forcing the user to provide initialization schemes in order to create a computation graph. If we make this configurable with a flag, however, it seems reasonable. I would prefer raising an error and supporting the force argument though. Otherwise you're going to have one of these two situations:

a. I developed using initialize_on_apply = True, but now realize that there is one brick I want to initialize differently. I need to start my interpreter session from scratch to change the configuration of this particular brick.
b. Re-initialization is allowed, so accidentally I've been re-initializing instead of initializing.

So to make sure we all understand the same thing, I'm imagining it like this:

>>> config.lazy_initialize  # The default
False
>>> Linear(input_dim=10, output_dim=5).apply(x)  # Doesn't work, because initialization failed
AttributeError: ... 
>>> Linear(input_dim=10, output_dim=5, weights_init=IsotropicGaussian(),
...        biases_init=Constant(0)).apply(x)  # This works just fine
linear_apply_output
>>> config.lazy_initialize = True  # This would revert back to the current behaviour
>>> Linear(input_dim=10, output_dim=5).apply(x)  # The values of the parameters are set to NaN
y_apply_output

(1) Although I don't mind the manual initialization, I don't like the idea of promoting a method which relies on the Model class. I know that it can be nice to work with, but we shouldn't rely on it for something as fundamental as this. (I know we don't "rely" on it, but we shouldn't confuse people by making them think they need it.)

fvisin commented 9 years ago

Welcome @rizar :)

I am not sure if what I am proposing is unfeasible or I am not able to explain it in a comprehensible way. At this point, I am inclined to say both of them!

Inputs and outputs of Brick.apply are linked through parameters. That is, you must allocate the parameters before calling Brick.apply. If I don't have W and b, how am I going to return y from linear.apply(x)? Simply put, Brick.apply creates part of the computation graph, and for this we need all the variables, including parameters, to exist.

I guess what I am proposing can be summarized in the following questions: is it possible to delay the allocation of the computational graph? Can you pretend you allocated the computational graph when you call apply() and actually allocate it later? How far can you delay it? When is the first moment you actually need the computational graph to be allocated?

fvisin commented 9 years ago

What I mean by "pretend you allocated the computational graph" is: can you output a shared variable that is not part of the computational graph yet, keep a reference to it in the object and, at a later time, build the computational graph using this reference and the parameters?

bartvm commented 9 years ago

No, there's no point anymore after calling apply in which we can intervene. The idea is that once a user calls apply, he receives a Theano graph, after which there is no expectation that he uses any other part of the framework (like the main loop). Hence, calling apply must return the final graph.

Besides that though, giving references to fake Theano variables which are magically replaced at a later date sounds like black magic that I am worried would make for a pretty brittle framework.

fvisin commented 9 years ago

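Ok, in this case my idea doesn't make sense. :) Sorry for not being clearer from the beginning.

Let me check if I understood the terminology you are using:

  • Configuration: set part or all of the attributes of the Brick object. Can take place when the Brick object is created, by setting the arguments of the constructor, or at a later time by directly setting the object's attributes. No Theano variable is created in this phase.
  • Allocation: allocate the Theano shared variables for the parameters of the Brick. The Theano variables created in this way are initialized to a default initialization value (0 at the moment, in the future probably NaN).
  • Application: if not done already, quietly calls allocate(). Then a part of the Theano computational graph is instantiated, linking the input and the output of the Brick through its parameters. Cannot be performed (i.e. results in an error) if the Brick object is not fully configured.
  • Initialization: set the numerical values of the Theano variables allocated for the parameters of the Brick. The user-provided value will replace the default initialization value.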

These definitions could be used for https://github.com/bartvm/blocks/issues/338. Please correct content errors, if any, and feel free to rephrase the sentences to improve them if needed.

Based on the assumption that the previous definitions are correct:

  1. Initialization is a misleading name, as (at least to me) the initialization is performed at allocation time. You might call it Value push to be consistent with the nested objects scenario (if I am not wrong! :D).
  2. You could keep track of the state of the Theano variables (initialized=true/false) in the Brick object and raise an error in the main loop if they have not been initialized. This wouldn't help a user that uses the Theano graph outside of the framework, but I guess that in the hybrid Blocks+vanilla Theano scenario it can be acceptable to assume that the user should pay more attention. EDIT: or alternatively could you annotate the Theano variables accordingly?
bartvm commented 9 years ago

All those definitions seem correct to me!

I wouldn't call initialisation misleading, but I can understand that it could be confusing. It refers to what is often called the "initialisation scheme" i.e. sparse, uniform, Gaussian, etc. with particular settings. While confusing in the context of programming (i.e. we're not initialising an object) it makes sense in the context of neural networks I think. I'm not a big fan of changing the name to something more abstract, but could live with it if other people agree with you.

I think that the sanity check is actually a good idea. It won't work if people are using their own parameters, but as long as they are using bricks, it makes sense to check. We perform similar sanity checks for parameters being trained right now in the main loop.

On Feb 24, 2015 11:41 AM, "Francesco" notifications@github.com wrote:

Ok, in this case my idea doesn't make sense. :) Sorry for not being more clear from the beginning.

Let me check if I understood the terminology you are using:

  • Configuration: set part or all of the attributes of the Brick object. Can take place when the Brick object is created, by setting the arguments of the constructor, or at a later time directly setting the object's attributes. No Theano variable is created in this phase.
  • Allocation: allocate the theano shared variables for the parameters of the Brick. The Theano variables created in this way are initialized to a default initialization value (0 at the moment, in the future probably NaN).
  • Application: if not done already, quietly calls allocate(). Then a part of the Theano computational graph is instantiated, linking the input and the output of the Brick through its parameters. Cannot be performed (i.e. results in an error) if the Brick object is not fully configured.
  • Initialization: set the numerical values of the Theano variables allocated for the parameters of the Brick. The user-provided value will replace the default initialization value.

These definitions could be used for #338 https://github.com/bartvm/blocks/issues/338. Please correct content errors, if any, and feel free to rephrase the sentences to improve them if needed.

Basing on the assumption that the previous definitions are correct:

  1. Initialization is a misleading name, as (at least to me) the initialization is performed at allocation time. You might call it Value push to be consistent with the nested objects scenario (if I am not wrong! :D).
  2. You could keep trace of the state of the theano variables (initialized=true/false) in the Brick object and raise an error in the main loop if they have not been initialized. This wouldn't help a user that uses the theano graph outside of the framework, but I guess that in the hybrid Blocks+vanilla theano scenario it can be acceptable to assume that the user should pay more attention.

fvisin commented 9 years ago

All those definitions seem correct to me!

Great! :) I will create a PR for the documentation then.

I wouldn't call initialisation misleading, but I can understand that it could be confusing. It refers to what is often called the "initialisation scheme" i.e. sparse, uniform, Gaussian, etc. with particular settings. While confusing in the context of programming (i.e. we're not initialising an object) it makes sense in the context of neural networks I think. I'm not a big fan of changing the name to something more abstract, but could live with it if other people agree with you.

Maybe something more detailed, even if a little verbose, such as push_init_scheme() could work?

I think that the sanity check is actually a good idea. It won't work if people are using their own parameters, but as long as they are using bricks, it makes sense to check. We perform similar sanity checks for parameters being trained right now in the main loop.

Ok, great. In this case you might also quietly run push_init_scheme() at allocation time (basically, you could initialize the theano variables to the scheme, if the user provided it, or else to the default NaN/0 value).

This way, if the user wants to modify the initialization scheme after the allocation he can change it and explicitly run push_init_scheme(), but in the general case this is not needed. If you want to make sure the user doesn't forget to call push_init_scheme() you can set a flag whenever the scheme is changed, unset it when push_init_scheme() is called and perform a sanity check in the main loop. Does it make sense?
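
A minimal sketch of that flag-based check, under the assumption that the only setting relevant to initialization is `weights_init` (which, as discussed in this thread, cannot be known in general). `FlaggedBrick` and `check_initialized` are hypothetical names, not part of Blocks.

```python
class FlaggedBrick(object):
    """Toy brick that tracks whether its scheme has been pushed."""

    def __init__(self, name, weights_init=None):
        self.name = name
        self.weights = None
        self.weights_init = weights_init  # goes through the setter below

    @property
    def weights_init(self):
        return self._weights_init

    @weights_init.setter
    def weights_init(self, scheme):
        self._weights_init = scheme
        self.scheme_pending = True   # set the flag on every change

    def push_init_scheme(self):
        self.weights = self._weights_init()
        self.scheme_pending = False  # unset the flag once pushed


def check_initialized(bricks):
    """Sanity check that a main loop could run before training."""
    pending = [brick.name for brick in bricks if brick.scheme_pending]
    if pending:
        raise RuntimeError("scheme not pushed for: " + ", ".join(pending))


brick = FlaggedBrick("linear", weights_init=lambda: [0.0, 0.0])
brick.push_init_scheme()
check_initialized([brick])               # passes silently
brick.weights_init = lambda: [1.0, 1.0]  # scheme changed, flag set again
try:
    check_initialized([brick])
except RuntimeError as error:
    print(error)  # scheme not pushed for: linear
```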

rizar commented 9 years ago

Hold on guys.

Initialization is a great name. This is the stage when a brick initializes its parameters. It corresponds to lines like "we initialized the parameters of our networks by sampling them from a uniform distribution".

Initialization is a misleading name, as (at least to me) the initialization is performed at allocation time.

I am sorry, but for me allocation and initialization of an array are two clearly different things.

bartvm commented 9 years ago

@rizar I agree with your understanding of initialization, but I think that the confusion arises from thinking that it was referring to initializing an object, which is an understandable confusion at first sight.

@fvisin That again runs into the problem I mentioned: We don't know whether the user provided the scheme or not. In order to know that, we would need to know which configuration the initialize method relies on, which we don't. Usually it's weights_init and biases_init, but technically speaking it could be anything, and there is no way to tell from the constructor signature or the class definition. This also makes it impossible to set a flag whenever the initialization scheme changed.

So there is no way to do something "only if the user configured a scheme". The only option we have currently is to create a global setting that has two options:

  1. Always initialize on allocation, crash if it fails.
  2. Don't initialize, expect user to call initialize manually.

Making (1) the default could make sense; it removes the possibility of people forgetting to initialize and wondering why their training is full of NaN. Power users could use (2) and simply set blocks.config.initialize_on_allocate = False when they want to perform parameter initialization manually.
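
A rough sketch of how such a global switch could behave. The `Config` object and `ToyBrick` below are hypothetical stand-ins, and `initialize_on_allocate` is only the setting name proposed above, not an existing option.

```python
class Config(object):
    """Hypothetical stand-in for a global configuration object."""
    initialize_on_allocate = True


config = Config()


class ToyBrick(object):
    def __init__(self, weights_init=None):
        self.weights_init = weights_init
        self.weights = None

    def allocate(self):
        # Parameters start out as NaN placeholders.
        self.weights = [float("nan")] * 2
        if config.initialize_on_allocate:
            # Option 1: initialize right away and crash if that fails.
            self.initialize()

    def initialize(self):
        if self.weights_init is None:
            raise ValueError("weights_init has not been set")
        self.weights = [self.weights_init() for _ in self.weights]


# Option 1 (default): a configured brick is ready right after allocation.
ready = ToyBrick(weights_init=lambda: 0.5)
ready.allocate()
print(ready.weights)  # [0.5, 0.5]

# Option 1 with a missing scheme fails loudly instead of training on NaN.
try:
    ToyBrick().allocate()
except ValueError as error:
    print(error)  # weights_init has not been set

# Option 2: power users switch the behaviour off and initialize by hand.
config.initialize_on_allocate = False
manual = ToyBrick()
manual.allocate()            # no error, parameters stay NaN
manual.weights = [1.0, 2.0]  # custom initialization
```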

rizar commented 9 years ago

@bartvm, a late answer to one of your early posts: it is true that we used to have a number of lazy-only bricks, but as I am cleaning up there are already far fewer of those. It turned out to be not so nice to have such bricks; for instance, they are not easily testable.

I support running initialization right after allocation by default. We can implement this basic logic without any exception first and then gradually make it more flexible, for instance I would enjoy overriding this setting for individual bricks.

fvisin commented 9 years ago

@rizar initialize might be mistaken for the object's initialization. Calling it scheme_initialize or push_init_scheme avoids this confusion, in my opinion, but it might just be me. If you think initialize is clear enough, I won't insist.

@bartvm I don't get why you cannot specify which initialization scheme an object expects in the object itself, but at this point this is definitely my limit. I guess I am not familiar enough with Blocks and Theano to have a clear picture of how everything works. I will play a little with Blocks first and come back to this discussion if I have something useful to say afterwards. Thank you for helping me understand things so far!

bartvm commented 9 years ago

@fvisin You could, of course, but it means that we would need to annotate each brick somehow, saying which values are needed for the call to initialize. @rizar commented on this:

I am not a big fan of the idea to have a list of parameters required for initialization: this belongs to the realm of automagical things that I do not associate with our "lean and mean" motto.

The issue I was referring to, #31, would actually automatically perform this kind of annotation, which is why I mentioned it.

The issue then is simply weighing the pros and cons: I have to agree with @rizar that requiring each brick to somehow carry information about when it is going to use each part of its configuration would, at least in the short term, not be worth the trouble.

fvisin commented 9 years ago

Ok, I see. I like the idea of having things well defined, even at the cost of more under-the-hood machinery, but I agree that this might not be a priority task in the immediate future. In the meantime https://github.com/bartvm/blocks/issues/244 seems a good way to avoid most of the problems of manual initialization.

I don't like the idea of supporting many modes and letting the user choose, I would prefer the developers of the framework to choose a unique design to avoid confusion. The problems I foresee letting the users choose the mode are:

bartvm commented 9 years ago

In that case, we're back to the current solution, because we can't go to initialization on allocation without losing functionality. We can only do this if there's a way to retrieve that functionality through global configuration.

I think you're right that merging code written with a different setting would be problematic, and I think that would be a strong argument not to make the changes discussed.

Changing the configuration file is not the only way you can change configuration though, you can also just set it programmatically in a script or in an interpreter session.

bartvm commented 9 years ago

Oh, and let's not forget we can also add the sanity check in the main loop. Closing this for now because it's not an issue that will need action, but feel free to continue the discussion.

fvisin commented 9 years ago

In that case, we're back to the current solution, because we can't go to initialization on allocation without losing functionality. We can only do this if there's a way to retrieve that functionality through global configuration.

I'm not sure, but it seems to me that "partial initialization on allocation" could work without losing functionality, if you allow the object to be re-initialized. Provided that you annotate each brick saying which values are needed for the initialization, you can initialize on allocation either to the provided value, if any, or to NaN. After allocation, the user can then modify the initialization of the parameters by passing an initialization scheme and calling initialize (this time it has to be called explicitly). Again, provided that you have the annotations, you can have a sanity check in the main loop that raises an error if the user modified the initialization scheme and didn't call initialize.

This seems to me the most flexible solution and, at the same time, the one that best avoids unnecessary redundancy. But again, I agree that this isn't necessarily a priority at the moment. With the current setting the users will get used to calling initialize after allocation, but this would be OK (even if useless) also in the new setting, so there is no harm I guess.

Changing the configuration file is not the only way you can change configuration though, you can also just set it programmatically in a script or in an interpreter session.

Yes, sure. It was more an argument against having a general configuration flag than against allowing different modes.

bartvm commented 9 years ago

With "In that case" I was referring to the case of no global configuration, and no major overhaul of the framework that annotates configuration settings with the stage at which they are required.

As I said in several of my comments, if you allow for this annotation, I agree that it is plausible (which is why I referred to #31 as one way of implementing it). However, it's not perfect either, and requires some thinking on how to handle unintended silent re-initialization. I don't like your proposal of subsequent calls to initialize being a no-op or silently re-initializing; both are dangerous for different reasons.

This is how I would like to see the annotation implemented (rather than what I proposed in #31), and it's something I'd be very willing to consider. I think that beyond the initialization issue, it would allow us to be a lot safer than we are now when calling e.g. allocate blindly hoping that the user has configured everything. If they haven't, the resulting errors are often confusing. All of this could be solved this way, but it would add to the complexity of bricks, forcing them to explicitly define for each configuration when they are needed.

Of course, users could still implement their own bricks without using @lazy at all, so I think it's reasonable. If we could make this example work, I think it would actually be very nice:

>>> linear = Linear(input_dim=10, use_bias=False)
>>> linear.output_dim  # A singleton for non-configured values needed for allocation
AllocationNone
>>> linear.apply(x)
AllocationError: The brick 'linear' cannot be applied yet, because the value(s)
'output_dim' has/have not been set yet.
>>> linear.output_dim = 10
>>> linear.apply(x)
linear_apply_output
>>> linear.params[0].get_value()
[[NaN, ...]]
>>> linear.weights_init
InitializationNone
>>> linear.initialize()
InitializationError: The brick 'linear' cannot initialize its parameters yet, because
the value(s) 'weights_init' has/have not been set yet.
>>> linear.weights_init = IsotropicGaussian()
>>> linear.initialize()
>>> linear2 = Linear(input_dim=10, output_dim=10, weights_init=IsotropicGaussian(),
...                  use_bias=False)
>>> linear2.params[0].get_value()
[[0.0423, ...]]
>>> linear2.initialize()
InitializationError: The brick 'linear' has already been initialized. If you are sure that you want to
re-initialize your parameters, use `force=True`
>>> linear2.initialize(force=True)

The lazy decorator would look something like:

class Initializable(Brick):
    has_biases = True

    @lazy(initialization=['weights_init'])
    def __init__(self, weights_init, biases_init=None, use_bias=True,
                 seed=None, **kwargs):
        super(Initializable, self).__init__(**kwargs)
        self.weights_init = weights_init
        if self.has_biases:
            self.biases_init = biases_init
        elif biases_init is not None or not use_bias:
            raise ValueError("This brick does not support biases config")
        self.use_bias = use_bias  

    ....

class Linear(Initializable, Feedforward):
    @lazy(allocation=['input_dim', 'output_dim'])
    def __init__(self, input_dim, output_dim, **kwargs):
        super(Linear, self).__init__(**kwargs)
        self.input_dim = input_dim
        self.output_dim = output_dim

    ...
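
To make the idea concrete, here is a minimal, self-contained sketch of what such a decorator could do. Everything in it is an assumption made for illustration: the sentinel classes, the single-class `ToyLinear` and the `missing_for_allocation` helper are hypothetical, and the sketch does not claim to be how Blocks implements `lazy`.

```python
import functools
import inspect


class AllocationNone(object):
    """Hypothetical sentinel: value still needed before allocation."""


class InitializationNone(object):
    """Hypothetical sentinel: value still needed before initialization."""


def lazy(allocation=(), initialization=()):
    """Let __init__ be called without the listed arguments."""
    def decorator(init):
        names = inspect.getfullargspec(init).args[1:]  # skip self

        @functools.wraps(init)
        def wrapper(self, *args, **kwargs):
            # Names the caller actually provided, positionally or by keyword.
            given = set(names[:len(args)]) | set(kwargs)
            for name in allocation:
                if name not in given:
                    kwargs[name] = AllocationNone
            for name in initialization:
                if name not in given:
                    kwargs[name] = InitializationNone
            return init(self, *args, **kwargs)
        return wrapper
    return decorator


class ToyLinear(object):
    @lazy(allocation=['input_dim', 'output_dim'],
          initialization=['weights_init'])
    def __init__(self, input_dim, output_dim, weights_init):
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.weights_init = weights_init

    def missing_for_allocation(self):
        return [name for name in ('input_dim', 'output_dim')
                if getattr(self, name) is AllocationNone]


linear = ToyLinear(input_dim=10)
print(linear.output_dim is AllocationNone)        # True
print(linear.weights_init is InitializationNone)  # True
print(linear.missing_for_allocation())            # ['output_dim']
```

Because each missing value carries the stage at which it is needed, a method such as `allocate` or `initialize` could report exactly which settings are still missing instead of failing with a confusing error.
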
fvisin commented 9 years ago

I like your idea!

It might be extended to use a metaclass instead of a function decorator, so that you can override __setattr__(self, name, value) to perform the initialization when a parameter is set.

I don't know metaclasses very well, so I am not completely sure it is possible to achieve what I am proposing. I should think about it more thoroughly, but what do you think of the general idea?

Your example would become something like this:

>>> linear = Linear(input_dim=10, use_bias=False)
>>> linear.output_dim  
AttributeError: Linear instance has no attribute 'output_dim'
>>> linear.apply(x)
AllocationError: The brick 'linear' cannot be applied yet, because the value(s)
'output_dim' has/have not been set yet.
>>> linear.output_dim = 10
>>> linear.apply(x)
linear_apply_output
>>> linear.params[0].get_value()
[[NaN, ...]]
>>> linear.weights_init
AttributeError: Linear instance has no attribute 'weights_init'
# when weights_init is set, the corresponding theano variable is automatically initialized
>>> linear.weights_init = IsotropicGaussian()
# if you set weights_init again, the theano variable is not initialized again
>>> linear.weights_init = IsotropicGaussian()
Warning: 'weights_init' was already set, so it will not be automatically initialized. If you 
want to force re-initialization, explicitly call initialize with `force=True`
>>> linear.initialize()
InitializationError: The brick 'linear' has already been initialized. If you are
sure that you want to re-initialize your parameters, use `force=True`
...

And a very tentative metaclass could be something along these lines:

import warnings


def _lazy_setattr(self, name, value):
    if name in self.__initialization__:
        # perform theano initialization
        ...
        # names that were already set on this instance
        initialized = self.__dict__.setdefault('__initialized_list__', [])
        # raise a warning if the variable is already initialized
        if name in initialized:
            warnings.warn("'" + name + "' was already set, so it will not be "
                          "automatically initialized. If you want to force "
                          "re-initialization, explicitly call initialize with "
                          "`force=True`")
        else:
            initialized.append(name)
        ...
    return self.__orig_setattr__(name, value)


# metaclass definition
class LazyClass(type):
    def __new__(mcs, class_name, parents, attrs):
        # I am not sure this is the best way to override a method with a
        # metaclass, but it could be something like this:
        # Keep the default __setattr__ available as __orig_setattr__
        attrs['__orig_setattr__'] = object.__setattr__
        # Use the lazy version as __setattr__
        attrs['__setattr__'] = _lazy_setattr
        # Create and return the class
        return type.__new__(mcs, class_name, parents, attrs)

# class
class Linear(Brick):
    __metaclass__ = LazyClass
    # Set the attributes required for the meta-class.
    __initialization__ = ['weights_init']
    # allocation could be managed similarly
    # __allocation__ = ['input_dim', 'output_dim']

    has_biases = True

    def __init__(self, input_dim, output_dim, **kwargs):
        ...
        super(Linear, self).__init__(**kwargs)