Keras: near-future directions

fchollet commented 8 years ago

Here are a few things I think would be valuable to add to Keras in the near future.

These are mere suggestions. You are welcome to discuss them, contest them, add your own ideas... I would like the development of Keras to be increasingly driven by the community. I don't want to be a bottleneck of development.

I won't be writing code myself, but I will happily give feedback and advice to any contributor who wants to tackle some of these features. I will also keep reviewing and merging PRs.

Here's a list of features. Their focus is on simplicity and user experience.

Automatic tensor shape inference. We should be able to add layers to a Sequential mode or a Graph model without having to worry about input shapes.
Introduction of Input layers as part of the above, to define expected input data dimensions.
Ability to re-initialize layer weights without re-compiling a model.
New API for the Graph container. Some thoughts here. As part of this change I believe we should get rid of the add_input and add_output methods, which are cumbersome. A single add method should suffice.
Thorough checking of assumptions made about user-provided constructor arguments throughout the codebase, and raising helpful and clear error messages when some assumptions aren't respected. Ideally no error will ever be raised by Theano, because we will be catching everything on the Keras side. This will make debugging of Keras models much faster and easier.
Abstracting all references to Theano functionality to a Keras backend. This will allow us to implement non-Theano backends in the near future (for instance, based on Nervana's Neon library), that could be readily plugged into Keras in a seamless way. This will allow Keras to keep up with new advances in symbolic tensor libraries (Neon, CGT, etc...)

elanmart commented 8 years ago

@fchollet How would You like to deal with the add_input method? All unused names in the namespace are considered inputs?

fchollet commented 8 years ago

@elanmart that would be way too implicit. No, it's actually really simple: after we introduce an Input layer (which is necessary for automatic tensor shape inference), you don't need to mark certain layers as inputs via something like add_input.

EderSantana commented 8 years ago

Ability to re-initialize layer weights without re-compiling a model.

Don't forget this also means re-initialize the optimizer (for adam, rmsprop and other moving averages).

Abstracting all references to Theano functionality to a Keras backend.

How would you train models with different backends though? You pick and choose one at a time right?

I would also suggest a better way of dealing with desired output shapes. Right now, Keras implicitly assumes the desired is either a matrix (samples x dim) or a tensor3 (samples x time x dim). I had an image as desired and it took me days (partially my bad though) to realize what was wrong, my (samples x rows x cols) was being reshaped as (samples-rows x cols) without warning. This is really dangerous since the method just compiled and ran like nothing was wrong.

fchollet commented 8 years ago

@EderSantana right, I also think a simplified and more intuitive way of applying loss functions, masks and weights would be welcome.

hugman commented 8 years ago

Abstracting all references to Theano functionality to a Keras backend. This will allow us to implement non-Theano backends in the near future

It would be most important and valuable features in the future.

Also I think we should be able to implement any neural network easily using Keras backend API only without complicated theano codes (for instance, implementing GRU with keras api only like #620 )

fchollet commented 8 years ago

Also I think we should be able to implement any neural network easily using Keras backend API only without complicated theano codes (for instance, implementing GRU with keras api only like #620 )

Yes, the recurrent Container would be great to have. It could really speed up experimenting with different RNN architectures.

pranv commented 8 years ago

First, I'll apologize for not completing the caffe PR yet. I'll try to do it over the weekend.

Maybe some of the things done in Caffe PR could be used in shape inference and inferring input and output nodes?

The very first and bad implementation of non sequential models that I wrote did something like that - fallback to recently added layer if previous node is not explicitly mentioned.

matsuyamax commented 8 years ago

Automatic tensor shape inference

I have started doing this in my fork because current way of using convolutional layers is too difficult. I can do the same with all layers and make a PR.

EderSantana commented 8 years ago

@matsuyamax in your solution can the user still define the shapes himself and continue compatible with the current API?

matsuyamax commented 8 years ago

@EderSantana I don't think compatibility can be achieved if we make input dimensions optional, because they are not keyword arguments. Constructor of convolutional layers looks like this:

def __init__(self, input_dim, nb_filter, filter_length, init='uniform', activation='linear', ...)

Currently we use it like this:

layer = Convolution1D(input_dim, nb_filter, filter_length, init='uniform', activation='linear', ...)

We want to make input_dim optional, to use it like this:

layer = Convolution1D(nb_filter, filter_length, init='uniform', activation='linear', ...)

That's not possible with compatibility.

EderSantana commented 8 years ago

@matsuyamax that wouldn't be the most smooth solution I think. Sometimes we just need a quick layer and skip the lazy-initialization that inferred shapes would impose. What if input_dim is optional and the inferred value has the last word? Like, for example, lets assume the Dense layer

class Dense(Layer)
    def __init__(output_dim, input_dim=None, *args, **kwargs):
        self.input_dim = input_dim
        self.output_dim = output_dim
        self. ...
        ...

        self.initialize()

    def initialize(self):
        if self.input_dim is None:
             raise ValueError("Using lazy initialization, either define `input_dim` or set a `previous_layer`)
        else:
             self.W = self.init((self.input_dim, self.output_dim))
             self.b = shared_zeros((self.output_dim))

    def set_previous(self, layer)
            self.previous = layer
            if self.input_dim is None:
                self.input_dim = layer.output_dim
            elif self.input_dim != layet.output_dim:
                warn("Overwriting `input_dim` of layer {}. The user defined parameter differs from what we inferred".format(self.name))

This way, I would not be forced to define two layers (input and output) in the cases where I need only one of them. What do you think?

matsuyamax commented 8 years ago

This way, I would not be forced to define two layers (input and output) in the cases where I need only one of them. What do you think?

I agree, but your solution breaks compatibility too. You see what I mean?

We could do your solution, but then we would have to implement input shape management in every layer __init__. maybe it would be simpler to do like:

layer = Dense(output_dim)
layer.set_input_shape(input_shape)

or

layer = Dense(output_dim)
layer.input_shape = input_shape

Then we would have a single method common with all layer.

pranv commented 8 years ago

The major problem with shape inference is that the layer objects are created first and then added to the model. So, at initialization, the object has no idea whether it will receive an input dimension or should it raise an error.

The way I did this was to basically split up the current initialization method into two parts and calling the second part (the one where parameters are initialized) after the layer has been added to the model.

Backwards compatibility is possible, the code just gets ugly. input_dim will be the first argument with defaults. We would then have to swap the arguments cleverly. If input_dim is None, shape has to be inferred. If it's not None, it has been provided and the user view would have it as first argument. So you'd have to essentially rotate the first few arguments and input_dim right.

pranv commented 8 years ago

I also think a simplified and more intuitive way of applying loss functions, masks and weights would be welcome.

Loss functions could be reduced to layers.

EderSantana commented 8 years ago

TL;DR Maybe we should expose less OOP and makes things even easier to use.

Loss functions could be reduced to layers.

The more layers we introduce, the longer it will take to write a model and start training it. One of the things I like the most about Keras is that you can actually memorize all the steps to write a model, compile and run the experiment. If we introduce too much boilerplate, like a layer for input, a layer for the cost, lazy only initialization, etc. we are going to be just another Blocks and we will always have to go back to copy paste from the documentation to do anything. I'm not saying Blocks is bad, I used to contribute to it as well. I'm just saying that we don't need another Blocks.

What do you think? I honestly believe that this simple API is a huge PLUS about Keras. That is why it is more popular than Blocks or Lasagne which were started almost at the same time. I'm not saying we should just appeal to the masses, I'm saying that simple is the ultimate sophistication and I believe Keras did it and we shouldn't go back. Maybe the solution is in the direction of less OOP not even more OOP. I also think that progress means making things even easier to use, like an optional lazy initialization or easier to develop RNNs.

For the case of losses, the reshaping and masking should go inside the objective. This would give the user more control when necessary, without making to hidden. For example, the mse cost would just be

def mse(y_true, y_pred):
    return T.sqr(y_true.flatten(ndim=2) - y_pred.flatten(ndim=2)).mean(axis=-1)

In other words the reshaping goes inside the objective, and we never collapse data dimensions with the first samples dimensions. For masking we could have an extra optional parameter

def mse(y_true, y_pred, mask=None):
    if mask is None:
        return T.sqr(y_true.flatten(ndim=2) - y_pred.flatten(ndim=2)).mean(axis=-1)
    else
        cost = T.sqr(y_true.flatten(ndim=2) - y_pred.flatten(ndim=2)) * mask
        return cost.sum(axis=-1) / mask.sum(axis=-1)

where mask would have the same shape as the desired. Sorry for the long comment.

elanmart commented 8 years ago

I think loss layers are a great idea that would add flexibility add make it easier to implement things like hierarchal softmax or NCE cost. I believe they would also simplify masking.

fchollet commented 8 years ago

One of the things I like the most about Keras is that you can actually memorize all the steps to write a model, compile and run the experiment.

I agree. Being able to build networks by heart after using Keras a few times is proof that it does a good job at reducing cognitive load, and we want to keep that. Deep learning should be like playing with Duplo blocks and coloring with crayons.

We can back off from the idea of having Input layers, I agree they seemed quite a bit boilerplatey. But then what would we use?

I guess we could have an input argument input_shape and it could be handled by the super __init__ of the base layer class so we don't have to reimplement logic in every layer like @matsuyamax points out. I would be fine with that. Other ideas?

I think loss layers are a great idea that would add flexibility add make it easier to implement things like hierarchal softmax or NCE cost. I believe they would also simplify masking.

It don't think it's necessary to have object losses, because losses don't have a state, unlike optimizers and layers. Everything you could do with a loss layer, you should be able to do it with a loss function.

Also, if it's a loss layer then it's not a layer. Layers only access the previous layer's input.

For example, the mse cost would just be

Sure, but what of masking and weighting?

Maybe the solution is in the direction of less OOP not even more OOP.

OOP is a big part of what made Keras successful and it is the most appropriate paradigm for deep learning, where modules are stateful (e.g. layers).

EderSantana commented 8 years ago

@fchollet

But then what would we use?

We just have to enforce an input_dim in the first layer. Then everything else follows smoothly. Even when the other input_dim are being inferred. See my discussion with @matsuyamax above.

Sure, but what of masking and weighting?

I showed an example of how to do masking above. This does not need to be repeated on every layer necessarily. We could write a generic decorator that does the masking. If I didn't get my syntax wrong, it should be something like:

def objective_decorator(func):
     @wraps(func)
     def wrapper(y_pred, y_true, mask, sample_weight):
         y_pred = y_pred.flatten(ndim=2)
         y_true = y_true.flatten(ndim=2)
         cost = func(y_pred, y_true)
         cost = cost[sample_weight]
         if mask is None:
             return cost.mean(axis=-1)
         else:
             mask = mask[sample_weight]
             return cost.sum(axis=-) / (mask.sum(axis=-1) + 1e-7)
    return wrapper

With this decorator, we wouldn't even have to rewrite a lot in the objectives we already have. Usage would just be

@objective_decorator
def mse(y_pred, y_true)
    return T.sqr(y_pred - y_true)

OOP is a big part of what made Keras successful and it is the most appropriate paradigm for deep learning, where modules are stateful (e.g. layers).

I agree, we just don't have to force the user to always define classes to be able to use Keras. Sequential and Graph allow us to do that. We can create new cost functions and new inputs without having to define whole new class for it.

EderSantana commented 8 years ago

decorators are not super intuitive tho... Thoughts?

matsuyamax commented 8 years ago

Adding output shape inference to all layers has given me to occasion to dive deeper into the repo. Here are some thoughts about the code in general. I think the code has potential, but several issues need addressing.

All layers should have a set_name method, not just Dense.
core.RepeatVector should use broadcasting, currently it is copying input data and that is inefficient.
Same for convolutional.UpSample2D.
Every layer should specify input/output shape in docstring (like what keras.io does).
All layers need proper docstrings with descriptions of what the layer does, what the arguments are.
Online documentation should mirror docstrings. Otherwise the online documentation risks getting outdated and the docstrings risk not being informative enough (which they currently are).
Reshape should take a tuple as argument.
layers.core is too long. Should be broken down in a few files.
All layers should be importable from a single module, keras.layers.
Why is there a ZeroPadding2D layer but not ZeroPadding1D?
Small inconsistency in MaxPooling1D/2D: poolsize vs. pool_length. Should probably be pool_size.
Convolution1D and Convolution2D should use a stride argument instead of subsample (which is not standard).
Inconsistency in handling of get_output for Convolution1D and Convolution2D: Convolution2D uses a hack for cuDNN bot not Convolution1D.
Tests take too long to run. Most of them are not real unit tests and more like holistic tests. It's good to have both but the time impact of the latter should be minimized.

fchollet commented 8 years ago

We could write a generic decorator that does the masking. If I didn't get my syntax wrong, it should be something like

It would be a rather elegant solution. I'm all for it.

Small inconsistency in MaxPooling1D/2D: poolsize vs. pool_length. Should probably be pool_size.

I would be fine with such a change.

Reshape should take a tuple as argument.

Again, this is fine.

Why is there a ZeroPadding2D layer but not ZeroPadding1D

Feel free to add it.

mmmikael commented 8 years ago

A few comments:

Why is there a ZeroPadding2D layer but not ZeroPadding1D

Feel free to add it.

What about adding support for Convolution3D and ZeroPadding3D as well? I think I have this working in my fork.

In the same fork, I have added support to freeze layers. This now works for Graph as well and the trainable state can be serialized. Maybe this is something to include in the main Keras?

I would also suggest a better way of dealing with desired output shapes. Right now, Keras implicitly assumes the desired is either a matrix (samples x dim) or a tensor3 (samples x time x dim). I had an image as desired and it took me days (partially my bad though) to realize what was wrong, my (samples x rows x cols) was being reshaped as (samples-rows x cols) without warning. This is really dangerous since the method just compiled and ran like nothing was wrong.

I lost quite some time time on that one too. Reshaping should definitely be inside the objective function.

Convolution1D and Convolution2D should use a stride argument instead of subsample (which is not standard).

Another nice addition would be an upsample mode (or "deconvolution"). This is particularly useful for image segmentation (see here and here).

fchollet commented 8 years ago

Another nice addition would be an upsample mode (or "deconvolution").

We have UpSampling1D and 2D layers, I think this is what you are looking for.

In the same fork, I have added support to freeze layers. This now works for Graph as well and the trainable state can be serialized. Maybe this is something to include in the main Keras?

Possibly. You can submit a PR and I will review it.

What about adding support for Convolution3D and ZeroPadding3D as well? I think I have this working in my fork.

Certainly. We already have a PR with Convolution3D undergoing review, you might want to review it to make sure you're on the same page.

holderm commented 8 years ago

Ability to re-initialize layer weights without re-compiling a model.

I would love to see that. I would like to use Keras in for "online" predictions while training is done "offline". Right now I use the save / load model procedure as described in the FAQ but it takes rather long (approx 1 min) to load everything until a prediction on new data can be made. Maybe there is already a better way to do this?

fchollet commented 8 years ago

@holderm : you can already do so. Let's say you have a compiled model in production and you would like to update it (to sync it with an identical model that you've just finished training somewhere else). You can just dump the weights of your newly trained model to HDF5, and load them in your production model without recompiling, just via save_weights and load_weights. This takes a few hundred milliseconds for a large model.

holderm commented 8 years ago

Yes, that's true but assume training and prediction take place in different sessions. First one needs to compile the (previously trained) model and then predictions can be made. Once the model is compiled on the new seassion update of weights (load_weights) works fine, but it would be nice if there is a way to skip the compiling (similar to the discussion here: https://groups.google.com/forum/#!searchin/keras-users/load/keras-users/lKjA7qF3ctY/Mzc16Le4CwAJ )

mmmikael commented 8 years ago

Another nice addition would be an upsample mode (or "deconvolution").

We have UpSampling1D and 2D layers, I think this is what you are looking for.

I was more thinking of a straight "reverse" of the convolution in a single layer like the one in Caffe or MatConvNet. This is more efficient than upsampling + convolution.

fchollet commented 8 years ago

Yes, that's true but assume training and prediction take place in different sessions.

As long as you have a compiled model with the same architecture living in each session, you will be fine. But if you don't, the model re-initialization feature we will add is not going to help you anyway.

fchollet commented 8 years ago

Here is what I think we should go for the Graph API:

No more inputs and input arguments, instead, a single inputs argument that could be a string, a dictionary (multi-io layers), or a list of strings.
A single .add(name, layer, inputs, **kwargs) method. When no inputs argument is provided, the layer is assumed to be an input layer. If is_output is set to True, an output node is created.
let's go with the merge_mode keyword argument way of controlling input merging when the input is a list of layer names.

lemuriandezapada commented 8 years ago

Also untrainable outputs would be nice. Such as returning the output of intermediate layers in the predict() function but without having to actually define a 0 loss function for them and slow down the training procedure.

fchollet commented 8 years ago

Here is what I think we should go for the Graph API:

I am not so sure anymore about this. This will require more design work. We'll post-pone it for now.

pranv commented 8 years ago

If i recall, There was some issues about polymorphism right? With having input being a variety of data types

fchollet commented 8 years ago

If i recall, There was some issues about polymorphism right? With having input being a variety of data types

I think having input support a variety of input modes (name, list of names, dict {output_name: input_name}) would be a much better UX than having separate keyword args for all these.

The thing I am less sure of is the removal of add_input and add_output.

elanmart commented 8 years ago

@fchollet any new thoughts on the Graph API? Maybe input argument shouldn't have a default value, and input would be created only if None is passed?

Also, are there any plans to implement a base class for multi-io layers? I assume that if they were to be implemented in Keras, they would have an outputs property (being a list of Layers) just as Graph does?

keras-team / keras

Keras: near-future directions #754