keras-team / keras

Deep Learning for humans
http://keras.io/

Merge overlapping model with separate input #362

Closed navta closed 7 years ago

navta commented 9 years ago

Hi,

Related to #277, I'd like to create an overlapping model, but with inputs that come from different places. Is this supported? How can I do it? The following code gives me an error because of the unexpected input.

left = Sequential()
left.add(Dense(784, 50))
left.add(Activation('relu'))

model = Sequential()
model.add(Merge([left, left], mode='concat'))

model.add(Dense(100, 10))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

model.fit([X1_train, X2_train], Y_train, batch_size=128, nb_epoch=20)

TIA

fchollet commented 9 years ago

This should be achievable with a Graph model: http://keras.io/models/#graph

navta commented 9 years ago

I tried the Graph model with the following code, but it wasn't successful:

shareddense = Dense(748,10)
graph = Graph()
graph.add_input(name='input1', ndim=2)
graph.add_input(name='input2', ndim=2)
graph.add_node(shareddense, name='dense1', input='input1')
graph.add_node(shareddense, name='dense2', input='input2')
graph.add_output(name='output1', input='dense1')
graph.add_output(name='output2', input='dense2')
graph.compile('rmsprop', {'output1':'mse','output2':'mse'})

The error said:

ValueError: ('this shared variable already has an update expression', (dense2_W, GpuFromHost.0))

How can I use a shared node in a Graph?

fchollet commented 9 years ago

I'll look into it.

fchollet commented 9 years ago

Having thought about it, I think doing something like that is conceptually problematic. If two nodes are processing different data streams, then they are not the same node. So what you want is weight mirroring, not node duplication. Right?

navta commented 9 years ago

Yes, that is correct. Is there a way to do the weight mirroring?

fchollet commented 9 years ago

> Yes, that is correct. Is there a way to do the weight mirroring?

There is no built-in way to do it (at this point). One hack to achieve it would be to do batch-by-batch training and manually set the weights of the second "shared" layer after each batch.

graph = Graph()
graph.add_input(name='input1', ndim=2)
graph.add_input(name='input2', ndim=2)
graph.add_node(Dense(748, 10), name='dense1', input='input1')
graph.add_node(Dense(748, 10), name='dense2', input='input2')
graph.add_output(name='output', inputs=['dense1', 'dense2'])
graph.compile('rmsprop', {'output':'mse'})

for X1_batch, X2_batch, y_batch in generator():
    loss = graph.train_on_batch({'input1':X1_batch, 'input2':X2_batch, 'output':y_batch})
    graph.nodes['dense2'].set_weights(graph.nodes['dense1'].get_weights())  # copy the Dense weights (the input nodes themselves have no weights)

(untested)

fchollet commented 9 years ago

Note that maybe a more elegant way to do it would be to replace the weights in both layers with the average of both weight matrices (again, after each batch).
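
A minimal sketch of that averaging variant, reusing the Graph and generator() from the snippet above (untested; same old Graph API, and get_weights() is assumed to return a list of arrays per layer):

for X1_batch, X2_batch, y_batch in generator():
    loss = graph.train_on_batch({'input1': X1_batch, 'input2': X2_batch, 'output': y_batch})
    w1 = graph.nodes['dense1'].get_weights()   # e.g. [W, b] for a Dense layer
    w2 = graph.nodes['dense2'].get_weights()
    avg = [(a + b) / 2.0 for a, b in zip(w1, w2)]
    graph.nodes['dense1'].set_weights(avg)     # both layers now hold the averaged weights
    graph.nodes['dense2'].set_weights(avg)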

If this solved your problem, I'll close the issue.

navta commented 9 years ago

Sorry for the delay. The problem is that I need to keep the outputs separate, so I can jointly train the network when I add more nodes on top of it. For example, a network like the following:

shareddense = Dense(748,10)
graph = Graph()
graph.add_input(name='input1', ndim=2)
graph.add_input(name='input2', ndim=2)
graph.add_node(shareddense, name='dense1', input='input1')
graph.add_node(shareddense, name='dense2', input='input2')
graph.add_node(Dense(20, 20), name='dense3', inputs=['dense1', 'dense2'], merge_mode='concat')
graph.add_output(name='output', input='dense3')
graph.compile('rmsprop', {'output': 'mse'})

lemuriandezapada commented 9 years ago

Is there no way he can just use a convolution layer and then flatten the output? If it's weight sharing you're after, convolution should do it. I think there was a 1D convolution somewhere.

ilthigore commented 8 years ago

Can you elaborate? I think I need something similar and can't figure out how to do it.

Ideally I want to take a kN-dimensional input vector, split it into k length-N pieces, apply an N x m matrix A to each piece individually, and then concatenate to get a length-km vector. The caveat is that I want the matrix A to be learned by backpropagation, rather than some larger (kN) x (km) matrix. Perhaps this is achievable with convolution layers, but I'm very new to this game, so any advice would be great.
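
For what it's worth, here is a small NumPy sketch of the equivalence being described (not Keras code; k, N, m and A are illustrative):

import numpy as np

k, N, m = 4, 5, 3
x = np.random.randn(k * N)            # the kN-dimensional input
A = np.random.randn(N, m)             # the single shared N x m matrix

# Shared-weight version: reshape into k pieces, apply A to each, flatten back.
shared_out = x.reshape(k, N).dot(A).reshape(k * m)

# Equivalent "big matrix" version: a block-diagonal (kN) x (km) matrix
# whose k diagonal blocks are all copies of A.
big = np.kron(np.eye(k), A)
big_out = x.dot(big)

assert np.allclose(shared_out, big_out)

In the Keras of that era, if I remember the layers correctly, TimeDistributedDense (or a length-1 1D convolution) applied to the input reshaped to (k, N) computes exactly this per-piece product with a single learned A.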

andychisholm commented 8 years ago

I too need this kind of parallel/weight sharing model.

I've managed to get something working based on the suggestion above of simply averaging the weights after each batch. I couldn't get it working with batch-by-batch training, but I've had some success implementing a callback:

import numpy
from keras.callbacks import Callback

class WeightSharing(Callback):
    def __init__(self, shared):
        self.shared = shared  # names of the graph nodes whose weights are tied
        super(WeightSharing, self).__init__()

    def on_batch_end(self, batch, logs={}):
        # Average the weight lists of all shared nodes, then write the average back to each.
        weights = numpy.mean([self.model.nodes[n].get_weights() for n in self.shared], axis=0)
        for n in self.shared:
            self.model.nodes[n].set_weights(weights)

Then you just duplicate the models/layers within a graph you'd like to share. E.g. adapting the example above:

graph = Graph()
graph.add_input(name='input1', ndim=2)
graph.add_input(name='input2', ndim=2)
graph.add_node(Dense(748, 10), name='dense1', input='input1')
graph.add_node(Dense(748, 10), name='dense2', input='input2')
graph.add_node(Dense(20, 20), name='dense3', inputs=['dense1', 'dense2'], merge_mode='concat')
graph.add_output(name='output', input='dense3')
graph.compile('rmsprop', {'output': 'mse'})
graph.fit(..., callbacks=[WeightSharing(['dense1', 'dense2'])])

This works, but it doesn't feel like a great solution. It's also quite slow - around 5x slower per epoch in my experiments (with a GPU). I'd be interested to know if there's a better way to support this.

pranv commented 8 years ago

The params of any practical net add up to a large chunk of data, and moving that around will obviously slow down overall training.

A faster solution could be to change the Layer API itself, but that again is not a great solution.

andychisholm commented 8 years ago

A more efficient approach, which I think is roughly equivalent, is to average the gradients.

I.e. I've subclassed an optimizer and overridden get_gradients(), setting the entries corresponding to the shared models to be the mean over all the shared models' gradient expressions. Assuming the weights are initialised the same, the updates will be the same and the weights stay tied. I just haven't thought too deeply about how this might impact optimisation.
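
For reference, a rough sketch of that gradient-averaging idea (untested; it assumes the optimizer API of that era, where get_gradients(loss, params) returns one gradient expression per parameter, and a hypothetical shared_groups argument listing which parameter indices are tied):

from keras.optimizers import RMSprop

class SharedGradRMSprop(RMSprop):
    def __init__(self, shared_groups, **kwargs):
        # shared_groups: list of index tuples into the params list,
        # e.g. [(0, 2), (1, 3)] to tie dense1's W/b to dense2's W/b.
        self.shared_groups = shared_groups
        super(SharedGradRMSprop, self).__init__(**kwargs)

    def get_gradients(self, loss, params):
        grads = super(SharedGradRMSprop, self).get_gradients(loss, params)
        for group in self.shared_groups:
            avg = sum(grads[i] for i in group) / float(len(group))
            for i in group:
                grads[i] = avg  # identical gradients -> identical updates, so the weights stay tied
        return grads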

soumith commented 8 years ago

In my opinion, this is something that needs deeper thought and more attention.

Having the ability to share the weight pointers across layers (and across other Keras models) will unlock, among other things:

and opens up flexibility without ugliness, which I presume we both like (Torch and Keras).

pranv commented 8 years ago

@soumith, as I have stated above, I believe this would need significant changes throughout the library. It will be a major undertaking.

soumith commented 8 years ago

@pranv cool, I did not know that. I am just a spectator offering opinions (because they are free, ha!); the development decisions and priorities are better taken by members like you and @fchollet ...

fchollet commented 8 years ago

> and opens up flexibility without ugliness, which I presume we both like (Torch and Keras).

I agree that it would be very useful (in fact necessary) for certain types of models. It should definitely be supported in the future.

The reason that's not already the case is that there are fundamental (i.e. Theano-induced) issues with reusing layers with different inputs. It's not clear whether we can achieve it without explicit copying or continuous synchronizing of the weights, but we will be looking into it.

pranv commented 8 years ago

@soumith, but this is a really important thing, as you have stated. Thanks for bringing it up here!

@fchollet makes the best and final decisions. Eager to see his view.

benjaminklein commented 8 years ago

Can this be solved by allowing a different batch size for different layers, and having merge and split work on the batch dimension as well?

Then we could feed a layer 2n samples, use a split to create two batches (each with its own n items), and each one would then be processed by other layers...
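
As a plain NumPy illustration of that idea (not Keras code; the shapes are made up), stacking the two streams along the batch axis means a single layer, and therefore a single weight matrix, processes both:

import numpy as np

n, d, h = 32, 748, 10
X1 = np.random.randn(n, d)
X2 = np.random.randn(n, d)
W = np.random.randn(d, h)               # the one shared weight matrix

stacked = np.vstack([X1, X2])           # feed 2n samples through the same layer
out = np.maximum(stacked.dot(W), 0)     # e.g. a ReLU dense layer
out1, out2 = out[:n], out[n:]           # split back into the two streams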

Tuyki commented 8 years ago

I have a rather quick-and-dirty solution that at least SEEMS to work... Defining a new Dense layer with:

# This overrides get_output() on a subclass of Dense; T is theano.tensor.
def get_output(self, train=False):
    X = self.get_input(train)
    X0 = X[:, 0:self.dim]        # first input: columns [0, dim)
    X1 = X[:, self.dim:]         # second input: columns [dim, end)
    output0 = self.activation(T.dot(X0, self.W))  # the same shared W is applied to both halves
    output1 = self.activation(T.dot(X1, self.W))
    return T.concatenate([output0, output1], axis=1)

I concatenate both input vectors into one matrix, split it in two here, and multiply each half by the shared weight matrix; self.dim defines the splitting point. Is this correct / rational? I'm rather new to Keras and Theano, but I really love it and see great potential here.

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.