code-ball closed this issue 8 years ago.
Is it possible to provide two separate inputs (not merged) to an RNN layer so that each input has its own weight matrix for computing the output?
Could you clarify what you mean by this? Do you want two separate RNNs, or two separate inputs into the same RNN?
If you want the two inputs to have their own RNN weight matrices, you should create two RNN layers; you can merge their outputs later on if you need to. If you want to use the same weight matrices but with two separate, unmerged inputs, you can use the functional API, something like:
input_a = Input(input_specs)   # e.g. Input(shape=(time_steps, features))
input_b = Input(input_specs)
rnn = RNN(rnn_specs)           # a single layer instance, so a single set of weights
output_a = rnn(input_a)
output_b = rnn(input_b)
If you wanted input_b to go through an embedding layer in this case, you would have to make sure that the embedding dimensions are the same as the dimensions of input_a (specified in rnn_specs).
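For example, a minimal sketch of that variant, using LSTM as a concrete stand-in for the generic RNN above (the sequence length, feature size, vocabulary size, and unit count below are placeholders):

from keras.layers import Input, Embedding, LSTM

input_a = Input(shape=(100, 50))                    # sequences of 50-dimensional feature vectors
input_b = Input(shape=(100,), dtype='int32')        # sequences of integer indices
embedded_b = Embedding(input_dim=1000, output_dim=50)(input_b)  # output_dim matches input_a's 50 features

shared_rnn = LSTM(64)              # one layer instance, so one set of weights
output_a = shared_rnn(input_a)
output_b = shared_rnn(embedded_b)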
@codekansas: Thank you so much for your response.
I understand your suggestions, but my question pertains more to your first point. I will provide an example here (sorry for not including it earlier).
Example: consider two inputs, a and b.
a is an n-dimensional vector embedded into a k-dimensional vector (call it a_em). b is 1-dimensional and so is not embedded.
I want to provide both of these inputs to the same recurrent layer so that I get the following formulation:
h(t) = P_a * a_em + P_b * b + P_h * h(t-1) + b, where P_a and P_b are different parameter matrices.
I hope this helps.
Thanks
You might have to write a custom layer to do that. Is the last b supposed to be an input or the bias term? If it is just a bias, then K.dot(P_a, a_em) + K.dot(P_b, b) is equivalent to K.dot(P_c, K.concatenate([a_em, b])). So if you merge your inputs a_em and b before the RNN layer, you will get this behavior. Make sure to merge them along the correct axis. If it is not a bias, then you will have to write a custom layer.
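Concretely, a sketch of that merge-before-the-RNN approach with the Keras 1 API used in this thread (all sizes are placeholders):

from keras.layers import Input, Embedding, SimpleRNN, merge

time_steps, vocab_size, k = 20, 1000, 28              # placeholder sizes

a = Input(shape=(time_steps,), dtype='int32')         # sequence of indices
a_em = Embedding(input_dim=vocab_size, output_dim=k)(a)   # -> (batch, time_steps, k)
b = Input(shape=(time_steps, 1))                      # one extra scalar feature per step

# concatenate along the feature axis so each step sees [a_em ; b]
merged = merge([a_em, b], mode='concat', concat_axis=-1)  # -> (batch, time_steps, k + 1)

# a single recurrent layer: its input weight matrix has shape (k + 1, units),
# so its first k rows play the role of P_a and the last row the role of P_b
h = SimpleRNN(64)(merged)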
@codekansas: Thanks again. This is shaping up to be very helpful and closer to what I want. Yes, the last b is the bias term; sorry for abusing the notation.
About the merge part: once I merge an embedded input and a normal input, does the RNN by itself learn separate weights for both of them? Or are you saying something like this: if after the merge my input dimension is 30 (28 embedded + 2 from the other input), the last two rows of the learned weight matrix will belong to those inputs?
Also, one thing which I was able to solve only with Reshape, and which seems very crude: how do I merge an embedded input and a single-value input?
Here is an example of what I am trying to do:
a = Input(shape=(1000,), dtype='int32', name='a')
a_em = Embedding(output_dim=60, input_dim=1000, input_length=1)(a)  # embedded input
b = Input(shape=(1,), dtype='float32', name='b')                    # non-embedded single-value input
merged = merge([a_em, b], mode='concat', concat_axis=-1)            # this line gives an error
Is my concat axis wrong, or do I need some other operation? Also, is the 'concat' mode right, or do I need 'sum'?
Thanks
In your example from earlier, if P_a is an a_in x out_dim matrix and P_b is a b_in x out_dim matrix, then P_c will be an (a_in + b_in) x out_dim matrix. Either way, you end up with out_dim dimensions, whether you compute a x P_a + b x P_b or (a, b) x P_c.
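A quick numerical check of that identity (the sizes below are arbitrary):

import numpy as np

a_in, b_in, out_dim = 28, 2, 64
a = np.random.rand(a_in)
b = np.random.rand(b_in)
P_a = np.random.rand(a_in, out_dim)
P_b = np.random.rand(b_in, out_dim)

# stacking the two matrices gives the single (a_in + b_in) x out_dim matrix P_c
P_c = np.vstack([P_a, P_b])

print(np.allclose(a.dot(P_a) + b.dot(P_b),
                  np.concatenate([a, b]).dot(P_c)))   # True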
You should look at the documentation for Embedding, though. I think you're misunderstanding how it is used. The embedding layer is used to turn a time-distributed series of indices into a time-distributed series of vectors, i.e. (batch_size, time_steps) -> (batch_size, time_steps, output_dim). In your case, time_steps equals 1000, the shape parameter of Input a. So you're trying to merge a matrix of size (time_steps, output_dim) with a vector of size 1 (your input b), and it doesn't work.
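For example, a minimal illustration of the shapes involved (the vocabulary size 5000 below is arbitrary):

from keras.layers import Input, Embedding

a = Input(shape=(1000,), dtype='int32')             # 1000 time steps, one index per step
a_em = Embedding(input_dim=5000, output_dim=60)(a)  # input_dim is the vocabulary size
# a_em now has shape (batch_size, 1000, 60): one 60-dimensional vector per time step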
However, I don't think this should really be an issue. If you're looking to ask general questions about how to do something, try StackOverflow. But also read the documentation and try to get a better feel for what you're doing.
Hi,
I have the same request, but in my case the second term, which turned out to be a bias for codeball, is supposed to be another input! Is it possible for me to do this in any way?
It is odd that you would have an input without multiplying by a matrix. In that case, I don't think there is an easy way to implement it without writing a custom layer.
I am trying to implement a latent-variable RNN, so there is one dimension along time and another along the latent variable.
I thought maybe, if I could allow the dense layer to increase the number of dimensions, I could implement this by transforming my inputs with two dense layers, followed by a merge layer to add them. But the dense layer only outputs the same number of dimensions. :(
I wrote something similar here for an LSTM which might be helpful to look at
Alright, thanks! I am implementing a new dense layer that outputs more dimensions using tensordot.
Hopefully I don't mess up the dimensions! It would be really helpful if you guys could add support for multiple inputs to a layer without having to concat them ^_^
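A rough sketch of that idea (the sizes below are placeholders): a Dense followed by a Reshape produces a higher-rank output and is equivalent to a tensordot against a reshaped weight tensor, so a custom layer may not even be needed.

from keras.layers import Input, Dense, Reshape

in_dim, rows, cols = 32, 10, 8        # placeholder sizes

x = Input(shape=(in_dim,))
h = Dense(rows * cols)(x)             # flat projection with a (in_dim, rows * cols) weight matrix
y = Reshape((rows, cols))(h)          # -> (batch, rows, cols)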
@codekansas: Thanks for the great help. Actually, I understand the Embedding layer, but my problem was adding a simple input alongside the embedding layer. As I had mentioned, it can be done with a Reshape applied to the simple input layer. So from the embedding you get an output of shape (batch_size, time_steps, output_dim), and with Reshape you turn the normal input into a similar form. In my case it happens to be (None, 1, 60) for the embedding and (None, 1, 1) for the input, which can now be merged using your suggestion.
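For reference, the pattern looks roughly like this with the Keras 1 API used above (the shapes match my example):

from keras.layers import Input, Embedding, Reshape, merge

a = Input(shape=(1,), dtype='int32', name='a')                      # a single index
a_em = Embedding(input_dim=1000, output_dim=60, input_length=1)(a)  # -> (None, 1, 60)

b = Input(shape=(1,), dtype='float32', name='b')                    # a single scalar
b_r = Reshape((1, 1))(b)                                            # -> (None, 1, 1)

merged = merge([a_em, b_r], mode='concat', concat_axis=-1)          # -> (None, 1, 61)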
@karishmamalkan: I think your case is slightly different from mine in that you have one time dimension and another latent (embedding) variable. Have you tried using a TimeDistributed layer for the time dimension and feeding just the latent part to the RNN? If your time dimension consists of actual time values that you want inside your model, you need to implement a new RNN layer. I am not sure how the Dense layer alone will help. If you can explain your model with an example, it would be easier to answer your question.
But I agree - support for separate, non-merged inputs is really needed.
Thanks
@codekansas Hi, I want to use the same weight matrices but with two separate RNNs. However, I don't know the mathematical theory behind a shared RNN. Which paper can I refer to? Many thanks!
input_a = Input(input_specs)
input_b = Input(input_specs)
rnn = RNN(rnn_specs)
output_a = rnn(input_a)
output_b = rnn(input_b)
It depends on what you're trying to do; there's an example in the Functional API docs here about figuring out how similar two tweets are.
@codekansas Yeah, I know this functional API, but I want to know the detailed mathematical theory. Is the example from some paper?
I'm not sure. This paper uses them; I'm not sure if it's what you're looking for, though.
@codekansas Thanks so much for your help! This paper is just what I want. If there are more similar papers to reference, that would be perfect. I don't know which keywords I should type into a search engine to get relevant references, though I have tried "shared deep neural network". To return to the subject, I am studying the example about figuring out how similar two tweets are, using the code posted in the Keras documentation (below). Is the output predictions the similarity of each tweet pair? Is this a regression task? If so, what is the shape or type of the labels? The pred_proba printed by the code below looks like:
[[ 0.50720322]
 [ 0.51244992]
 [ 0.51994187]
 [ 0.51702315]
 [ 0.4999733 ]
 [ 0.51310039]
 [ 0.50785744]
 [ 0.51659429]
 [ 0.5093168 ]]
Am I wrong? Any opinions would be appreciated!
tweet_a = Input(shape=(140, 256))
tweet_b = Input(shape=(140, 256))
# this layer can take as input a matrix
# and will return a vector of size 64
shared_lstm = LSTM(64)
# when we reuse the same layer instance
# multiple times, the weights of the layer
# are also being reused
# (it is effectively *the same* layer)
encoded_a = shared_lstm(tweet_a)
encoded_b = shared_lstm(tweet_b)
# we can then concatenate the two vectors:
merged_vector = merge([encoded_a, encoded_b], mode='concat', concat_axis=-1)
# and add a logistic regression on top
predictions = Dense(1, activation='sigmoid')(merged_vector)
# we define a trainable model linking the
# tweet inputs to the predictions
model = Model(input=[tweet_a, tweet_b], output=predictions)
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit([train_a, train_b], labels, nb_epoch=10)
pred_proba = model.predict([test_a,test_b],batch_size=batch_size)
print (pred_proba)
I'm not sure what other papers use this approach. Question answering seems to be a common use case, you might try looking for things along that line. I believe Siamese networks also use a similar concept.
I think the code you wrote works. I'll give you an overview of a question answering model.
One thing many papers do is optimize a cosine-similarity objective between the output vectors. This isn't very natural in Keras (I think the Siamese network might work? I'm not sure about its current state), but using a merge layer can get the right behavior. In your example:
merged_vector = merge([encoded_a, encoded_b], mode='cos', dot_axes=-1)
If you want a different similarity metric besides cosine similarity, you can pass a lambda function to mode. With question answering, a couple of papers have tried to minimize a hinge loss, which is
loss = max(0, m - cos(q, a_pos) + cos(q, a_neg))
m: margin
cos: cosine similarity
q: question vector
a_pos: positive training example
a_neg: negative training example
In Keras, you could do this like:
a_pos_sim = merge([q, a_pos], mode='cos', dot_axes=-1)
a_neg_sim = merge([q, a_neg], mode='cos', dot_axes=-1)
merged_output = merge([a_pos_sim, a_neg_sim],
                      mode=lambda x: K.maximum(1e-6, m - x[0] + x[1]),
                      output_shape=lambda x: x[0])
Margins are usually around 0.01 - 0.2. Then, to minimize the hinge loss, you need to pass a custom loss function when you compile your model that simply returns the predicted hinge value:
model.compile(loss=lambda y_true, y_pred: y_pred)
Your prediction model should have inputs q and a_pos and output a_pos_sim.
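Putting those pieces together, a rough, untested sketch of the training and prediction models (the input shapes, margin, and optimizer are placeholders, and a single shared LSTM encodes both questions and answers here):

from keras.layers import Input, LSTM, merge
from keras.models import Model
from keras import backend as K

margin = 0.1                                  # placeholder margin

q_in     = Input(shape=(20, 100))             # placeholder (time_steps, features)
a_pos_in = Input(shape=(20, 100))
a_neg_in = Input(shape=(20, 100))

encoder = LSTM(64)                            # shared encoder
q, a_pos, a_neg = encoder(q_in), encoder(a_pos_in), encoder(a_neg_in)

a_pos_sim = merge([q, a_pos], mode='cos', dot_axes=-1)
a_neg_sim = merge([q, a_neg], mode='cos', dot_axes=-1)
hinge = merge([a_pos_sim, a_neg_sim],
              mode=lambda x: K.maximum(1e-6, margin - x[0] + x[1]),
              output_shape=lambda x: x[0])

# training model: the network itself computes the hinge value, and the compile
# loss just passes it through, so y_true is ignored (fit with dummy labels)
train_model = Model(input=[q_in, a_pos_in, a_neg_in], output=hinge)
train_model.compile(optimizer='adam', loss=lambda y_true, y_pred: y_pred)

# prediction model: score a single question/answer pair with the trained weights
predict_model = Model(input=[q_in, a_pos_in], output=a_pos_sim)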
I should warn you that I haven't tested any of this code, it's just the general idea of how to do this type of thing. I've written some code that actually does this here.
Thanks @codekansas, I'll try out your code.
I hope you'll all excuse me for bumping this year-old closed ticket, but I've still got a point of confusion about training Siamese networks that isn't addressed by the Tweet examples and it seems best suited for this discussion.
If the two input strings vary in length between examples and differ in length from each other, how does one specify the shape of the tensor? It doesn't seem to make sense to set a fixed number of steps, as the label may not apply if only the first ten characters are ingested. Additionally, one input may terminate long before the other. This doesn't seem possible with the given APIs, or it might require repeatedly unrolling graphs of varying sizes.
What I have so far is just this. Nothing complicated:
import tensorflow as tf
from keras.layers import Input, LSTM, LeakyReLU, Lambda, Dense
from keras.models import Model

def abs_merge(x):
    return tf.abs(x[0] - x[1])

def abs_merge_output_shape(shapes):
    return shapes[0]

def build_model(hidden_size=128, compile=True):
    # UNROLLED_SIZE and CHAR_INPUTS are constants defined elsewhere
    xA = Input(shape=(UNROLLED_SIZE, CHAR_INPUTS))
    with tf.name_scope("encoder"):
        l1 = LSTM(hidden_size, return_sequences=False, name="l1_lstm")
        l1a_res = LeakyReLU()(l1(xA))
    xB = Input(shape=(UNROLLED_SIZE, CHAR_INPUTS))
    with tf.name_scope("encoder"):
        l1b_res = LeakyReLU()(l1(xB))
    # the absolute difference is currently taken on the raw inputs, not on l1a_res / l1b_res
    diff = Lambda(abs_merge, output_shape=abs_merge_output_shape)([xA, xB])
    res = Dense(1, activation=None)(diff)
    model = Model(inputs=[xA, xB], outputs=res)
    if compile:
        model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
shape=(UNROLLED_SIZE, CHAR_INPUTS) doesn't seem like the right way to go. I'd love to just be able to provide a long list of pairs of one-hot encoded characters and their match/non-match labels.
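One possible way to handle the varying lengths, keeping the same overall structure (CHAR_INPUTS and the layer sizes below are placeholders): declare the time dimension as None so each input can be padded to its own length per batch. The sketch below also compares the two LSTM encodings rather than the raw inputs.

import keras.backend as K
from keras.layers import Input, LSTM, LeakyReLU, Lambda, Dense
from keras.models import Model

CHAR_INPUTS = 96                       # size of the one-hot character alphabet (placeholder)

def abs_merge(x):
    return K.abs(x[0] - x[1])

# None in the time dimension lets each input (and each batch) use its own padded length
xA = Input(shape=(None, CHAR_INPUTS))
xB = Input(shape=(None, CHAR_INPUTS))

encoder = LSTM(128, name="l1_lstm")    # shared encoder for both strings
hA = LeakyReLU()(encoder(xA))
hB = LeakyReLU()(encoder(xB))

diff = Lambda(abs_merge, output_shape=lambda shapes: shapes[0])([hA, hB])
out = Dense(1, activation='sigmoid')(diff)

model = Model(inputs=[xA, xB], outputs=out)
model.compile(optimizer='adam', loss='binary_crossentropy')

Within a batch the sequences still need padding to a common length; since an all-zero step is never a valid one-hot character, a Masking layer with mask_value=0.0 in front of the LSTM would keep it from reading the padded steps.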
Hi,
I have two questions related to the input requirements for an RNN layer.
1.) Is it possible to provide two separate inputs (not merged) to an RNN layer so that each input has its own weight matrix for computing the output?
2.) If yes, is it possible to provide one input as embedded and the other as a normal 1-D feature without embedding?
If the answer to either is no, what would be the best way to implement it? Will adding a new RNN layer suffice, or does the training/optimization part also need to be modified?
Thanks